Analysis of 10x genomics multiome data using sincei#

Below, we demonstrate how we can use sincei to explore the scRNA-seq and scATAC-seq data from the 10x multiome protocol. The 10x multiome kit allows joint profiling of single-cell ATAC-seq and RNA-seq from single-cells. Here, we will analyze these two data sets separately. We will use the dataset published in Persad et. al. (2023), which profiles CD34+ cells from human bone marrow.

1. Download and process the dataset#

The raw fastq files were downloaded from GEO using sra-tools and processed using the standard 10x genomics cellranger-arc workflow.

# Download using sra-tools
prefetch --force all --max-size 10t \
   -O GSM6005302_CD34_Rep1_Multiome_RNA_Homo_sapiens_RNA-Seq \
   SRR18590099 SRR18590100
cd GSM6005302_CD34_Rep1_Multiome_RNA_Homo_sapiens_RNA-Seq
vdb-validate SRR18590099 SRR18590100

fasterq-dump -m 10000MB -b 100MB -c 1000MB \
    --temp /<path>/<to>/tempdir \
    -e 20 -p --split-files \
    --include-technical \
    -O . SRR18590099 SRR18590100

# Run cellranger-arc
for rep in rep1 rep2
do
    cellranger-arc count --disable-ui --reference \
        /<path>/<to>/cellranger_indices/refdata-cellranger-arc-GRCh38-2020-A-2.0.0/ \
        --localcores 20 --localmem 200 --id output_${rep} \
        --libraries cellranger_samplesheet_${rep}.csv
done

Below is the structure of samplesheet that was used for cellranger:

fastqs,sample,library_type
/path/to/10x_multiome/fastq,CD34_Rep1_Multiome_ATAC,Chromatin Accessibility
/path/to/10x_multiome/fastq,CD34_Rep1_Multiome_RNA,Gene Expression

Below is the structure of the output directory from the workflow:

<output_di>/outs:
├── analysis
├── atac_cut_sites.bigwig
├── atac_fragments.tsv.gz
├── atac_fragments.tsv.gz.tbi
├── atac_peak_annotation.tsv
├── atac_peaks.bed
├── atac_possorted_bam.bam
├── atac_possorted_bam.bam.bai
├── cloupe.cloupe
├── filtered_feature_bc_matrix
├── filtered_feature_bc_matrix.h5
├── gex_molecule_info.h5
├── gex_possorted_bam.bam
├── gex_possorted_bam.bam.bai
├── per_barcode_metrics.csv
├── raw_feature_bc_matrix
├── raw_feature_bc_matrix.h5
├── summary.csv
└── web_summary.html

We will use the gex_possorted_bam.bam for gene-expression data and atac_possorted_bam.bam for chromatin accessibility analysis using sincei. These files can also be produced as part of the cellranger count workflow for scRNA-seq or scATAC-seq data alone. For convenience, we provide a subset of this data here.

mkdir 10x_multiome_testdata
cd 10x_multiome_testdata
wget -O 10x_multiome_testdata.tar.gz https://figshare.com/ndownloader/files/60325697
tar -xvzf 10x_multiome_testdata.tar.gz ## releases 4 (indexed) bam files, 2 metadata files and 1 bed file.

rm 10x_multiome_testdata.tar.gz & cd ../ #cleanup

(optional) pre-filtering of barcodes#

Most of the cell barcodes from droplet-based protocols (like 10x genomics) do not contain cells. Therefore, they have very low counts. These must be filtered away at the beginning of the analysis. Although the cellranger pipeline already provides a list of filtered barcodes, sincei also allows you to extract per barcode count distributions, indicating which barcodes should be removed. This can be done using the scFilterBarcodes tool.

mkdir -p sincei_output/atac sincei_output/rna
mkdir -p sincei_output/atac sincei_output/atac

barcodes=10x_barcodes_tutorial.txt
for rep in rep1 rep2
do
    # scATAC-seq
    bamfile=cellranger_output_${rep}_atac_possorted_bam.bam
    scFilterBarcodes -p 8 -b ${bamfile} \
    -o sincei_output/atac/atac_barcodes_${rep}.tsv \
    --minCount 100 --minMappingQuality 10 --cellTag CB \
    --rankPlot sincei_output/atac/barcode_rankplot_atac_${rep}.png

    # filter passing ATAC barcodes
    awk '$4 == "True" { print $2 }' sincei_output/atac/atac_barcodes_${rep}.tsv \
        > sincei_output/atac/atac_barcodes_${rep}_filtered.txt

    # scRNA-seq
    bamfile=cellranger_output_${rep}_gex_possorted_bam.bam
    scFilterBarcodes -p 8 -b ${bamfile} \
        -o sincei_output/rna/rna_barcodes_${rep}.tsv \
        --minCount 100 --minMappingQuality 10 --cellTag CB \
        --rankPlot sincei_output/rna/barcode_rankplot_rna_${rep}.png

    # filter passing RNA barcodes
    awk '$4 == "True" { print $2 }' sincei_output/rna/rna_barcodes_${rep}.tsv \
        > sincei_output/rna/rna_barcodes_${rep}_filtered.txt

done

The above example uses a subset of the whitelist of possible barcodes from the cellranger-arc workflow. See here for more details. Providing a whitelist is optional in general, but recommended for 10x genomics data and other droplet protocols in general.

The output file is a .tsv file with log10 barcode counts on each barcode and whether it contains counts in at least --minCount regions of the genome. Unlike other tools with similar options, sincei splits the data in 100kb bins and reports whether or not a barcode has signal in those bins. This way, barcodes with high counts, but present in only one genomic bin can also be filtered out. In most cases, the output is same as the usual approach of filtering by total counts. The --rankPlot argument produces the familiar knee-plot of barcode counts.

We need a single-column file with the barcodes for later steps in the analysis. We can filter the barcodes that pass the filtering criteria using simple command-line tools like awk as shown above.

2. scATAC-seq analysis#

Follow this tutorial to learn how to analyze the scATAC-seq samples of the data we just processed

3. scRNA-seq analysis#

Follow this tutorial to learn how to analyze the scRNA-seq samples of the data we just processed.

4. Combined analysis of scATAC-seq and scRNA-seq data#

After analyzing the scATAC-seq and scRNA-seq data separately, we can combine the two data modalities for a joint analysis. scCombineCounts can be used to combine the AnnData objects produced in the two tutorials above into a MuData object that contains both modalities.

scCombineCounts \
-i sincei_output/atac/scCounts_atac_peaks_clustered.h5ad \
sincei_output/rna/scCounts_rna_genes_clustered.h5ad \
-o sincei_output/scCounts_10x_multiome_clustered.h5mu \
--method multi-modal \
--labels atac rna

This object can be used for further processing with other software like muon. ..

This object can then be used with our modules/multimodalClustering python module to perform joint clustering of the two data types. Follow this tutorial to learn how to perform joint clustering of multimodal data using sincei.

Notes#

Currently, sincei doesn't provide a method for doublet estimation and removal, which is an important step in the analysis of droplet-based data. Instead, we use simpler filters of min and max number of detected features per cell, which, to some extent mitigates this issue. However, this could lead to some differences in results compared to the published data in used here. Despite this difference, the major published cell types can be separated with sincei for both ATAC and RNA fraction of the data, as shown in the 2 tutorials above.