Analysis of 10x genomics multiome data using sincei

Below, we demonstrate how to use sincei to explore the scRNA-seq and scATAC-seq data produced by the 10x multiome protocol. The 10x multiome kit allows joint profiling of chromatin accessibility (ATAC-seq) and gene expression (RNA-seq) from the same single cells. Here, we will analyse these two data sets separately. We will use the dataset published with Persad et al. (2023), which profiles CD34+ cells from human bone marrow.

1. Download and process the dataset

The raw fastq files were downloaded from GEO and processed using the standard 10x genomics cellranger-arc workflow. Below is the structure of the output directory from the workflow:

<output_dir>/outs:
├── analysis
├── atac_cut_sites.bigwig
├── atac_fragments.tsv.gz
├── atac_fragments.tsv.gz.tbi
├── atac_peak_annotation.tsv
├── atac_peaks.bed
├── atac_possorted_bam.bam
├── atac_possorted_bam.bam.bai
├── cloupe.cloupe
├── filtered_feature_bc_matrix
├── filtered_feature_bc_matrix.h5
├── gex_molecule_info.h5
├── gex_possorted_bam.bam
├── gex_possorted_bam.bam.bai
├── per_barcode_metrics.csv
├── raw_feature_bc_matrix
├── raw_feature_bc_matrix.h5
├── summary.csv
└── web_summary.html

We will use gex_possorted_bam.bam for the gene-expression analysis and atac_possorted_bam.bam for the chromatin accessibility analysis with sincei. These files can also be produced as part of the cellranger count workflow for scRNA-seq or scATAC-seq data alone. For convenience, we provide a subset of this data (only chromosome 2) here:

mkdir 10x_multiome && wget -O 10x_multiome/10x_multiome_testdata.tar.gz https://figshare.com/ndownloader/files/41303289
tar -xvzf 10x_multiome/10x_multiome_testdata.tar.gz ## releases 7 files
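
As a quick sanity check, the contents of the archive can be counted without extracting it. The snippet below is a minimal sketch that assumes a gzip-compressed tar archive; count_archive_files is a hypothetical helper name and is not part of sincei.

```shell
# Hypothetical helper (not part of sincei): count the regular entries
# in a gzip-compressed tar archive, ignoring directory entries.
# $1: path to the .tar.gz file
count_archive_files() {
    # tar -t lists entry names; directory entries end with "/"
    tar -tzf "$1" | grep -vc '/$'
}
```

For example, `count_archive_files 10x_multiome/10x_multiome_testdata.tar.gz` should print the number of files that the extraction step releases.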

(optional) pre-filtering of barcodes

Most cell barcodes in droplet-based protocols (like 10x genomics) correspond to droplets that do not contain cells, and therefore have very low counts. These barcodes must be filtered out at the beginning of the analysis. Although the cellranger pipeline already provides a list of filtered barcodes, sincei can also extract per-barcode count distributions, which indicate which barcodes should be removed. This is done with the scFilterBarcodes tool.

barcodes=737K-arc-v1.txt # cellranger-arc barcodes in this case
mkdir -p sincei_output   # make sure the output directory exists
for r in 1 2
do
    bamfile=cellranger_output_rep${r}/outs/atac_possorted_bam.bam
    # write the filtered barcode list and the rank plot for this replicate
    scFilterBarcodes -p 20 -b "${bamfile}" -w "${barcodes}" \
    -o sincei_output/atac_barcodes_rep${r}.tsv \
    --minCount 100 --minMappingQuality 10 --cellTag CB \
    --rankPlot sincei_output/barcode_rankplot_rep${r}.png
done

The above example uses a whitelist of possible ATAC barcodes from the cellranger-arc workflow. See here for more details. Providing a whitelist is optional in general, but recommended for 10x genomics data.
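
To illustrate what matching against a whitelist amounts to, here is a minimal sketch that keeps only those observed barcodes that appear in a whitelist file. It is an illustration and not how sincei handles the whitelist internally. filter_by_whitelist is a hypothetical helper, and the sketch assumes that observed barcodes may carry a numeric suffix such as "-1" (as in cellranger CB tags) that the whitelist entries lack.

```shell
# Hypothetical helper (not part of sincei): keep observed barcodes found in a whitelist.
# $1: whitelist file, one barcode per line
# $2: file with observed barcodes, one per line
# Assumption: observed barcodes may end in a numeric suffix ("-1") absent from the whitelist.
filter_by_whitelist() {
    awk '
        NR == FNR { ok[$1] = 1; next }   # first file: load the whitelist
        {
            bc = $1
            sub(/-[0-9]+$/, "", bc)      # strip the suffix before the lookup
            if (bc in ok) print $1       # report the barcode as observed
        }' "$1" "$2"
}
```

A call such as `filter_by_whitelist 737K-arc-v1.txt observed_barcodes.txt` (the second file name is a placeholder) prints the observed barcodes that pass the whitelist.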

The output file contains a list of filtered barcodes that contain counts in at least -mc regions of the genome. Unlike other tools with similar options, sincei splits the data into 100 kb bins and reports whether or not a barcode has signal in those bins. This way, barcodes that have high counts but are present in only one genomic bin can also be filtered out. In most cases, the output is the same as with the usual approach of filtering by total counts. The -rp option produces the familiar knee plot of the barcode counts.
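
The difference between bin-based filtering and filtering by total counts can be sketched with a small awk script. This is an illustration of the idea only, not sincei's implementation (sincei works on the BAM file). barcodes_by_bins is a hypothetical helper, and the input is assumed to be tab-separated with chromosome, start, end and barcode in the first four columns, as in the atac_fragments.tsv.gz file listed above.

```shell
# Hypothetical helper (not part of sincei): report barcodes that have signal
# in at least a minimum number of distinct genomic bins.
# $1: bin size in bp (e.g. 100000), $2: minimum number of distinct bins
# stdin: tab-separated lines: chrom, start, end, barcode[, count]
barcodes_by_bins() {
    awk -F'\t' -v bin="$1" -v min="$2" '
        /^#/ { next }                         # skip header comment lines
        {
            key = $4 SUBSEP $1 SUBSEP int($2 / bin)
            if (!(key in seen)) {             # first hit of this barcode in this bin
                seen[key] = 1
                nbins[$4]++
            }
        }
        END {
            for (bc in nbins)
                if (nbins[bc] >= min) print bc "\t" nbins[bc]
        }' | sort
}
```

For example, `zcat atac_fragments.tsv.gz | barcodes_by_bins 100000 100` would list each barcode seen in at least 100 distinct 100 kb bins together with its bin count. A barcode with many fragments piled up in a single bin is dropped here, whereas a filter on total counts would keep it.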

2. scATAC-seq analysis

Please follow this tutorial for further analysis of scATAC-seq samples from the above data.

3. scRNA-seq analysis

Please follow :doc:`this tutorial <sincei_tutorial_10xATAC>` for further analysis of scRNA-seq samples from the above data.

4. Joint analysis

Tutorial in preparation.