Analysis of 10x genomics scATAC-seq data using sincei#

The data used here is produced in our 10x genomics multiome tutorial.

We will use the BAM files, barcodes and peaks.bed file obtained from the cellranger-arc workflow. Alternatively, the BAM file (tagged with cell barcodes) and peaks.bed can be obtained from a custom mapping/peak calling workflow, and the list of filtered cell barcodes can be obtained using sincei (see the parent tutorial for explanation).

Define common bash variables:

cd 10x_multiome_testdata
# create dir
mkdir sincei_output/atac

Additionally, we will need a blacklist file to filter out reads aligned to known problematic regions of the human genome. This file and blacklists for other genome assemblies can be downloaded from the Boyle lab ENCODE blacklist repository.

wget -O hg38-blacklist.v2.bed.gz https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz

2. Quality control - I (read-level)#

In order to identify high quality cells for our analysis, we can use the read-level quality statistics from scFilterStats. Low quality cells in this data can be identified using several criteria, such as:

  • high number of PCR duplicates (filtered using --duplicateFilter)

  • high fraction of reads aligned to blacklisted regions (filtered using --blacklist)

  • high fraction of reads with poor mapping quality (filtered using --minMappingQuality)

  • vey high/low GC content of the aligned reads, indicating the reads were mostly aligned to low-complexity regions (filtered using --GCcontentFilter)

  • high level of secondary/supplementary alignments (filtered using --samFlagExclude/Include)

for rep in rep1 rep2
do
   bamfile=cellranger_output_${rep}_atac_possorted_bam.bam
   barcodes=sincei_output/atac/atac_barcodes_filtered_${rep}.txt # from sincei or cellranger output
   blacklist=hg38-blacklist.v2.bed

   scFilterStats -p 8 \
      --region chr1 \
      --GCcontentFilter '0.2,0.8' \
      --minMappingQuality 10 \
      --samFlagInclude 64 \
      --samFlagExclude 256 \
      --samFlagExclude 2048 \
      --blacklist ${blacklist}  \
      --barcodes ${barcodes} \
      --cellTag CB \
      --label atac_${rep} \
      -o sincei_output/atac/scFilterStats_output_${rep}.tsv \
      -b ${bamfile}
done

scFilterStats summarizes these outputs as a table, which can then be visualized using the MultiQC tool, to select appropriate list of cells to include for counting.

multiqc sincei_output/atac/ # results in multiqc_report.html

3. Signal aggregation (counting reads)#

Below we aggregate the signal from single-cells, using the atac_peaks.bed file, which contains regions of local signal enrichment detected using the cellranger-arc workflow.

Since we get a seperate peaks file from each replicate, we can run sincei on the union of the peaks detected on each sample.

If needed, we can use the same parameters as in scFilterStats to count only high quality reads from our whitelist of barcodes. We avoid counting reads in blacklisted regions of the human genome.

for rep in rep1 rep2
do
   bamfile=cellranger_output_${rep}_atac_possorted_bam.bam
   barcodes=sincei_output/atac/atac_barcodes_filtered_${rep}.txt # from sincei or cellranger output
   peaks_bed=atacPeaks_replicateMerged.bed

   scCountReads features -p 8 \
      --region chr1 \
      --BED ${peaks_bed} \
      --minMappingQuality 10 \
      --samFlagInclude 64 \
      --samFlagExclude 2048 \
      --extendReads \
      --blacklist ${blacklist} \
      --barcodes ${barcodes} \
      --cellTag CB \
      -o sincei_output/atac/scCounts_atac_peaks_${rep} \
      --label atac_${rep} \
      -b ${bamfile}
done
# Number of features found: 33740

4. Quality control - II (count-level)#

After counting, it is recommended to perform QC on these counts, in order to filter regions and cells that have low counts, or have low enrichment of counts. Even though we already performed read-level QC before, the counts distribution on our specified regions (bins/genes/peaks) could be different from the whole-genome stats.

We can run scCountQC on the count data to get various statistics at region and cell level. Running this tool with the --describe flag lists the metrics that can be used to filter cells/regions.

# list the metrics we can use to filter cells/regions
for rep in rep1 rep2
do
   scCountQC -i sincei_output/atac/scCounts_atac_peaks_${rep}.h5ad --describe
done

The tool scCountQC can be used for count-level QC and filtering of count data. With the --outMetrics option, the tool outputs the count statistics at region and cell level (labelled as <prefix>.regions.tsv and <prefix>.cells.tsv). Just like scFilterStats, these outputs can then be visualized using the MultiQC tool, to select appropriate metrics to filter out the unwanted cells/regions.

# export the single-cell level metrics
for rep in rep1 rep2
do
   scCountQC -i sincei_output/atac/scCounts_atac_peaks_${rep}.h5ad -om sincei_output/atac/countqc_atac_peaks_${rep}
done

# visualize output using multiQC
multiqc sincei_output/atac/ # see results in multiqc_report.html

In this example, we detect ~18.5k peaks in ~13.5K cells.

Below, we perform some basic filtering using scCountQC. We exclude the cells with <500 and >10000 detected peaks (--filterRegionArgs). We also exclude peaks detected in too few cells (<100) or in >90% of cells (--filterCellArgs).

for rep in rep1 rep2
do
   scCountQC -i sincei_output/atac/scCounts_atac_peaks_${rep}.h5ad \
      -o sincei_output/atac/scCounts_atac_peaks_filtered_${rep}.h5ad \
      -om sincei_output/atac/scCounts_atac_peaks_filtered_${rep} \
      --filterRegionArgs "n_cells_by_counts: 100, 5500" \
      --filterCellArgs "n_genes_by_counts: 200, 5000"
done

## rep 1
# Applying filters
# Cells post-filtering: 1092
# Features post-filtering: 7642
## rep 2
# Applying filters
# Cells post-filtering: 1038
# Features post-filtering: 7923

5. Combine counts for the 2 replicates#

While it is recommended to perform count QC separately for each replicate, the counts can now be combined into one file for more convenient for downstream analysis. We provide a tool scCombineCounts, which can concatenate counts for cells based on common features (in multi-sample mode). It can also be used to concatenate different modalities based on a common set of cells (in multi-modal mode).

Concatenating the filtered cells for the 2 replicates results in a total of ~12K cells.

scCombineCounts \
   -i sincei_output/atac/scCounts_atac_peaks_filtered_rep1.h5ad \
   sincei_output/atac/scCounts_atac_peaks_filtered_rep2.h5ad \
   -o sincei_output/atac/scCounts_atac_peaks_filtered.merged.h5ad \
   --method multi-sample \
   --labels rep1 rep2
# Combined cells: 2130
# Combined features: 7399

5. Dimensionality reduction and clustering#

The tool scClusterCells provides a range of options to reduce the dimensionality of our count data, while preserving biological signal. This can be specified with --method option. Below, we will use a topic modeling method called Latent Semantic Analysis (LSA) to reduce the dimensionality of our data to 30 topics (the default). The tool then performs Leiden clustering, and presents a UMAP (dimensionality reduction to 2 dimensions) plot of the output (--outFileUMAP option). This option also creates a tsv file with the UMAP coordinates and assigned cluster for each cell in our data.

scClusterCells -i sincei_output/atac/scCounts_atac_peaks_filtered.merged.h5ad \
   -m LSA \
   -n 30 \
   --clusterResolution 1 \
   -op sincei_output/atac/scClusterCells_UMAP.png \
   -o sincei_output/atac/scCounts_atac_peaks_clustered.h5ad
# Coherence Score:  -1.6229714432102043
# also produces the tsv file "sincei_output/scClusterCells_UMAP.tsv"

(optional) Confirmation of clustering using metadata#

Below, we load this data in R and compare it to the cell metadata provided with our files to verify that our clustering separates celltypes in a biologically meaningful way.

We can color our UMAP output from scClusterCells with the cell-type information from Persad et. al. (2023), that we provide on figshare.

Clustering validation (click for Python code)
import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt

atac_metadata = pd.read_csv('metadata_cd34_atac.csv', header=0, index_col=0)

atac_adata = sc.read_h5ad('sincei_output/atac/scCounts_atac_peaks_clustered.h5ad')
atac_adata.obs_names = atac_adata.obs_names.str.replace('rep1_|rep2_', '', regex=True)

atac_adata.obs = atac_adata.obs.merge(atac_metadata['celltype'], left_index=True, right_index=True, how='left')

# make plots
colors = plt.colormaps['Paired'].colors
colors_dict = {
   # leiden clusters
   '0': colors[1],
   '1': colors[5],
   '2': colors[0],
   '3': colors[2],
   '4': colors[4],
   '5': colors[3],
   '6': colors[6],
   # published cell types
   'CLP': colors[4],
   'Ery': colors[6],
   'HMP': colors[5],
   'HSC': colors[1],
   'MEP': colors[2],
   'Mono': colors[0],
   'cDC': colors[3],
   'pDC': colors[7],
   }

sc.pl.umap(
   atac_adata,
   color=['leiden', 'celltype'],
   palette=colors_dict,
   title=['sincei Clusters (LSA + Leiden)', 'Published Cell Types'],
   legend_fontsize=14,
   legend_loc='on data',
   frameon=False,
   size=60,
   )

for ax in plt.gcf().axes:
   ax.title.set_size(fontsize=16)

plt.savefig('sincei_output/atac/UMAP_compared_withOrig.png', dpi=300, bbox_inches='tight')
Clustering validation (click for R code)
library(dplyr)
library(magrittr)
library(ggplot2)
library(patchwork)

umap <- read.delim("sincei_output/atac/scClusterCells_UMAP.tsv")
meta <- read.csv("metadata_cd34_atac.csv", row.names = 1)
umap$celltype <- meta[gsub("rep1_|rep2_", "", umap$Cell_ID), "celltype"]

# keep only cells with published labels
umap %<>% filter(!is.na(celltype))

## make plots
df_center <- group_by(umap, cluster) %>% summarise(UMAP1 = mean(UMAP1), UMAP2 = mean(UMAP2))
df_center2 <- group_by(umap, celltype) %>% summarise(UMAP1 = mean(UMAP1), UMAP2 = mean(UMAP2))

# colors for metadata (8 celltypes)
col_pallette <- RColorBrewer::brewer.pal(8, "Paired")
names(col_pallette) <- unique(umap$celltype) # grey is for NA

# colors for sincei UMAP (5 clusters)
colors_cluster <- RColorBrewer::brewer.pal(5, "Paired")
names(colors_cluster) <- unique(umap$cluster)

p1 <- umap %>% ggplot(., aes(UMAP1, UMAP2, color=factor(cluster),
label=cluster)) + geom_point() +
geom_label(data = df_center, aes(UMAP1, UMAP2), fill = "white") +
scale_color_manual(values = colors_cluster) +
theme_void(base_size = 12) + theme(legend.position = "none") +
ggtitle("sincei Clusters (LSA + Leiden)")

p2 <- umap %>% filter(!is.na(celltype)) %>% ggplot(., aes(UMAP1, UMAP2,
color=factor(celltype), label=celltype)) + geom_point() +
geom_label(data = df_center2, aes(UMAP1, UMAP2), fill = "white") +
scale_color_manual(values = col_pallette) + labs(color="Cluster") +
theme_void(base_size = 12) + theme(legend.position = "none") +
ggtitle("Published Cell Types")


pl <- p1 + p2
pl
ggsave(plot=pl, "sincei_output/atac/UMAP_compared_withOrig.png", dpi=300, width = 11, height = 6)
../../_images/UMAP_compared_withOrig_10xATAC.png

The figure above shows that we can somewhat replicate the expected cell-type results from the scATAC data using sincei. This was done using only 1/23th of original data (chromosome 1) and basic pre-processing steps, therefore the results should only improve with full data, better cell/region filtering and optimizing the analysis parameters.

6. Creating bigwigs and visualizing signal on IGV#

For further exploration of data, it can be useful to create pseudo-bulk coverage files (bigwigs) that aggregate the signal across cells in our each of our clusters. The tool scBulkCoverage takes the clustering information .tsv file produced by scClusterCells, along with the corresponding BAM files, and aggregates the signal to create these bigwigs.

The parameters here are same as other sincei tools that work on BAM files, except that we can output normalized bulk signal (specified using --normalizeUsing option) . Below, we produce CPM-normalized bigwigs at 1kb resolution.

scBulkCoverage -p 8 \
   --cellTag CB \
   --normalizeUsing CPM \
   --binSize 1000 \
   --minMappingQuality 10 \
   --samFlagInclude 64 \
   --samFlagExclude 2048 \
   --duplicateFilter 'start_bc_umi' \
   --extendReads \
   -b cellranger_output_rep1_atac_possorted_bam.bam \
   cellranger_output_rep2_atac_possorted_bam.bam \
   --labels rep1_atac_rep1 rep2_atac_rep2 \
   -i sincei_output/atac/scClusterCells_UMAP.tsv \
   -o sincei_output/atac/sincei_cluster
# creates 5 files with names "sincei_cluster_<X>.bw" where X is 0, 1... 4

We can now inspect our bigwigs on IGV. We can clearly see some regions with cell-type specific signal, such as the marker TAL1 (erythroid) described in the original manuscript.

../../_images/igv_snapshot_10xATAC.png