Counting reads from bam files with ReadCounter#
Sincei provides the scCountReads command line tool to produce count matrices stored in AnnData format. This tool is based on the sincei.ReadCounter module which can be used directly in python.
This tutorial demonstrates how to use sincei.ReadCounter to produce read count matrices from bam files. We will use the example data provided in our sc-sortChIC analysis tutorial and our 10x multiome scRNA-seq analysis tutorial.
Both datasets can be downloaded from figshare: sc-sortChIC and 10x sc-RNAseq
You can download this tutorial as a jupyter notebook by clicking the icon on the top right bar.
The ReadCounter module contains a main class, CountReadsPerBin that takes the paths to our bamfiles as input, and returns a feature x cell count matrix in numpy.ndarray format.
[1]:
import os
from pathlib import Path
import anndata as ad
from sincei.ReadCounter import CountReadsPerBin
threads = 16
1. Counting reads in genomic bins#
First, we will produce a count matrix for a subset of the single-cell sortChIC H3K27me3 data from Zeller, Yueng et. al. (2023). We aggregate the signal in 10kb bins.
Our data is located in the sortchic_testdata directory. We can store the paths to our bamfiles and to the file containing our cell barcodes using the following commands. Note that the cell barcodes must be stored in a list of strings, where each string corresponds to one barcode.
Additionally, we need a blacklist of problematic regions in the genome to filter out sequencing artifacts. We provide a blacklist for the mm10 mouse genome in our example data.
[4]:
data_dir = 'sortchic_testdata'
bam_files = [bam_file.path for bam_file in os.scandir(data_dir) if bam_file.path.endswith('.bam')]
barcodes_path = os.path.join(data_dir, 'sortChIC_barcodes.txt')
barcodes = [line.strip() for line in open(barcodes_path)]
blacklist = os.path.join(data_dir, 'mm10_blacklist.bed')
CountReadsPerBin takes the following arguments:
The list of paths to the bamfiles
Optionally, a region to limit the counting to in the format
chrom:start:endThe length of the bins to aggregate the signal into
The step size between bins. This is the distance between the start of two consecutive bins. Set equal to
binLengthto have non-overlapping bins covering the whole region.The list of barcodes to count cells for, where each barcode corresponds to a cell in each sample.
The BAM tag where the cell barcodes are stored
The path to the blacklist BED/GTF file
The number of cores to use in the counting
We instantiate CountReadsPerBin with all the necessary parameters, and perform the counting by calling CountReadsPerBin.run(). This returns our bin x cell count matrix and numpy array of bin IDs.
Here we count the reads in 50kb bins across chromosome 1.
[5]:
cellTag = 'BC'
region = 'chr1'
bin_length = 50000
counter_bins = CountReadsPerBin(
bam_files,
region=region,
binLength=bin_length,
stepSize=bin_length,
barcodes=barcodes,
cellTag=cellTag,
blackListFileName=blacklist,
numberOfProcessors=threads,
)
bin_counts, bin_IDs = counter_bins.run()
[6]:
print(f"""
Counted {bin_counts.shape[1]} barcodes and {bin_counts.shape[0]} in chr1.
Total counts: {bin_counts.sum()}
""")
Counted 1536 barcodes and 3923 in chr1.
Total counts: 14754957.0
Be mindful that each column output matrix contains the data for one barcode in each bamfile. Some barcodes may not be present, or empty of counts, so filtering the count matrix is required to keep only barcodes corresponding to real cells. We provide a CLI tool for this purpose, scCountQC, and an example of its usage in this tutorial.
We can store our count matrix in an AnnData object for later processing and analysis.
Note that AnnData stores data in cell x feature format, so we must transpose the data matrix when assigning it to adata.X. Additionally, we can use bam_files, barcodes, and bin_IDs to add columns to the observation and variable dataframes of our AnnData.
[7]:
# Observation names and columns
sample_names = [Path(bam_file).stem for bam_file in bam_files]
samples = [sample for sample in sample_names for _ in range(len(barcodes))]
cell_IDs = [f"{sample}::{barcode}" for sample in sample_names for barcode in barcodes]
# Variable names and columns
var_names = [var.split('::')[0] for var in bin_IDs]
chroms = [var.split('_')[0] for var in var_names]
starts = [int(var.split('_')[1]) for var in var_names]
ends = [int(var.split('_')[2]) for var in var_names]
# Create anndata object
sortChIC_adata = ad.AnnData(
X=bin_counts.T,
obs={'sample': samples, 'barcode': barcodes * len(sample_names)},
var={'chrom': chroms, 'start': starts, 'end': ends},
)
sortChIC_adata.obs_names = cell_IDs
sortChIC_adata.var_names = bin_IDs
We have given a unique ID to each of the cells in our dataset by concatenating the sample they come from (with :: as a separator), and their individual cell barcode, as we can see below.
[8]:
sortChIC_adata.obs.head()
[8]:
| sample | barcode | |
|---|---|---|
| sortChIC-BM-SL2-k4me1-1::ACACACTA | sortChIC-BM-SL2-k4me1-1 | ACACACTA |
| sortChIC-BM-SL2-k4me1-1::ACACATAG | sortChIC-BM-SL2-k4me1-1 | ACACATAG |
| sortChIC-BM-SL2-k4me1-1::ACACGAGA | sortChIC-BM-SL2-k4me1-1 | ACACGAGA |
| sortChIC-BM-SL2-k4me1-1::ACACTATC | sortChIC-BM-SL2-k4me1-1 | ACACTATC |
| sortChIC-BM-SL2-k4me1-1::ACACTGAT | sortChIC-BM-SL2-k4me1-1 | ACACTGAT |
The ID of each region, corresponds to their position in the chromosome concatenated with the name of each region. Since we counted in bins, the regions lack a name, so it shows as None. In the section below, we will count reads from regions in BED/GTF files, where this will be relevant.
[9]:
sortChIC_adata.var.head()
[9]:
| chrom | start | end | |
|---|---|---|---|
| chr1_182050000_182100000::None | chr1 | 182050000 | 182100000 |
| chr1_182100000_182150000::None | chr1 | 182100000 | 182150000 |
| chr1_182150000_182200000::None | chr1 | 182150000 | 182200000 |
| chr1_182200000_182250000::None | chr1 | 182200000 | 182250000 |
| chr1_182250000_182300000::None | chr1 | 182250000 | 182300000 |
2. Counting reads in genomic features#
In addition to aggregating signal in bins, we can count the read counts for a known feature set. We provide a BED/GTF file containing our regions of interest to CountReadsPerBin instead of binLength and stepSize.
We will produce a count matrix of scATAC-seq signal in chromatin accessibility peaks for a region of chromosome 1, and a scRNA-seq count matrix for the genes in the same region. The data comes from a subset of cells of the 10x multiome dataset published in Persad et. al. (2023).
[10]:
data_dir = '10x_multiome_testdata'
barcodes_path = os.path.join(data_dir, '10x_barcodes_tutorial.txt')
barcodes = [line.strip() for line in open(barcodes_path)]
blacklist = os.path.join(data_dir, 'hg38_blacklist.v2.bed')
region = 'chr1:47000000:47300000'
scATAC-seq counts#
The file atacPeaks_replicateMerged.bedcontains chromatin accessibility peaks. Using it as input for CountReadsPerBin overrides the bin counting parameters. The rows in the resulting matrix each correspond to one region in the file.
We specify that counting should be performed only in our regions of interest, so only peaks within the region will be included.
[11]:
bam_files = [bam_file.path for bam_file in os.scandir(data_dir)
if ('atac' in bam_file.path) and bam_file.path.endswith('.bam')]
peaks_bed = os.path.join(data_dir, 'atacPeaks_replicateMerged.bed')
cellTag = 'CB'
[12]:
counter_atac = CountReadsPerBin(
bam_files,
region = region,
bedFile = peaks_bed,
blackListFileName=blacklist,
barcodes = barcodes,
cellTag = cellTag,
numberOfProcessors=threads,
)
atac_peak_counts, atac_peaks = counter_atac.run()
[13]:
print(f"""
Counted {atac_peak_counts.shape[1]} barcodes and {atac_peak_counts.shape[0]} peaks in {region}.
Total counts: {atac_peak_counts.sum()}
""")
Counted 4256 barcodes and 45 peaks in chr1:47000000:47300000.
Total counts: 32980.0
scRNA-seq#
As with our with counting our scATAC-seq reads, we provide a file with regions for counting, in this case a GTF file containing all the genes in the HG38 human genome assembly. This file can be downloaded here.
The file atacPeaks_replicateMerged.bedcontains chromatin accessibility peaks. Using it as input for CountReadsPerBin overrides the bin counting parameters. The rows in the resulting matrix each correspond to one region in the file.
We specify that counting should be performed only in our regions of interest, so only peaks within the region will be included.
[17]:
bam_files = [bam_file.path for bam_file in os.scandir(data_dir)
if ('gex' in bam_file.path) and bam_file.path.endswith('.bam')]
genes_gtf = os.path.join(data_dir, 'hg38.gtf')
cellTag = 'CB'
[18]:
counter_genes = CountReadsPerBin(
bam_files,
region = region,
bedFile = genes_gtf,
blackListFileName=blacklist,
barcodes = barcodes,
cellTag = cellTag,
numberOfProcessors=threads,
)
rna_gene_counts, genes = counter_genes.run()
[19]:
print(f"""
Counted {rna_gene_counts.shape[1]} barcodes and {rna_gene_counts.shape[0]} genes in {region}.
Total counts: {rna_gene_counts.sum()}
""")
Counted 4256 barcodes and 19 genes in chr1:47000000:47300000.
Total counts: 9890.0
As mentioned in the section before, the names of our regions consist of their genomic position, followed by the region name specified in the BED/GTF file. We can see this below, the ID of each region is followed by its corresponding transcript ID. These can easily be converted into gene names.
[40]:
print(genes)
['chr1_47004185_47050751::NM_001320290'
'chr1_47023668_47050751::NM_178033' 'chr1_47067230_47118318::NM_178134'
'chr1_47137424_47147481::NM_001308102'
'chr1_47137440_47149727::NM_001010969'
'chr1_47179249_47180339::NR_047498' 'chr1_47183581_47190036::NM_005764'
'chr1_47216289_47231715::NM_001287347'
'chr1_47216289_47231889::NM_001290404'
'chr1_47216289_47231889::NM_001290405'
'chr1_47216289_47232335::NM_001290403'
'chr1_47216289_47232335::NM_003189'
'chr1_47216290_47232225::NM_001290406'
'chr1_47250138_47314147::NM_001048166'
'chr1_47250138_47314147::NM_001282936'
'chr1_47250138_47314147::NM_001282937'
'chr1_47250138_47314147::NM_001282938'
'chr1_47250138_47314147::NM_001282939'
'chr1_47250138_47314147::NM_003035']
(Optional) Convert transcript IDs to gene IDs#
The following python function that reads a GTF file into a pandas dataframe. We will read hg38.gtf and use the resulting dataframe to convert our transcript IDs to gene IDs.
[38]:
import re
import pandas as pd
def parse_GTF(gtf_path):
def _parse_attrs(attr):
d = {}
for m in re.finditer(r'\s*([^;\s]+)\s+"([^"]+)"', str(attr)):
d[m.group(1)] = m.group(2)
return d
gtf_df = pd.read_csv(
gtf_path, sep='\t', comment='#', header=None,
names=['seqname', 'source', 'feature', 'start', 'end',
'score', 'strand', 'frame', 'attribute'])
attributes = gtf_df['attribute'].apply(_parse_attrs)
attributes_df = pd.DataFrame(list(attributes)).fillna('')
gtf_df = pd.concat([gtf_df.drop(columns=['attribute']), attributes_df], axis=1)
return gtf_df
[39]:
gtf_df = parse_GTF(genes_gtf)
gtf_df.head()
[39]:
| seqname | source | feature | start | end | score | strand | frame | gene_id | transcript_id | gene_name | exon_number | exon_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | refGene | transcript | 11874 | 14409 | . | + | . | DDX11L1 | NR_046018 | DDX11L1 | ||
| 1 | chr1 | refGene | exon | 11874 | 12227 | . | + | . | DDX11L1 | NR_046018 | DDX11L1 | 1 | NR_046018.1 |
| 2 | chr1 | refGene | exon | 12613 | 12721 | . | + | . | DDX11L1 | NR_046018 | DDX11L1 | 2 | NR_046018.2 |
| 3 | chr1 | refGene | exon | 13221 | 14409 | . | + | . | DDX11L1 | NR_046018 | DDX11L1 | 3 | NR_046018.3 |
| 4 | chr1 | refGene | transcript | 14362 | 29370 | . | - | . | WASH7P | NR_024540 | WASH7P |
The GTF dataframe contains the mapping between transcript IDs and gene IDs. By merging the list of our transcript IDs with the GTF dataframe, we can add make obtain the list of corresponding gene IDs.
[48]:
transcript_to_gene = pd.DataFrame({'transcript_id': [genes.split('::')[1] for genes in genes],})
transcript_to_gene = transcript_to_gene.merge(
gtf_df[['transcript_id', 'gene_name']].drop_duplicates(),
on='transcript_id', how='left',
)
transcript_to_gene
[48]:
| transcript_id | gene_name | |
|---|---|---|
| 0 | NM_001320290 | CYP4X1 |
| 1 | NM_178033 | CYP4X1 |
| 2 | NM_178134 | CYP4Z1 |
| 3 | NM_001308102 | CYP4A22 |
| 4 | NM_001010969 | CYP4A22 |
| 5 | NR_047498 | LINC00853 |
| 6 | NM_005764 | PDZK1IP1 |
| 7 | NM_001287347 | TAL1 |
| 8 | NM_001290404 | TAL1 |
| 9 | NM_001290405 | TAL1 |
| 10 | NM_001290403 | TAL1 |
| 11 | NM_003189 | TAL1 |
| 12 | NM_001290406 | TAL1 |
| 13 | NM_001048166 | STIL |
| 14 | NM_001282936 | STIL |
| 15 | NM_001282937 | STIL |
| 16 | NM_001282938 | STIL |
| 17 | NM_001282939 | STIL |
| 18 | NM_003035 | STIL |