Counting reads from bam files with ReadCounter#

Sincei provides the scCountReads command line tool to produce count matrices stored in AnnData format. This tool is based on the sincei.ReadCounter module which can be used directly in python.

This tutorial demonstrates how to use sincei.ReadCounter to produce read count matrices from bam files. We will use the example data provided in our sc-sortChIC analysis tutorial and our 10x multiome scRNA-seq analysis tutorial.

Both datasets can be downloaded from figshare: sc-sortChIC and 10x sc-RNAseq

You can download this tutorial as a jupyter notebook by clicking the icon on the top right bar.

The ReadCounter module contains a main class, CountReadsPerBin that takes the paths to our bamfiles as input, and returns a feature x cell count matrix in numpy.ndarray format.

[1]:

import os
from pathlib import Path
import anndata as ad

from sincei.ReadCounter import CountReadsPerBin

threads = 16

1. Counting reads in genomic bins#

First, we will produce a count matrix for a subset of the single-cell sortChIC H3K27me3 data from Zeller, Yueng et. al. (2023). We aggregate the signal in 10kb bins.

Our data is located in the sortchic_testdata directory. We can store the paths to our bamfiles and to the file containing our cell barcodes using the following commands. Note that the cell barcodes must be stored in a list of strings, where each string corresponds to one barcode.

Additionally, we need a blacklist of problematic regions in the genome to filter out sequencing artifacts. We provide a blacklist for the mm10 mouse genome in our example data.

[4]:

data_dir = 'sortchic_testdata'
bam_files = [bam_file.path for bam_file in os.scandir(data_dir) if bam_file.path.endswith('.bam')]

barcodes_path = os.path.join(data_dir, 'sortChIC_barcodes.txt')
barcodes = [line.strip() for line in open(barcodes_path)]

blacklist = os.path.join(data_dir, 'mm10_blacklist.bed')

CountReadsPerBin takes the following arguments:

The list of paths to the bamfiles
Optionally, a region to limit the counting to in the format chrom:start:end
The length of the bins to aggregate the signal into
The step size between bins. This is the distance between the start of two consecutive bins. Set equal to binLength to have non-overlapping bins covering the whole region.
The list of barcodes to count cells for, where each barcode corresponds to a cell in each sample.
The BAM tag where the cell barcodes are stored
The path to the blacklist BED/GTF file
The number of cores to use in the counting

We instantiate CountReadsPerBin with all the necessary parameters, and perform the counting by calling CountReadsPerBin.run(). This returns our bin x cell count matrix and numpy array of bin IDs.

Here we count the reads in 50kb bins across chromosome 1.

[5]:

cellTag = 'BC'
region = 'chr1'
bin_length = 50000

counter_bins = CountReadsPerBin(
    bam_files,
    region=region,
    binLength=bin_length,
    stepSize=bin_length,
    barcodes=barcodes,
    cellTag=cellTag,
    blackListFileName=blacklist,
    numberOfProcessors=threads,
    )
bin_counts, bin_IDs = counter_bins.run()

[6]:

print(f"""
Counted {bin_counts.shape[1]} barcodes and {bin_counts.shape[0]} in chr1.
Total counts: {bin_counts.sum()}
""")


Counted 1536 barcodes and 3923 in chr1.
Total counts: 14754957.0

Be mindful that each column output matrix contains the data for one barcode in each bamfile. Some barcodes may not be present, or empty of counts, so filtering the count matrix is required to keep only barcodes corresponding to real cells. We provide a CLI tool for this purpose, scCountQC, and an example of its usage in this tutorial.

We can store our count matrix in an AnnData object for later processing and analysis.

Note that AnnData stores data in cell x feature format, so we must transpose the data matrix when assigning it to adata.X. Additionally, we can use bam_files, barcodes, and bin_IDs to add columns to the observation and variable dataframes of our AnnData.

[7]:

# Observation names and columns
sample_names = [Path(bam_file).stem for bam_file in bam_files]
samples = [sample for sample in sample_names for _ in range(len(barcodes))]
cell_IDs = [f"{sample}::{barcode}" for sample in sample_names for barcode in barcodes]

# Variable names and columns
var_names = [var.split('::')[0] for var in bin_IDs]
chroms = [var.split('_')[0] for var in var_names]
starts = [int(var.split('_')[1]) for var in var_names]
ends = [int(var.split('_')[2]) for var in var_names]

# Create anndata object
sortChIC_adata = ad.AnnData(
    X=bin_counts.T,
    obs={'sample': samples, 'barcode': barcodes * len(sample_names)},
    var={'chrom': chroms, 'start': starts, 'end': ends},
    )
sortChIC_adata.obs_names = cell_IDs
sortChIC_adata.var_names = bin_IDs

We have given a unique ID to each of the cells in our dataset by concatenating the sample they come from (with :: as a separator), and their individual cell barcode, as we can see below.

[8]:

sortChIC_adata.obs.head()

[8]:

	sample	barcode
sortChIC-BM-SL2-k4me1-1::ACACACTA	sortChIC-BM-SL2-k4me1-1	ACACACTA
sortChIC-BM-SL2-k4me1-1::ACACATAG	sortChIC-BM-SL2-k4me1-1	ACACATAG
sortChIC-BM-SL2-k4me1-1::ACACGAGA	sortChIC-BM-SL2-k4me1-1	ACACGAGA
sortChIC-BM-SL2-k4me1-1::ACACTATC	sortChIC-BM-SL2-k4me1-1	ACACTATC
sortChIC-BM-SL2-k4me1-1::ACACTGAT	sortChIC-BM-SL2-k4me1-1	ACACTGAT

The ID of each region, corresponds to their position in the chromosome concatenated with the name of each region. Since we counted in bins, the regions lack a name, so it shows as None. In the section below, we will count reads from regions in BED/GTF files, where this will be relevant.

[9]:

sortChIC_adata.var.head()

[9]:

	chrom	start	end
chr1_182050000_182100000::None	chr1	182050000	182100000
chr1_182100000_182150000::None	chr1	182100000	182150000
chr1_182150000_182200000::None	chr1	182150000	182200000
chr1_182200000_182250000::None	chr1	182200000	182250000
chr1_182250000_182300000::None	chr1	182250000	182300000

2. Counting reads in genomic features#

In addition to aggregating signal in bins, we can count the read counts for a known feature set. We provide a BED/GTF file containing our regions of interest to CountReadsPerBin instead of binLength and stepSize.

We will produce a count matrix of scATAC-seq signal in chromatin accessibility peaks for a region of chromosome 1, and a scRNA-seq count matrix for the genes in the same region. The data comes from a subset of cells of the 10x multiome dataset published in Persad et. al. (2023).

[10]:

data_dir = '10x_multiome_testdata'

barcodes_path = os.path.join(data_dir, '10x_barcodes_tutorial.txt')
barcodes = [line.strip() for line in open(barcodes_path)]

blacklist = os.path.join(data_dir, 'hg38_blacklist.v2.bed')

region = 'chr1:47000000:47300000'

scATAC-seq counts#

The file atacPeaks_replicateMerged.bedcontains chromatin accessibility peaks. Using it as input for CountReadsPerBin overrides the bin counting parameters. The rows in the resulting matrix each correspond to one region in the file.

We specify that counting should be performed only in our regions of interest, so only peaks within the region will be included.

[11]:

bam_files = [bam_file.path for bam_file in os.scandir(data_dir)
             if ('atac' in bam_file.path) and bam_file.path.endswith('.bam')]

peaks_bed = os.path.join(data_dir, 'atacPeaks_replicateMerged.bed')

cellTag = 'CB'

[12]:

counter_atac = CountReadsPerBin(
    bam_files,
    region = region,
    bedFile = peaks_bed,
    blackListFileName=blacklist,
    barcodes = barcodes,
    cellTag = cellTag,
    numberOfProcessors=threads,
    )
atac_peak_counts, atac_peaks = counter_atac.run()

[13]:

print(f"""
Counted {atac_peak_counts.shape[1]} barcodes and {atac_peak_counts.shape[0]} peaks in {region}.
Total counts: {atac_peak_counts.sum()}
""")


Counted 4256 barcodes and 45 peaks in chr1:47000000:47300000.
Total counts: 32980.0

scRNA-seq#

As with our with counting our scATAC-seq reads, we provide a file with regions for counting, in this case a GTF file containing all the genes in the HG38 human genome assembly. This file can be downloaded here.

We specify that counting should be performed only in our regions of interest, so only peaks within the region will be included.

[17]:

bam_files = [bam_file.path for bam_file in os.scandir(data_dir)
             if ('gex' in bam_file.path) and bam_file.path.endswith('.bam')]

genes_gtf = os.path.join(data_dir, 'hg38.gtf')

cellTag = 'CB'

[18]:

counter_genes = CountReadsPerBin(
    bam_files,
    region = region,
    bedFile = genes_gtf,
    blackListFileName=blacklist,
    barcodes = barcodes,
    cellTag = cellTag,
    numberOfProcessors=threads,
    )
rna_gene_counts, genes = counter_genes.run()

[19]:

print(f"""
Counted {rna_gene_counts.shape[1]} barcodes and {rna_gene_counts.shape[0]} genes in {region}.
Total counts: {rna_gene_counts.sum()}
""")


Counted 4256 barcodes and 19 genes in chr1:47000000:47300000.
Total counts: 9890.0

As mentioned in the section before, the names of our regions consist of their genomic position, followed by the region name specified in the BED/GTF file. We can see this below, the ID of each region is followed by its corresponding transcript ID. These can easily be converted into gene names.

[40]:

print(genes)

['chr1_47004185_47050751::NM_001320290'
 'chr1_47023668_47050751::NM_178033' 'chr1_47067230_47118318::NM_178134'
 'chr1_47137424_47147481::NM_001308102'
 'chr1_47137440_47149727::NM_001010969'
 'chr1_47179249_47180339::NR_047498' 'chr1_47183581_47190036::NM_005764'
 'chr1_47216289_47231715::NM_001287347'
 'chr1_47216289_47231889::NM_001290404'
 'chr1_47216289_47231889::NM_001290405'
 'chr1_47216289_47232335::NM_001290403'
 'chr1_47216289_47232335::NM_003189'
 'chr1_47216290_47232225::NM_001290406'
 'chr1_47250138_47314147::NM_001048166'
 'chr1_47250138_47314147::NM_001282936'
 'chr1_47250138_47314147::NM_001282937'
 'chr1_47250138_47314147::NM_001282938'
 'chr1_47250138_47314147::NM_001282939'
 'chr1_47250138_47314147::NM_003035']

(Optional) Convert transcript IDs to gene IDs#

The following python function that reads a GTF file into a pandas dataframe. We will read hg38.gtf and use the resulting dataframe to convert our transcript IDs to gene IDs.

[38]:

import re
import pandas as pd

def parse_GTF(gtf_path):

    def _parse_attrs(attr):
        d = {}
        for m in re.finditer(r'\s*([^;\s]+)\s+"([^"]+)"', str(attr)):
            d[m.group(1)] = m.group(2)
        return d

    gtf_df = pd.read_csv(
        gtf_path, sep='\t', comment='#', header=None,
        names=['seqname', 'source', 'feature', 'start', 'end',
               'score', 'strand', 'frame', 'attribute'])

    attributes = gtf_df['attribute'].apply(_parse_attrs)
    attributes_df = pd.DataFrame(list(attributes)).fillna('')

    gtf_df = pd.concat([gtf_df.drop(columns=['attribute']), attributes_df], axis=1)

    return gtf_df

[39]:

gtf_df = parse_GTF(genes_gtf)
gtf_df.head()

[39]:

	seqname	source	feature	start	end	score	strand	frame	gene_id	transcript_id	gene_name	exon_number	exon_id
0	chr1	refGene	transcript	11874	14409	.	+	.	DDX11L1	NR_046018	DDX11L1
1	chr1	refGene	exon	11874	12227	.	+	.	DDX11L1	NR_046018	DDX11L1	1	NR_046018.1
2	chr1	refGene	exon	12613	12721	.	+	.	DDX11L1	NR_046018	DDX11L1	2	NR_046018.2
3	chr1	refGene	exon	13221	14409	.	+	.	DDX11L1	NR_046018	DDX11L1	3	NR_046018.3
4	chr1	refGene	transcript	14362	29370	.	-	.	WASH7P	NR_024540	WASH7P

The GTF dataframe contains the mapping between transcript IDs and gene IDs. By merging the list of our transcript IDs with the GTF dataframe, we can add make obtain the list of corresponding gene IDs.

[48]:

transcript_to_gene = pd.DataFrame({'transcript_id': [genes.split('::')[1] for genes in genes],})

transcript_to_gene = transcript_to_gene.merge(
    gtf_df[['transcript_id', 'gene_name']].drop_duplicates(),
    on='transcript_id', how='left',
)
transcript_to_gene

[48]:

	transcript_id	gene_name
0	NM_001320290	CYP4X1
1	NM_178033	CYP4X1
2	NM_178134	CYP4Z1
3	NM_001308102	CYP4A22
4	NM_001010969	CYP4A22
5	NR_047498	LINC00853
6	NM_005764	PDZK1IP1
7	NM_001287347	TAL1
8	NM_001290404	TAL1
9	NM_001290405	TAL1
10	NM_001290403	TAL1
11	NM_003189	TAL1
12	NM_001290406	TAL1
13	NM_001048166	STIL
14	NM_001282936	STIL
15	NM_001282937	STIL
16	NM_001282938	STIL
17	NM_001282939	STIL
18	NM_003035	STIL

Counting reads from bam files with ReadCounter

Contents