scCountReads

scCountReads computes the read coverages per cell barcode for genomic regions in the provided BAM file(s). The analysis can be performed for the entire genome by running the program in ‘bins’ mode. If you want to count the read coverage for specific regions only, use the features mode instead. The standard output of scCountReads is a “.loom” file with counts, along with rowName (features) and colNames (cell barcodes).

A detailed sub-commands help is available by typing:

scCountReads bins -h

scCountReads features -h

usage: scCountReads [-h]  ...

commands

subcommands

Possible choices: bins, features

subcommands

Sub-commands

bins

The reads are counted in bins of equal size. The bin size and distance between bins can be adjusted.

scCountReads bins -bs 100000 --bamfiles file1.bam file2.bam -o results

Input/Output options

--bamfiles, -b

List of indexed bam files separated by spaces.

--barcodes, -bc

A single-column file containing barcodes (whitelist) to be used for the analysis.

--outFilePrefix, -o

Output file name prefix.

BAM processing options

--cellTag, -ct

Name of the BAM tag from which to extract barcodes.

--groupTag, -gt

In case of a groupped BAM file, such as the one containing Read Group (RG) or Sample (SM) tag,it is possible to process group the reads using the provided –groupTag argument. NOTE: In case of such input, please ensure that the –labels argument indicates the expected group labels contained in the BAM files. The –groupTag along with the –cellTag is then used to identify unique samples (cells) from the input.

--numberOfProcessors, -p

Number of processors to use. Type “max/2” to use half the maximum number of processors or “max” to use all available processors. (Default: 1)

--labels, -l

User defined labels instead of default labels from file names. Multiple labels have to be separated by a space, e.g. –labels sample1 sample2 sample3

--smartLabels

Instead of manually specifying labels for the input BAM files, this causes sincei to use the file name after removing the path and extension.

--region, -r

Region of the genome to limit the operation to - this is useful when testing parameters to reduce the computing time. The format is chr:start:end, for example –region chr10 or –region chr10:456700:891000.

--blackListFileName, -bl

A BED or GTF file containing regions that should be excluded from all analyses. Currently this works by rejecting genomic chunks that happen to overlap an entry. Consequently, for BAM files, if a read partially overlaps a blacklisted region or a fragment spans over it, then the read/fragment might still be considered. Please note that you should adjust the effective genome size, if relevant.

--binSize, -bs

Size of the bins, in bases, to calculate coverage (Default: 10000)

--distanceBetweenBins

The gap distance between bins during counting. Larger numbers can be used to sample the genome for input files with high coverage while smaller values are useful for lower coverage data. Note that if you specify a value that results in too few (<1000) reads sampled, the value will be decreased. (Default: 0)

Read Processing Options

--minMappingQuality

If set, only reads that have a mapping quality score of at least this are considered.

--samFlagInclude

Include reads based on the SAM flag. For example, to get only reads that are the first mate, use a flag of 64. This is useful to count properly paired reads only once, as otherwise the second mate will be also considered for the coverage. (Default: None)

--samFlagExclude

Exclude reads based on the SAM flag. For example, to get only reads that map to the forward strand, use –samFlagExclude 16, where 16 is the SAM flag for reads that map to the reverse strand. (Default: None)

--minFragmentLength

The minimum fragment length needed for read/pair inclusion. This option is primarily useful in ATACseq experiments, for filtering mono- or di-nucleosome fragments. (Default: 0)

--maxFragmentLength

The maximum fragment length needed for read/pair inclusion. (Default: 0)

--extendReads, -e

This parameter allows the extension of reads to fragment size. If set, each read is extended, without exception. NOTE: This feature is generally NOT recommended for spliced-read data, such as RNA-seq, as it would extend reads over skipped regions. Single-end: Requires a user specified value for the final fragment length. Reads that already exceed this fragment length will not be extended. Paired-end: Reads with mates are always extended to match the fragment size defined by the two read mates. Unmated reads, mate reads that map too far apart (>4x fragment length) or even map to different chromosomes are treated like single-end reads. The input of a fragment length value is optional. If no value is specified, it is estimated from the data (mean of the fragment size of all mate reads).

--centerReads

By adding this option, reads are centered with respect to the fragment length. For paired-end data, the read is centered at the fragment length defined by the two ends of the fragment. For single-end data, the given fragment length is used. This option is useful to get a sharper signal around enriched regions.

Read Filtering Options

--duplicateFilter

Possible choices: start_bc, start_bc_umi, start_end_bc, start_end_bc_umi

How to filter for duplicates? Different combinations (using start/end/umi) are possible. Read start position and read barcode are always considered. Default (None) would consider all reads. Note that in case of paired end data, both reads in the fragment are considered (and kept). So if you wish to keep only read1, combine this option with samFlagInclude

--motifFilter, -m

Check whether a given motif is present in the read and the corresponding reference genome. This option checks for the motif at the 5-end of the read and at the 5-overhang in the genome, which is useful in identifying reads properly cut by a restriction-enzyme or MNAse. For example, if you want to search for an “A” at the 5’-end of the read and “TA” at 5’-overhang, use “-m ‘A,TA’”. Reads not containing the given motif are filtered out.

--genome2bit, -g

If –motifFilter is provided, please also provide the genome sequence (in 2bit format).

--GCcontentFilter, -gc

Check whether the GC content of the read falls within the provided range Input format must be ‘<low>,<high>’ , where <low> is the lower bound and <high> is the upper bound in the fraction of GC (eg. ‘0.1,0.9’ ). If the GC content of the reads fall outside the range, they are filtered out.

--minAlignedFraction

Minimum fraction of the reads which should be aligned to be counted. This includes mismatches tolerated by the aligners, but excludes InDels/Clippings (Default: None)

Misc arguments

--genomeChunkSize

Manually specify the size of the genome provided to each processor. The default value of None specifies that this is determined by read density of the BAM file.

--outFileFormat

Possible choices: loom, mtx

Output file format. Default is to write an anndata object of name <prefix>.loom, which can either be opened in scanpy, or by downstream tools. “mtx” refers to the MatrixMarket sparse-matrix format. The output in this case would be <prefix>.counts.mtx, along with <prefix>.rownames.txt and <prefix>.colnames.txt

Other options

--verbose, -v

Set to see processing messages.

--version

show program’s version number and exit

features

The user provides a BED/GTF file containing all regions that should be counted. A common use would be to count scRNA-seq reads on Genes.

scCountReads features --BED selection.bed --bamfiles file1.bam file2.bam --barcodes whitelist.txt -o results

Input/Output options

--bamfiles, -b

List of indexed bam files separated by spaces.

--barcodes, -bc

A single-column file containing barcodes (whitelist) to be used for the analysis.

--BED

Limits the coverage analysis to the regions specified in these files.

--outFilePrefix, -o

Output file name prefix.

GTF/BED12 options

--metagene

When either a BED12 or GTF file are used to provide regions, perform the computation on the merged exons, rather than using the genomic interval defined by the 5-prime and 3-prime most transcript bound (i.e., columns 2 and 3 of a BED file). If a BED3 or BED6 file is used as input, then columns 2 and 3 are used as an exon. (Default: False)

--transcriptID

When a GTF file is used to provide regions, only entries with this value as their feature (column 3) will be processed as transcripts. (Default: “transcript”)

--exonID

When a GTF file is used to provide regions, only entries with this value as their feature (column 3) will be processed as exons. CDS would be another common value for this. (Default: “exon”)

--transcript_id_designator

Each region has an ID (e.g., ACTB) assigned to it, which for BED files is either column 4 (if it exists) or the interval bounds. For GTF files this is instead stored in the last column as a key:value pair (e.g., as ‘transcript_id “ACTB”’, for a key of transcript_id and a value of ACTB). In some cases it can be convenient to use a different identifier. To do so, set this to the desired key. (Default: “transcript_id”)

BAM processing options

--cellTag, -ct

Name of the BAM tag from which to extract barcodes.

--groupTag, -gt

In case of a groupped BAM file, such as the one containing Read Group (RG) or Sample (SM) tag,it is possible to process group the reads using the provided –groupTag argument. NOTE: In case of such input, please ensure that the –labels argument indicates the expected group labels contained in the BAM files. The –groupTag along with the –cellTag is then used to identify unique samples (cells) from the input.

--numberOfProcessors, -p

Number of processors to use. Type “max/2” to use half the maximum number of processors or “max” to use all available processors. (Default: 1)

--labels, -l

User defined labels instead of default labels from file names. Multiple labels have to be separated by a space, e.g. –labels sample1 sample2 sample3

--smartLabels

Instead of manually specifying labels for the input BAM files, this causes sincei to use the file name after removing the path and extension.

--region, -r

Region of the genome to limit the operation to - this is useful when testing parameters to reduce the computing time. The format is chr:start:end, for example –region chr10 or –region chr10:456700:891000.

--blackListFileName, -bl

A BED or GTF file containing regions that should be excluded from all analyses. Currently this works by rejecting genomic chunks that happen to overlap an entry. Consequently, for BAM files, if a read partially overlaps a blacklisted region or a fragment spans over it, then the read/fragment might still be considered. Please note that you should adjust the effective genome size, if relevant.

Read Processing Options

--minMappingQuality

If set, only reads that have a mapping quality score of at least this are considered.

--samFlagInclude

Include reads based on the SAM flag. For example, to get only reads that are the first mate, use a flag of 64. This is useful to count properly paired reads only once, as otherwise the second mate will be also considered for the coverage. (Default: None)

--samFlagExclude

Exclude reads based on the SAM flag. For example, to get only reads that map to the forward strand, use –samFlagExclude 16, where 16 is the SAM flag for reads that map to the reverse strand. (Default: None)

--minFragmentLength

The minimum fragment length needed for read/pair inclusion. This option is primarily useful in ATACseq experiments, for filtering mono- or di-nucleosome fragments. (Default: 0)

--maxFragmentLength

The maximum fragment length needed for read/pair inclusion. (Default: 0)

--extendReads, -e

This parameter allows the extension of reads to fragment size. If set, each read is extended, without exception. NOTE: This feature is generally NOT recommended for spliced-read data, such as RNA-seq, as it would extend reads over skipped regions. Single-end: Requires a user specified value for the final fragment length. Reads that already exceed this fragment length will not be extended. Paired-end: Reads with mates are always extended to match the fragment size defined by the two read mates. Unmated reads, mate reads that map too far apart (>4x fragment length) or even map to different chromosomes are treated like single-end reads. The input of a fragment length value is optional. If no value is specified, it is estimated from the data (mean of the fragment size of all mate reads).

--centerReads

By adding this option, reads are centered with respect to the fragment length. For paired-end data, the read is centered at the fragment length defined by the two ends of the fragment. For single-end data, the given fragment length is used. This option is useful to get a sharper signal around enriched regions.

Read Filtering Options

--duplicateFilter

Possible choices: start_bc, start_bc_umi, start_end_bc, start_end_bc_umi

How to filter for duplicates? Different combinations (using start/end/umi) are possible. Read start position and read barcode are always considered. Default (None) would consider all reads. Note that in case of paired end data, both reads in the fragment are considered (and kept). So if you wish to keep only read1, combine this option with samFlagInclude

--motifFilter, -m

Check whether a given motif is present in the read and the corresponding reference genome. This option checks for the motif at the 5-end of the read and at the 5-overhang in the genome, which is useful in identifying reads properly cut by a restriction-enzyme or MNAse. For example, if you want to search for an “A” at the 5’-end of the read and “TA” at 5’-overhang, use “-m ‘A,TA’”. Reads not containing the given motif are filtered out.

--genome2bit, -g

If –motifFilter is provided, please also provide the genome sequence (in 2bit format).

--GCcontentFilter, -gc

Check whether the GC content of the read falls within the provided range Input format must be ‘<low>,<high>’ , where <low> is the lower bound and <high> is the upper bound in the fraction of GC (eg. ‘0.1,0.9’ ). If the GC content of the reads fall outside the range, they are filtered out.

--minAlignedFraction

Minimum fraction of the reads which should be aligned to be counted. This includes mismatches tolerated by the aligners, but excludes InDels/Clippings (Default: None)

Misc arguments

--genomeChunkSize

Manually specify the size of the genome provided to each processor. The default value of None specifies that this is determined by read density of the BAM file.

--outFileFormat

Possible choices: loom, mtx

Output file format. Default is to write an anndata object of name <prefix>.loom, which can either be opened in scanpy, or by downstream tools. “mtx” refers to the MatrixMarket sparse-matrix format. The output in this case would be <prefix>.counts.mtx, along with <prefix>.rownames.txt and <prefix>.colnames.txt

Other options

--verbose, -v

Set to see processing messages.

--version

show program’s version number and exit

example usages: scCountReads bins –bamfiles file1.bam file2.bam –barcodes whitelist.txt -o results

scCountReads features –BED selection.bed –bamfiles file1.bam file2.bam –barcodes whitelist.txt -o results