scBulkCoverage

This tool takes alignments of reads or fragments (BAM files) as input, along with cell grouping information (for example barcode -> batch or barcode -> cluster) as a tsv file, and generates one coverage track (bigWig or bedGraph) per group as output. The coverage is calculated as the number of reads per bin, where bins are short consecutive counting windows of a defined size. It is possible to extend or change the length of the reads to better reflect the actual fragment length. scBulkCoverage offers normalization per cluster using different methods.
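The "number of reads per bin, per group" idea can be illustrated with a small Python sketch. It is not the tool's implementation: the function name `binned_coverage`, the `(barcode, start, end)` read tuples, and the choice to count a read in every bin it overlaps are all assumptions made for illustration.

```python
from collections import defaultdict


def binned_coverage(reads, barcode_to_group, bin_size=100):
    """Count reads per bin for each group (illustrative sketch only).

    reads: iterable of (barcode, start, end) tuples, 0-based half-open.
    Returns {group: {bin_index: read_count}}.
    """
    coverage = defaultdict(lambda: defaultdict(int))
    for barcode, start, end in reads:
        group = barcode_to_group.get(barcode)
        if group is None:
            continue  # barcode is absent from the grouping table
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size
        # Assumption: a read contributes to every bin it overlaps.
        for bin_index in range(first_bin, last_bin + 1):
            coverage[group][bin_index] += 1
    return {group: dict(bins) for group, bins in coverage.items()}


reads = [("AAAC", 10, 60), ("AAAC", 90, 140), ("GGGT", 250, 300)]
groups = {"AAAC": "cluster_1", "GGGT": "cluster_2"}
result = binned_coverage(reads, groups, bin_size=100)
```

Here the second read of barcode AAAC spans the boundary between bins 0 and 1, so it is counted in both under the stated assumption.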

Usage example:

$ scBulkCoverage -b file1.bam file2.bam --labels file1 file2 -i scClusterCells_output.tsv -o coverage.bw

Input/Output options

--bamfiles, -b

List of indexed bam files separated by spaces.

--groupInfo, -i

A 3-column tsv file with Cell grouping information in the format: sample, barcode, group. Coverages will be computed per group.
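As a rough illustration of the expected 3-column layout, the following Python sketch reads such a table into a lookup keyed by (sample, barcode). The example rows, the helper name `read_group_info`, and the absence of a header row are assumptions, not something the tool documents here.

```python
import csv
import io

# Hypothetical file content: sample, barcode, group (tab-separated, no header).
example_tsv = (
    "sample1\tAAACGT\tcluster_1\n"
    "sample1\tGGGTCA\tcluster_2\n"
    "sample2\tAAACGT\tcluster_1\n"
)


def read_group_info(handle):
    """Map (sample, barcode) -> group from a 3-column tsv handle."""
    groups = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 3:
            raise ValueError(f"expected 3 columns, got {len(row)}: {row}")
        sample, barcode, group = row
        groups[(sample, barcode)] = group
    return groups


group_info = read_group_info(io.StringIO(example_tsv))
```

Keying on both sample and barcode keeps identical barcodes from different samples apart.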

--outFilePrefix, -o

Output file name prefix.

BAM processing options

--cellTag, -ct

Name of the BAM tag from which to extract barcodes.

--groupTag, -gt

In case of a grouped BAM file, such as one containing Read Group (RG) or Sample (SM) tags, it is possible to group the reads using the --groupTag argument. NOTE: In case of such input, please ensure that the --labels argument indicates the expected group labels contained in the BAM files. The --groupTag along with the --cellTag is then used to identify unique samples (cells) from the input.

--numberOfProcessors, -p

Number of processors to use. Type “max/2” to use half the maximum number of processors or “max” to use all available processors. (Default: 1)

--labels, -l

User-defined labels to use instead of the default labels derived from file names. Multiple labels have to be separated by a space, e.g. --labels sample1 sample2 sample3

--smartLabels

Instead of manually specifying labels for the input BAM files, this causes sincei to use the file name after removing the path and extension.

--region, -r

Region of the genome to limit the operation to. This is useful when testing parameters, as it reduces the computing time. The format is chr:start:end, for example --region chr10 or --region chr10:456700:891000.
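The two accepted forms of the region string (chromosome only, or chr:start:end) can be sketched in Python as follows. The function `parse_region` is hypothetical and only mirrors the format described above; how the tool itself validates the string is not stated.

```python
def parse_region(region):
    """Split 'chr' or 'chr:start:end' into (chrom, start, end).

    start and end are None when only a chromosome name is given.
    """
    parts = region.split(":")
    if len(parts) == 1:
        return parts[0], None, None
    if len(parts) == 3:
        chrom, start, end = parts[0], int(parts[1]), int(parts[2])
        if start >= end:
            raise ValueError(f"start must be smaller than end: {region}")
        return chrom, start, end
    raise ValueError(f"expected 'chr' or 'chr:start:end', got: {region}")


whole_chromosome = parse_region("chr10")
sub_region = parse_region("chr10:456700:891000")
```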

--blackListFileName, -bl

A BED or GTF file containing regions that should be excluded from all analyses. Currently this works by rejecting genomic chunks that happen to overlap an entry. Consequently, for BAM files, if a read partially overlaps a blacklisted region or a fragment spans over it, then the read/fragment might still be considered. Please note that you should adjust the effective genome size, if relevant.

--binSize, -bs

Size of the bins, in bases, to calculate coverage (Default: 100)

--distanceBetweenBins

The gap distance between bins during counting. Larger numbers can be used to sample the genome for input files with high coverage while smaller values are useful for lower coverage data. Note that if you specify a value that results in too few (<1000) reads sampled, the value will be decreased. (Default: None)
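To make the relationship between bin size and gap concrete, here is a minimal Python sketch that lists the counting windows for a region. The helper `bin_intervals` and its treatment of the last, possibly shorter, bin are assumptions; the tool's own sampling logic (including the automatic decrease for sparsely covered input) is not reproduced.

```python
def bin_intervals(region_start, region_end, bin_size=100, distance_between_bins=0):
    """Return half-open (start, end) bins separated by a fixed gap."""
    step = bin_size + distance_between_bins
    return [
        (start, min(start + bin_size, region_end))
        for start in range(region_start, region_end, step)
    ]


contiguous = bin_intervals(0, 500, bin_size=100, distance_between_bins=0)
sampled = bin_intervals(0, 500, bin_size=100, distance_between_bins=100)
```

With a gap equal to the bin size, only every second window is counted, which halves the work on high-coverage input.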

Read Filtering Options

--duplicateFilter

Possible choices: start_bc, start_bc_umi, start_end_bc, start_end_bc_umi

How to filter for duplicates. Different combinations (using start/end/umi) are possible. Read start position and read barcode are always considered. The default (None) keeps all reads. Note that in case of paired-end data, both reads in the fragment are considered (and kept). So if you wish to keep only read1, combine this option with --samFlagInclude.
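The four choices differ only in which read attributes make up the duplicate key. A minimal Python sketch of that idea, assuming reads are plain dictionaries with the hypothetical keys `chrom`, `start`, `end`, `bc` and `umi`, and assuming the chromosome is part of the key:

```python
def filter_duplicates(reads, mode=None):
    """Keep the first read seen for each duplicate key.

    mode is one of 'start_bc', 'start_bc_umi', 'start_end_bc',
    'start_end_bc_umi', or None to keep every read.
    """
    if mode is None:
        return list(reads)
    fields = mode.split("_")  # e.g. 'start_bc_umi' -> ['start', 'bc', 'umi']
    seen = set()
    kept = []
    for read in reads:
        # Assumption: the chromosome is always part of the key.
        key = (read["chrom"],) + tuple(read[field] for field in fields)
        if key in seen:
            continue
        seen.add(key)
        kept.append(read)
    return kept


reads = [
    {"chrom": "chr1", "start": 100, "end": 150, "bc": "AAAC", "umi": "TTG"},
    {"chrom": "chr1", "start": 100, "end": 160, "bc": "AAAC", "umi": "CCA"},
    {"chrom": "chr1", "start": 100, "end": 150, "bc": "GGGT", "umi": "TTG"},
]
kept_start_bc = filter_duplicates(reads, "start_bc")
kept_start_bc_umi = filter_duplicates(reads, "start_bc_umi")
```

The first two reads share start and barcode, so `start_bc` treats them as duplicates, while adding the UMI to the key separates them again.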

--motifFilter, -m

Check whether a given motif is present in the read and the corresponding reference genome. This option checks for the motif at the 5’-end of the read and at the 5’-overhang in the genome, which is useful in identifying reads properly cut by a restriction enzyme or MNase. For example, if you want to search for an “A” at the 5’-end of the read and “TA” at the 5’-overhang, use -m 'A,TA'. Reads not containing the given motif are filtered out.

--genome2bit, -g

If --motifFilter is provided, please also provide the genome sequence (in 2bit format).

--GCcontentFilter, -gc

Check whether the GC content of the read falls within the provided range. Input format must be ‘<low>,<high>’, where <low> is the lower bound and <high> is the upper bound, both given as a fraction of GC (e.g. ‘0.1,0.9’). Reads whose GC content falls outside the range are filtered out.
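A GC-fraction check of this kind can be sketched in a few lines of Python. The helper names are hypothetical, and treating both bounds as inclusive is an assumption, since the text above does not say how boundary values are handled.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases in a read sequence (0.0 for empty input)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def passes_gc_filter(sequence, gc_range="0.1,0.9"):
    """Return True if the GC fraction lies within '<low>,<high>'."""
    low, high = (float(value) for value in gc_range.split(","))
    # Assumption: both bounds are inclusive.
    return low <= gc_fraction(sequence) <= high


balanced = passes_gc_filter("ACGTACGT")      # GC fraction 0.5
at_only = passes_gc_filter("AAAAAAAAAAAA")   # GC fraction 0.0
gc_only = passes_gc_filter("GGGGCCCC")       # GC fraction 1.0
```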

--minAlignedFraction

Minimum fraction of a read that must be aligned for the read to be counted. This includes mismatches tolerated by the aligners, but excludes InDels/Clippings. (Default: None)
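One way to picture this fraction is to derive it from a CIGAR string: matched or mismatched bases (M, =, X) count as aligned, while insertions and soft clips add to the read length only. The Python sketch below follows that reading; the function `aligned_fraction` and the exact set of operations it counts are assumptions, not the tool's documented behaviour.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_fraction(cigar):
    """Aligned bases divided by read length, derived from a CIGAR string."""
    aligned = 0
    read_length = 0
    for length, operation in CIGAR_ELEMENT.findall(cigar):
        length = int(length)
        if operation in "M=X":
            aligned += length       # matches and mismatches
        if operation in "MIS=X":
            read_length += length   # operations that consume read bases
    return aligned / read_length if read_length else 0.0


full_match = aligned_fraction("50M")
soft_clipped = aligned_fraction("40M10S")
with_insertion = aligned_fraction("20M5I25M")
```

A read with CIGAR 40M10S would then fail a threshold of 0.9, because only 40 of its 50 bases are aligned.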

Read Processing Options

--minMappingQuality

If set, only reads that have a mapping quality score of at least this are considered.

--samFlagInclude

Include reads based on the SAM flag. For example, to get only reads that are the first mate, use a flag of 64. This is useful to count properly paired reads only once, as otherwise the second mate will be also considered for the coverage. (Default: None)

--samFlagExclude

Exclude reads based on the SAM flag. For example, to get only reads that map to the forward strand, use --samFlagExclude 16, where 16 is the SAM flag for reads that map to the reverse strand. (Default: None)
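SAM flags are bit fields, so include and exclude filters reduce to bitwise tests. The Python sketch below shows the usual interpretation (all include bits must be set, no exclude bit may be set); the function `keep_read` is hypothetical, and the tool's own check is assumed, not confirmed, to behave the same way.

```python
FIRST_IN_PAIR = 64    # SAM flag bit 0x40
REVERSE_STRAND = 16   # SAM flag bit 0x10


def keep_read(flag, sam_flag_include=None, sam_flag_exclude=None):
    """Apply include/exclude SAM flag filters to a single read flag."""
    if sam_flag_include is not None and (flag & sam_flag_include) != sam_flag_include:
        return False  # at least one required bit is missing
    if sam_flag_exclude is not None and (flag & sam_flag_exclude) != 0:
        return False  # at least one forbidden bit is set
    return True


# 99  = paired, proper pair, mate reverse, first in pair
# 147 = paired, proper pair, read reverse, second in pair
first_mate_kept = keep_read(99, sam_flag_include=FIRST_IN_PAIR)
second_mate_kept = keep_read(147, sam_flag_include=FIRST_IN_PAIR)
forward_kept = keep_read(99, sam_flag_exclude=REVERSE_STRAND)
reverse_kept = keep_read(147, sam_flag_exclude=REVERSE_STRAND)
```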

--minFragmentLength

The minimum fragment length needed for read/pair inclusion. This option is primarily useful in ATAC-seq experiments, for filtering mono- or di-nucleosome fragments. (Default: 0)

--maxFragmentLength

The maximum fragment length allowed for read/pair inclusion. (Default: 0)

--filterRNAstrand

Possible choices: forward, reverse

Selects RNA-seq reads (single-end or paired-end) originating from genes on the given strand. This option assumes a standard dUTP-based library preparation (that is, --filterRNAstrand=forward keeps minus-strand reads, which originally came from genes on the forward strand using a dUTP-based method). Consider using --samFlagExclude instead for filtering by strand in other contexts.

--extendReads, -e

This parameter allows the extension of reads to fragment size. If set, each read is extended, without exception. NOTE: This feature is generally NOT recommended for spliced-read data, such as RNA-seq, as it would extend reads over skipped regions.

Single-end: Requires a user-specified value for the final fragment length. Reads that already exceed this fragment length will not be extended.

Paired-end: Reads with mates are always extended to match the fragment size defined by the two read mates. Unmated reads, mate reads that map too far apart (>4x fragment length) or even map to different chromosomes are treated like single-end reads. The input of a fragment length value is optional. If no value is specified, it is estimated from the data (mean of the fragment size of all mate reads).
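For the single-end case, extension can be sketched as stretching the read in its direction of sequencing until it reaches the requested fragment length. The Python function below is a hypothetical illustration; clamping at position 0 and the strand handling are assumptions about reasonable behaviour, not a description of the tool's code.

```python
def extend_read(start, end, is_reverse, fragment_length):
    """Extend a single-end read (0-based, half-open) to fragment_length.

    Reads that are already at least fragment_length long are returned unchanged.
    """
    if end - start >= fragment_length:
        return start, end
    if is_reverse:
        # Reverse-strand reads are extended towards smaller coordinates.
        return max(0, end - fragment_length), end
    # Forward-strand reads are extended towards larger coordinates.
    return start, start + fragment_length


forward = extend_read(100, 150, is_reverse=False, fragment_length=200)
reverse = extend_read(500, 550, is_reverse=True, fragment_length=200)
already_long = extend_read(100, 400, is_reverse=False, fragment_length=200)
```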

--centerReads

By adding this option, reads are centered with respect to the fragment length. For paired-end data, the read is centered at the fragment length defined by the two ends of the fragment. For single-end data, the given fragment length is used. This option is useful to get a sharper signal around enriched regions.
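Centering can be pictured as keeping the read length but moving the read to the middle of its fragment. The Python sketch below uses a hypothetical helper `center_read`; the rounding and the clamping to the fragment boundaries are assumptions, since the exact rule is not given here.

```python
def center_read(fragment_start, fragment_end, read_length):
    """Place an interval of read_length at the centre of a fragment.

    Coordinates are 0-based and half-open; the result never leaves the fragment.
    """
    center = fragment_start + (fragment_end - fragment_start) // 2
    new_start = max(fragment_start, center - read_length // 2)
    new_end = min(fragment_end, new_start + read_length)
    return new_start, new_end


centered = center_read(100, 300, read_length=50)
short_fragment = center_read(100, 130, read_length=50)
```

Because every read now covers only the middle of its fragment, signal from overlapping fragments piles up around the fragment midpoints, which is what sharpens peaks.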

Other options

--verbose, -v

Set to see processing messages.

--version

Show program’s version number and exit.