scFilterStats

This tool estimates the number of reads that would be filtered under a given set of options and prints the result to the terminal. It also tracks the number of singleton reads. The following metrics are always tracked, regardless of the options you specify, and are reported in this order:

Total reads (including unmapped)

Mapped reads

Reads in blacklisted regions (--blackListFileName)

The following metrics are estimated according to the --binSize and --distanceBetweenBins parameters:

Estimated mapped reads filtered (the total number of mapped reads filtered for any reason)
Alignments with a below-threshold MAPQ (--minMappingQuality)
Alignments with at least one missing flag (--samFlagInclude)
Alignments with undesirable flags (--samFlagExclude)
Duplicates determined by sincei (--duplicateFilter)
Duplicates marked externally (e.g., by picard)
Singletons (paired-end reads with only one mate aligning)
Wrong strand (due to --filterRNAstrand)

The sum of these may be more than the total number of reads, because a single alignment can be counted under more than one category. Note that alignments are sampled from bins of size --binSize spaced --distanceBetweenBins apart.
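To get a feel for how much of a chromosome the sampling touches, the bin layout can be sketched in a few lines of Python. This is only an illustration, not sincei's implementation: the function name `sampled_bins` is hypothetical, and it assumes that --distanceBetweenBins is the gap between the end of one bin and the start of the next.

```python
def sampled_bins(chrom_length, bin_size=100_000, distance_between_bins=1_000_000):
    """Return (start, end) intervals of the bins that would be sampled.

    Assumption: consecutive bin starts are bin_size + distance_between_bins
    apart. The defaults mirror the documented defaults of --binSize and
    --distanceBetweenBins.
    """
    step = bin_size + distance_between_bins
    bins = []
    for start in range(0, chrom_length, step):
        # Clip the last bin so it does not run past the chromosome end.
        end = min(start + bin_size, chrom_length)
        bins.append((start, end))
    return bins


chrom_length = 10_000_000  # hypothetical 10 Mb chromosome
bins = sampled_bins(chrom_length)
covered = sum(end - start for start, end in bins)
fraction = covered / chrom_length
print(f"{len(bins)} bins, {covered} bases, {fraction:.1%} of the chromosome")
```

Under these assumptions only a small part of each chromosome is inspected with the default settings, which is why the reported numbers are estimates rather than exact counts.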

Example usage: scFilterStats.py -b sample1.bam sample2.bam -bc barcodes.txt > log.txt

Input/Output options

--bamfiles, -b: List of indexed bam files separated by spaces.
--barcodes, -bc: A single-column file containing barcodes (whitelist) to be used for the analysis.
--outFile, -o: The file to write results to. For scFilterStats, scFilterBarcodes and scJSD, the output file is a .txt file. For other tools, the output file is an updated .loom object with the result of the requested operation.

BAM processing options

--cellTag, -ct: Name of the BAM tag from which to extract barcodes.
--numberOfProcessors, -p: Number of processors to use. Type “max/2” to use half the maximum number of processors or “max” to use all available processors. (Default: 1)
--labels, -l: User defined labels instead of default labels from file names. Multiple labels have to be separated by a space, e.g. --labels sample1 sample2 sample3
--smartLabels: Instead of manually specifying labels for the input BAM files, this causes sincei to use the file name after removing the path and extension.
--blackListFileName, -bl: A BED or GTF file containing regions that should be excluded from all analyses. Currently this works by rejecting genomic chunks that happen to overlap an entry. Consequently, for BAM files, if a read partially overlaps a blacklisted region or a fragment spans over it, then the read/fragment might still be considered. Please note that you should adjust the effective genome size, if relevant.
--binSize, -bs: Size of the bins, in bases, to calculate coverage (Default: 100000)
--distanceBetweenBins: The gap distance between bins during counting. Larger numbers can be used to sample the genome for input files with high coverage while smaller values are useful for lower coverage data. Note that if you specify a value that results in too few (<1000) reads sampled, the value will be decreased. (Default: 1000000)

Read Filtering Options

--duplicateFilter

Possible choices: start_bc, start_bc_umi, start_end_bc, start_end_bc_umi

How to identify duplicates. Different combinations (using start/end/umi) are possible. Read start position and read barcode are always considered. By default (None), no duplicate filtering is applied and all reads are considered. Note that for paired-end data, both reads in the fragment are considered (and kept), so if you wish to keep only read 1, combine this option with --samFlagInclude.
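The four choices differ only in which fields make up the key used to decide that two reads are duplicates. A minimal sketch of that idea follows. It is not sincei's code: reads are represented as plain dictionaries with hypothetical field names (`chrom`, `start`, `end`, `barcode`, `umi`), and keeping the first read seen for each key is an assumption.

```python
def duplicate_key(read, mode):
    """Build the tuple that identifies a read for duplicate detection.

    `mode` is one of: start_bc, start_bc_umi, start_end_bc, start_end_bc_umi.
    Start position and barcode are always part of the key.
    """
    parts = mode.split("_")
    key = [read["chrom"], read["start"], read["barcode"]]
    if "end" in parts:
        key.append(read["end"])
    if "umi" in parts:
        key.append(read["umi"])
    return tuple(key)


def drop_duplicates(reads, mode):
    """Keep the first read seen for each key (an assumed tie-breaking rule)."""
    seen = set()
    kept = []
    for read in reads:
        key = duplicate_key(read, mode)
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


# Two reads with the same start and barcode but different UMIs.
reads = [
    {"chrom": "chr1", "start": 100, "end": 150, "barcode": "AAAC", "umi": "GT"},
    {"chrom": "chr1", "start": 100, "end": 150, "barcode": "AAAC", "umi": "CA"},
]
print(len(drop_duplicates(reads, "start_bc")))      # UMI ignored
print(len(drop_duplicates(reads, "start_bc_umi")))  # UMI part of the key
```

With `start_bc` the second read is treated as a duplicate of the first; adding the UMI to the key keeps both.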

--motifFilter, -m

Check whether a given motif is present in the read and the corresponding reference genome. This option checks for the motif at the 5'-end of the read and at the 5'-overhang in the genome, which is useful for identifying reads properly cut by a restriction enzyme or MNase. For example, to search for an "A" at the 5'-end of the read and "TA" at the 5'-overhang, use -m 'A,TA'. Reads not containing the given motif are filtered out.

--genome2bit, -g

If --motifFilter is provided, please also provide the genome sequence (in 2bit format).

--GCcontentFilter, -gc

Check whether the GC content of the read falls within the provided range. Input format must be '<low>,<high>', where <low> is the lower bound and <high> is the upper bound on the fraction of GC (e.g. '0.1,0.9'). Reads whose GC content falls outside the range are filtered out.
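The check described above can be sketched as follows. The helper names are hypothetical, and treating both bounds as inclusive is an assumption, since the option text does not say how reads exactly on a bound are handled.

```python
def parse_gc_range(value):
    """Parse a '<low>,<high>' string such as '0.1,0.9' into two floats."""
    low, high = (float(x) for x in value.split(","))
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError(f"invalid GC range: {value!r}")
    return low, high


def gc_fraction(seq):
    """Fraction of G and C bases in a read sequence (0.0 for an empty read)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def passes_gc_filter(seq, gc_range):
    """True if the read's GC fraction lies within the (inclusive) range."""
    low, high = gc_range
    return low <= gc_fraction(seq) <= high


gc_range = parse_gc_range("0.1,0.9")
print(passes_gc_filter("ACGTACGT", gc_range))  # GC fraction 0.5
print(passes_gc_filter("GGGGCCCC", gc_range))  # GC fraction 1.0
```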

--minAlignedFraction

Minimum fraction of a read that must be aligned for the read to be counted. Mismatches tolerated by the aligner count as aligned, but InDels and clipped bases do not. (Default: None)
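One way to picture this fraction is to compute it from a CIGAR string. The sketch below is an assumption about the definition, not sincei's implementation: it counts M, = and X operations as aligned, and divides by the read length implied by the CIGAR (M, I, S, = and X). Hard-clipped bases are ignored because they are absent from the stored read sequence.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def aligned_fraction(cigar):
    """Fraction of read bases in match/mismatch operations (assumed definition)."""
    aligned = 0
    read_length = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":    # aligned bases, mismatches included
            aligned += n
        if op in "MIS=X":  # operations that consume read bases
            read_length += n
    return aligned / read_length if read_length else 0.0


print(aligned_fraction("100M"))      # fully aligned read
print(aligned_fraction("80M5I15S"))  # insertion and soft clip reduce the fraction
```

A read with CIGAR `80M5I15S` has 80 aligned bases out of 100, so it would pass a threshold of 0.8 but fail one of 0.9 under this definition.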

Read Processing Options

--minMappingQuality

If set, only reads that have a mapping quality score of at least this are considered.

--samFlagInclude

Include reads based on the SAM flag. For example, to get only reads that are the first mate, use a flag of 64. This is useful to count properly paired reads only once, as otherwise the second mate will also be counted. (Default: None)

--samFlagExclude

Exclude reads based on the SAM flag. For example, to get only reads that map to the forward strand, use --samFlagExclude 16, where 16 is the SAM flag for reads that map to the reverse strand. (Default: None)
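Both flag options are bitwise tests on the SAM FLAG field. The sketch below shows the usual semantics (the same as samtools -f and -F): every bit in the include mask must be set, and no bit in the exclude mask may be set. The function name is hypothetical, and the assumption that sincei applies exactly these semantics is based on the metric descriptions above ("at least one missing flag", "undesirable flags").

```python
def passes_flag_filters(flag, include=None, exclude=None):
    """Apply include/exclude masks to a SAM FLAG value."""
    if include is not None and (flag & include) != include:
        return False  # at least one required bit is missing
    if exclude is not None and (flag & exclude) != 0:
        return False  # at least one undesirable bit is set
    return True


FIRST_MATE = 64      # SAM flag bit: first read in pair
REVERSE_STRAND = 16  # SAM flag bit: read maps to the reverse strand

read1_forward = 99   # paired, proper pair, mate reverse, first in pair
read2_reverse = 147  # paired, proper pair, read reverse, second in pair

print(passes_flag_filters(read1_forward, include=FIRST_MATE))
print(passes_flag_filters(read2_reverse, include=FIRST_MATE))
print(passes_flag_filters(read2_reverse, exclude=REVERSE_STRAND))
```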

--filterRNAstrand

Possible choices: forward, reverse

Selects RNA-seq reads (single-end or paired-end) originating from genes on the given strand. This option assumes a standard dUTP-based library preparation (that is, --filterRNAstrand=forward keeps minus-strand reads, which originally came from genes on the forward strand using a dUTP-based method). Consider using --samFlagExclude instead for filtering by strand in other contexts.

Other options

--verbose, -v: Set to see processing messages.
--version: Show the program's version number and exit.