sincei.ReadCounter module#

class sincei.ReadCounter.CountReadsPerBin(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]#

Bases: object

Collects coverage over multiple bam files using multiprocessing

This class uses multiprocessing to compute the read counts (coverage) from several bam files and returns a numpy array with the resulting count matrix in feature x cell format and another numpy array containing the names of the features.

Parameters#

bamFilesListlist: List containing the paths of indexed bam files. E.g. ['file1.bam', 'file2.bam']
binLengthint: Length of the window/bin. This value is overruled by bedFile if present.
barcodeslist: list of barcodes to count the reads from.
numberOfSamplesint: Total number of samples. The genome is divided into numberOfSamples, each with a window/bin length equal to binLength. This value is overruled by stepSize and by bedFile in which case the number of samples are the regions in the bed file.
numberOfProcessorsint: Number of processors to use. Default: 4
verbosebool: Output messages. Default: False
regionstr: Region to limit the computation in the form chrom:start:end. Default: None
bedFilelist: List of file paths corresponding to bed files containing the regions for which to compute the coverage. This option overrules binLength, numberOfSamples and stepSize.
blackListFileNamestr: A string containing a BED file with blacklist regions.
extendReadsbool, int: Whether coverage should be computed for the extended read length (i.e. the region covered by the two mates or the regions expected to be covered by single-reads). If the value is 'int', then then this is interpreted as the fragment length to extend reads that are not paired. For Illumina reads, usual values are around 300. This value can be determined using the peak caller MACS2 or can be approximated by the fragment lengths computed when preparing the library for sequencing. If the value is of the variable is true and not value is given, the fragment size is sampled from the library but only if the library is paired-end. Default: False
minMappingQualityint: Reads of a mapping quality less than the give value are not considered. Default: None
duplicateFilterstr: Type of duplicate filter to use (same start, end position, umi and barcodes. If paired-end, same start-end for mates) are to be excluded. Default: None
chrToSkip: list: List with names of chromosomes that do not want to be included in the coverage computation. This is useful to remove unwanted chromosomes (e.g. 'random' or 'Het').
stepSizeint: the positions for which the coverage is computed are defined as follows: range(start, end, stepSize). Thus, a stepSize of 1, will compute the coverage at each base pair. If the stepSize is equal to the binLength then the coverage is computed for consecutive bins. If stepSize is smaller than the binLength, then bins will overlap.
center_readbool: Determines if reads should be centered with respect to the fragment length.
samFlag_includeint: Extracts only those reads having the SAM flag. For example, to get only reads that are the first mates a samFlag of 64 could be used. Similarly, the samFlag_include can be used to select only reads mapping on the reverse strand or to get only properly paired reads.
samFlag_excludeint: Removes reads that match the SAM flag. For example to get all reads that map to the forward strand a samFlag_exlude 16 should be used. Which translates into exclude all reads that map to the reverse strand.
zerosToNansbool: If true, zero values encountered are transformed to Nans. Default: False
skipZeroOverZerobool: If true, skip bins where all input BAM files have no coverage.
minFragmentLengthint: If greater than 0, fragments below this size are excluded.
maxFragmentLengthint: If greater than 0, fragments above this size are excluded.
minAlignedFractionfloat: fragments where less than the given fraction of bases align are excluded.
motifFilterlist: Only alignments with reads containing the given motif at 5'-end and genome 5'-end are counted.
GCcontentFilterlist: Only alignments with given min and max GC content are counted.
genome2bitstr: 2 bit file for the genome (required if motifFilter is specified).
out_file_for_raw_datastr: File name to save the raw counts computed.
statsListlist: For each BAM file in bamFilesList, the associated per-chromosome statistics returned by openBam.
mappedListlist: For each BAM file in bamFilesList, the number of mapped reads in the file.
bed_and_binboolean: If true AND a bedFile is given, compute coverage of each bin of the given size in each region of bedFile.
sumCoveragePerBinboolean: If true return cumulative coverage per bin, instead of total read counts (for plotFingerPrint).
genomeChunkSizeint: If not None, the length of the genome used for multiprocessing.

Examples#

The test data contains reads for 200 bp.

>>> test = Tester()

The transpose function is used to get a nicer looking output. The first line corresponds to the number of reads per bin in bam file 1.

>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 50, 4)
>>> np.transpose(c.run())
array([[0., 0., 1., 1.],
       [0., 1., 1., 2.]])

run(allArgs=None)[source]#

Run the read counting according to the parameters specified when CountReadsPerBin is initialized.

Returns#

tuple [np.ndarray, np.ndarray]: A tuple containing a numpy array with the resulting count matrix in feature x cell format and another numpy array containing the names of the features.

count_reads_in_region(chrom, start, end, bed_regions_list=None)[source]#

Counts the reads in each bam file at each 'stepSize' position within the interval (start, end) for a window or bin of size binLength.

The stepSize controls the distance between bins. For example, a step size of 20 and a bin size of 20 will create bins next to each other. If the step size is smaller than the bin size the bins will overlap.

If a list of bedRegions is given, then the number of reads that overlaps with each region is counted.

Parameters#

chromstr: Chrom name.
startint: start coordinate.
endint: end coordinate.
barcodes: list: List of barcodes to count (currently set for BAM tag 'BC').
bed_regions_list: list: List of list of tuples of the form (start, end) corresponding to bed regions to be processed. If not bed file was passed to the object constructor then this list is empty.

Returns#

np.ndarray: The result is a numpy array that as rows each bin and as columns each bam file.

Examples#

Initialize some useful values

>>> test = Tester()
>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 25, 0, stepSize=50)

The transpose is used to get better looking numbers. The first line corresponds to the number of reads per bin in the first bamfile.

>>> _array, __ = c.count_reads_in_region(test.chrom, 0, 200)
>>> _array
array([[0., 0.],
       [0., 1.],
       [1., 1.],
       [1., 2.]])

get_coverage_of_region(bamHandle, chrom, regions, fragmentFromRead_func=None)[source]#

Returns a numpy array that corresponds to the number of reads that overlap with each tile.

Parameters#

bamHandlepysam.AlignmentFile: An open pysam.AlignmentFile object representing the BAM file to read from.
chromstr: The name of the chromosome to analyze.
regionslist of tuples: A list of tuples specifying the regions to analyze. Each tuple should contain start and end positions.
fragmentFromRead_funcfunction, optional: A function to extract fragment information from a read. If not provided, the default method get_fragment_from_read is used.

Returns#

np.ndarray: Array of coverage values for each tile.

Examples#

Initialize some useful values

>>> test = Tester()
>>> import pysam
>>> c = CountReadsPerBin([], stepSize=1, extendReads=300)

For this case the reads are length 36. The number of overlapping read fragments is 4 and 5 for the positions tested.

>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000833, 5000834), (5000834, 5000835)])
array([4., 5.])

In the following example a paired read is extended to the fragment length which is 100 The first mate starts at 5000000 and the second at 5000064. Each mate is extended to the fragment length independently At position 500090-500100 one fragment of length 100 overlap, and after position 5000101 there should be zero reads.

>>> c.zerosToNans = True
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000090, 5000100), (5000100, 5000110)])
array([ 1., nan])

In the following case the reads length is 50. Reads are not extended.

>>> c.extendReads=False
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile2), '3R', [(148, 150), (150, 152), (152, 154)])
array([1., 2., 2.])

getReadLength(read)[source]#

Returns the length of the read.

Parameters#

read : pysam read object

Returns#

int: Length of the read.

get_fragment_from_read(read)[source]#

Get read start and end position of a read. If given, the reads are extended as follows: If reads are paired end, each read mate is extended to match the fragment length, otherwise, a default fragment length is used. If reads are split (give by the CIGAR string) then the multiple positions of the read are returned. When reads are extended the cigar information is skipped.

The following values are defined (for forward reads):

     |--          -- read.tlen --              --|
     |-- read.alen --|
-----|===============>------------<==============|----
     |               |            |
read.reference_start
            read.reference_end  read.pnext

  and for reverse reads


     |--             -- read.tlen --           --|
                                 |-- read.alen --|
-----|===============>-----------<===============|----
     |                           |               |
  read.pnext           read.reference_start  read.reference_end

this is a sketch of a pair-end reads

The function returns the fragment start and end, either using the paired end information (if available) or extending the read in the appropriate direction if this is single-end.

Parameters#

read : pysam read object

Returns#

list of tuples: [(fragment start, fragment end)]

Examples#

>>> test = Tester()
>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True)
>>> c.defaultFragmentLength=100
>>> c.get_fragment_from_read(test.getRead("paired-forward"))
[(5000000, 5000100)]
>>> c.get_fragment_from_read(test.getRead("paired-reverse"))
[(5000000, 5000100)]
>>> c.defaultFragmentLength = 200
>>> c.get_fragment_from_read(test.getRead("single-forward"))
[(5001491, 5001691)]
>>> c.get_fragment_from_read(test.getRead("single-reverse"))
[(5001536, 5001736)]
>>> c.defaultFragmentLength = 'read length'
>>> c.get_fragment_from_read(test.getRead("single-forward"))
[(5001491, 5001527)]
>>> c.defaultFragmentLength = 'read length'
>>> c.extendReads = False
>>> c.get_fragment_from_read(test.getRead("paired-forward"))
[(5000000, 5000036)]

Tests for read centering.

>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True, center_read=True)
>>> c.defaultFragmentLength = 100
>>> assert(c.get_fragment_from_read(test.getRead("paired-forward")) == [(5000032, 5000068)])
>>> c.defaultFragmentLength = 200
>>> assert(c.get_fragment_from_read(test.getRead("single-reverse")) == [(5001618, 5001654)])

getSmoothRange(tileIndex, tileSize, smoothRange, maxPosition)[source]#

Given a tile index position and a tile size (length), return the a new indices over a larger range, called the smoothRange. This region is centered in the tileIndex an spans on both sizes to cover the smoothRange. The smoothRange is trimmed in case it is less than zero or greater than maxPosition

---------------|==================|------------------
           tileIndex
      |--------------------------------------|
      |    <--      smoothRange     -->      |
      |
tileIndex - (smoothRange-tileSize)/2

Test for a smooth range that spans 3 tiles.

Parameters#

tileIndexint: Start index.
tileSizeint: Length of the tile.
smoothRangeint: Range over which to smooth.
maxPositionint: Maximum index allowed.

Returns#

tuple: (startIndex, endIndex)

Examples#

>>> c = CountReadsPerBin([], 1, 1, 1, 0)
>>> c.getSmoothRange(5, 1, 3, 10)
(4, 7)

Test smooth range truncated on start.

>>> c.getSmoothRange(0, 10, 30, 200)
(0, 2)

Test smooth range truncated on start.

>>> c.getSmoothRange(1, 10, 30, 4)
(0, 3)

Test smooth range truncated on end.

>>> c.getSmoothRange(5, 1, 3, 5)
(4, 5)

Test smooth range not multiple of tileSize.

>>> c.getSmoothRange(5, 10, 24, 10)
(4, 6)

sincei.ReadCounter.remove_row_of_zeros(matrix)[source]#: Remove all-zeros or all-nan rows from matrix.

sincei.ReadCounter.estimateSizeFactors(m)[source]#

Compute size factors in the same way as DESeq2. The inverse of that is returned, as it's then compatible with bamCoverage.

Parameters#

mnp.ndarray: Dataset with cells in rows and features in columns.

Examples#

>>> m = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 10, 0], [10, 5, 100]])
>>> sf = estimateSizeFactors(m)
>>> assert(np.all(np.abs(sf - [1.305, 0.9932, 0.783]) < 1e-4))
>>> m = np.array([[0, 0], [0, 1], [1, 1], [1, 2]])
>>> sf = estimateSizeFactors(m)
>>> assert(np.all(np.abs(sf - [1.1892, 0.8409]) < 1e-4))

sincei.ReadCounter module

Contents

sincei.ReadCounter module#

Parameters#

Examples#

Returns#

Parameters#

Returns#

Examples#

Parameters#

Returns#

Examples#

Parameters#

Returns#

Parameters#

Returns#

Examples#

Parameters#

Returns#

Examples#

Parameters#

Examples#