sincei.ReadCounter module#

class sincei.ReadCounter.CountReadsPerBin(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]#

Bases: object

Collects coverage over multiple bam files using multiprocessing

This class uses multiprocessing to compute the read counts (coverage) from several bam files and returns a numpy array with the resulting count matrix in feature x cell format and another numpy array containing the names of the features.

Parameters#

bamFilesListlist

List containing the paths of indexed bam files. E.g. ['file1.bam', 'file2.bam']

binLengthint

Length of the window/bin. This value is overruled by bedFile if present.

barcodeslist

list of barcodes to count the reads from.

numberOfSamplesint

Total number of samples. The genome is divided into numberOfSamples, each with a window/bin length equal to binLength. This value is overruled by stepSize and by bedFile in which case the number of samples are the regions in the bed file.

numberOfProcessorsint

Number of processors to use. Default: 4

verbosebool

Output messages. Default: False

regionstr

Region to limit the computation in the form chrom:start:end. Default: None

bedFilelist

List of file paths corresponding to bed files containing the regions for which to compute the coverage. This option overrules binLength, numberOfSamples and stepSize.

blackListFileNamestr

A string containing a BED file with blacklist regions.

extendReadsbool, int

Whether coverage should be computed for the extended read length (i.e. the region covered by the two mates or the regions expected to be covered by single-reads). If the value is 'int', then then this is interpreted as the fragment length to extend reads that are not paired. For Illumina reads, usual values are around 300. This value can be determined using the peak caller MACS2 or can be approximated by the fragment lengths computed when preparing the library for sequencing. If the value is of the variable is true and not value is given, the fragment size is sampled from the library but only if the library is paired-end. Default: False

minMappingQualityint

Reads of a mapping quality less than the give value are not considered. Default: None

duplicateFilterstr

Type of duplicate filter to use (same start, end position, umi and barcodes. If paired-end, same start-end for mates) are to be excluded. Default: None

chrToSkip: list

List with names of chromosomes that do not want to be included in the coverage computation. This is useful to remove unwanted chromosomes (e.g. 'random' or 'Het').

stepSizeint

the positions for which the coverage is computed are defined as follows: range(start, end, stepSize). Thus, a stepSize of 1, will compute the coverage at each base pair. If the stepSize is equal to the binLength then the coverage is computed for consecutive bins. If stepSize is smaller than the binLength, then bins will overlap.

center_readbool

Determines if reads should be centered with respect to the fragment length.

samFlag_includeint

Extracts only those reads having the SAM flag. For example, to get only reads that are the first mates a samFlag of 64 could be used. Similarly, the samFlag_include can be used to select only reads mapping on the reverse strand or to get only properly paired reads.

samFlag_excludeint

Removes reads that match the SAM flag. For example to get all reads that map to the forward strand a samFlag_exlude 16 should be used. Which translates into exclude all reads that map to the reverse strand.

zerosToNansbool

If true, zero values encountered are transformed to Nans. Default: False

skipZeroOverZerobool

If true, skip bins where all input BAM files have no coverage.

minFragmentLengthint

If greater than 0, fragments below this size are excluded.

maxFragmentLengthint

If greater than 0, fragments above this size are excluded.

minAlignedFractionfloat

fragments where less than the given fraction of bases align are excluded.

motifFilterlist

Only alignments with reads containing the given motif at 5'-end and genome 5'-end are counted.

GCcontentFilterlist

Only alignments with given min and max GC content are counted.

genome2bitstr

2 bit file for the genome (required if motifFilter is specified).

out_file_for_raw_datastr

File name to save the raw counts computed.

statsListlist

For each BAM file in bamFilesList, the associated per-chromosome statistics returned by openBam.

mappedListlist

For each BAM file in bamFilesList, the number of mapped reads in the file.

bed_and_binboolean

If true AND a bedFile is given, compute coverage of each bin of the given size in each region of bedFile.

sumCoveragePerBinboolean

If true return cumulative coverage per bin, instead of total read counts (for plotFingerPrint).

genomeChunkSizeint

If not None, the length of the genome used for multiprocessing.

Examples#

The test data contains reads for 200 bp.

>>> test = Tester()

The transpose function is used to get a nicer looking output. The first line corresponds to the number of reads per bin in bam file 1.

>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 50, 4)
>>> np.transpose(c.run())
array([[0., 0., 1., 1.],
       [0., 1., 1., 2.]])
run(allArgs=None)[source]#

Run the read counting according to the parameters specified when CountReadsPerBin is initialized.

Returns#

tuple [np.ndarray, np.ndarray]

A tuple containing a numpy array with the resulting count matrix in feature x cell format and another numpy array containing the names of the features.

count_reads_in_region(chrom, start, end, bed_regions_list=None)[source]#

Counts the reads in each bam file at each 'stepSize' position within the interval (start, end) for a window or bin of size binLength.

The stepSize controls the distance between bins. For example, a step size of 20 and a bin size of 20 will create bins next to each other. If the step size is smaller than the bin size the bins will overlap.

If a list of bedRegions is given, then the number of reads that overlaps with each region is counted.

Parameters#

chromstr

Chrom name.

startint

start coordinate.

endint

end coordinate.

barcodes: list

List of barcodes to count (currently set for BAM tag 'BC').

bed_regions_list: list

List of list of tuples of the form (start, end) corresponding to bed regions to be processed. If not bed file was passed to the object constructor then this list is empty.

Returns#

np.ndarray

The result is a numpy array that as rows each bin and as columns each bam file.

Examples#

Initialize some useful values

>>> test = Tester()
>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 25, 0, stepSize=50)

The transpose is used to get better looking numbers. The first line corresponds to the number of reads per bin in the first bamfile.

>>> _array, __ = c.count_reads_in_region(test.chrom, 0, 200)
>>> _array
array([[0., 0.],
       [0., 1.],
       [1., 1.],
       [1., 2.]])
get_coverage_of_region(bamHandle, chrom, regions, fragmentFromRead_func=None)[source]#

Returns a numpy array that corresponds to the number of reads that overlap with each tile.

Parameters#

bamHandlepysam.AlignmentFile

An open pysam.AlignmentFile object representing the BAM file to read from.

chromstr

The name of the chromosome to analyze.

regionslist of tuples

A list of tuples specifying the regions to analyze. Each tuple should contain start and end positions.

fragmentFromRead_funcfunction, optional

A function to extract fragment information from a read. If not provided, the default method get_fragment_from_read is used.

Returns#

np.ndarray

Array of coverage values for each tile.

Examples#

Initialize some useful values

>>> test = Tester()
>>> import pysam
>>> c = CountReadsPerBin([], stepSize=1, extendReads=300)

For this case the reads are length 36. The number of overlapping read fragments is 4 and 5 for the positions tested.

>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000833, 5000834), (5000834, 5000835)])
array([4., 5.])

In the following example a paired read is extended to the fragment length which is 100 The first mate starts at 5000000 and the second at 5000064. Each mate is extended to the fragment length independently At position 500090-500100 one fragment of length 100 overlap, and after position 5000101 there should be zero reads.

>>> c.zerosToNans = True
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000090, 5000100), (5000100, 5000110)])
array([ 1., nan])

In the following case the reads length is 50. Reads are not extended.

>>> c.extendReads=False
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile2), '3R', [(148, 150), (150, 152), (152, 154)])
array([1., 2., 2.])
getReadLength(read)[source]#

Returns the length of the read.

Parameters#

read : pysam read object

Returns#

int

Length of the read.

get_fragment_from_read(read)[source]#

Get read start and end position of a read. If given, the reads are extended as follows: If reads are paired end, each read mate is extended to match the fragment length, otherwise, a default fragment length is used. If reads are split (give by the CIGAR string) then the multiple positions of the read are returned. When reads are extended the cigar information is skipped.

The following values are defined (for forward reads):

     |--          -- read.tlen --              --|
     |-- read.alen --|
-----|===============>------------<==============|----
     |               |            |
read.reference_start
            read.reference_end  read.pnext

  and for reverse reads


     |--             -- read.tlen --           --|
                                 |-- read.alen --|
-----|===============>-----------<===============|----
     |                           |               |
  read.pnext           read.reference_start  read.reference_end

this is a sketch of a pair-end reads

The function returns the fragment start and end, either using the paired end information (if available) or extending the read in the appropriate direction if this is single-end.

Parameters#

read : pysam read object

Returns#

list of tuples

[(fragment start, fragment end)]

Examples#

>>> test = Tester()
>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True)
>>> c.defaultFragmentLength=100
>>> c.get_fragment_from_read(test.getRead("paired-forward"))
[(5000000, 5000100)]
>>> c.get_fragment_from_read(test.getRead("paired-reverse"))
[(5000000, 5000100)]
>>> c.defaultFragmentLength = 200
>>> c.get_fragment_from_read(test.getRead("single-forward"))
[(5001491, 5001691)]
>>> c.get_fragment_from_read(test.getRead("single-reverse"))
[(5001536, 5001736)]
>>> c.defaultFragmentLength = 'read length'
>>> c.get_fragment_from_read(test.getRead("single-forward"))
[(5001491, 5001527)]
>>> c.defaultFragmentLength = 'read length'
>>> c.extendReads = False
>>> c.get_fragment_from_read(test.getRead("paired-forward"))
[(5000000, 5000036)]

Tests for read centering.

>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True, center_read=True)
>>> c.defaultFragmentLength = 100
>>> assert(c.get_fragment_from_read(test.getRead("paired-forward")) == [(5000032, 5000068)])
>>> c.defaultFragmentLength = 200
>>> assert(c.get_fragment_from_read(test.getRead("single-reverse")) == [(5001618, 5001654)])
getSmoothRange(tileIndex, tileSize, smoothRange, maxPosition)[source]#

Given a tile index position and a tile size (length), return the a new indices over a larger range, called the smoothRange. This region is centered in the tileIndex an spans on both sizes to cover the smoothRange. The smoothRange is trimmed in case it is less than zero or greater than maxPosition

---------------|==================|------------------
           tileIndex
      |--------------------------------------|
      |    <--      smoothRange     -->      |
      |
tileIndex - (smoothRange-tileSize)/2

Test for a smooth range that spans 3 tiles.

Parameters#

tileIndexint

Start index.

tileSizeint

Length of the tile.

smoothRangeint

Range over which to smooth.

maxPositionint

Maximum index allowed.

Returns#

tuple

(startIndex, endIndex)

Examples#

>>> c = CountReadsPerBin([], 1, 1, 1, 0)
>>> c.getSmoothRange(5, 1, 3, 10)
(4, 7)

Test smooth range truncated on start.

>>> c.getSmoothRange(0, 10, 30, 200)
(0, 2)

Test smooth range truncated on start.

>>> c.getSmoothRange(1, 10, 30, 4)
(0, 3)

Test smooth range truncated on end.

>>> c.getSmoothRange(5, 1, 3, 5)
(4, 5)

Test smooth range not multiple of tileSize.

>>> c.getSmoothRange(5, 10, 24, 10)
(4, 6)
sincei.ReadCounter.remove_row_of_zeros(matrix)[source]#

Remove all-zeros or all-nan rows from matrix.

sincei.ReadCounter.estimateSizeFactors(m)[source]#

Compute size factors in the same way as DESeq2. The inverse of that is returned, as it's then compatible with bamCoverage.

Parameters#

mnp.ndarray

Dataset with cells in rows and features in columns.

Examples#

>>> m = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 10, 0], [10, 5, 100]])
>>> sf = estimateSizeFactors(m)
>>> assert(np.all(np.abs(sf - [1.305, 0.9932, 0.783]) < 1e-4))
>>> m = np.array([[0, 0], [0, 1], [1, 1], [1, 2]])
>>> sf = estimateSizeFactors(m)
>>> assert(np.all(np.abs(sf - [1.1892, 0.8409]) < 1e-4))