sincei.ReadCounter module#
- class sincei.ReadCounter.CountReadsPerBin(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]#
Bases:
objectCollects coverage over multiple bam files using multiprocessing
This class uses multiprocessing to compute the read counts (coverage) from several bam files and returns a numpy array with the resulting count matrix in feature x cell format and another numpy array containing the names of the features.
Parameters#
- bamFilesListlist
List containing the paths of indexed bam files. E.g. ['file1.bam', 'file2.bam']
- binLengthint
Length of the window/bin. This value is overruled by
bedFileif present.- barcodeslist
list of barcodes to count the reads from.
- numberOfSamplesint
Total number of samples. The genome is divided into
numberOfSamples, each with a window/bin length equal tobinLength. This value is overruled bystepSizeand bybedFilein which case the number of samples are the regions in the bed file.- numberOfProcessorsint
Number of processors to use. Default: 4
- verbosebool
Output messages. Default: False
- regionstr
Region to limit the computation in the form chrom:start:end. Default: None
- bedFilelist
List of file paths corresponding to bed files containing the regions for which to compute the coverage. This option overrules
binLength,numberOfSamplesandstepSize.- blackListFileNamestr
A string containing a BED file with blacklist regions.
- extendReadsbool, int
Whether coverage should be computed for the extended read length (i.e. the region covered by the two mates or the regions expected to be covered by single-reads). If the value is 'int', then then this is interpreted as the fragment length to extend reads that are not paired. For Illumina reads, usual values are around 300. This value can be determined using the peak caller MACS2 or can be approximated by the fragment lengths computed when preparing the library for sequencing. If the value is of the variable is true and not value is given, the fragment size is sampled from the library but only if the library is paired-end. Default: False
- minMappingQualityint
Reads of a mapping quality less than the give value are not considered. Default: None
- duplicateFilterstr
Type of duplicate filter to use (same start, end position, umi and barcodes. If paired-end, same start-end for mates) are to be excluded. Default: None
- chrToSkip: list
List with names of chromosomes that do not want to be included in the coverage computation. This is useful to remove unwanted chromosomes (e.g. 'random' or 'Het').
- stepSizeint
the positions for which the coverage is computed are defined as follows:
range(start, end, stepSize). Thus, a stepSize of 1, will compute the coverage at each base pair. If the stepSize is equal to the binLength then the coverage is computed for consecutive bins. If stepSize is smaller than the binLength, then bins will overlap.- center_readbool
Determines if reads should be centered with respect to the fragment length.
- samFlag_includeint
Extracts only those reads having the SAM flag. For example, to get only reads that are the first mates a samFlag of 64 could be used. Similarly, the samFlag_include can be used to select only reads mapping on the reverse strand or to get only properly paired reads.
- samFlag_excludeint
Removes reads that match the SAM flag. For example to get all reads that map to the forward strand a samFlag_exlude 16 should be used. Which translates into exclude all reads that map to the reverse strand.
- zerosToNansbool
If true, zero values encountered are transformed to Nans. Default: False
- skipZeroOverZerobool
If true, skip bins where all input BAM files have no coverage.
- minFragmentLengthint
If greater than 0, fragments below this size are excluded.
- maxFragmentLengthint
If greater than 0, fragments above this size are excluded.
- minAlignedFractionfloat
fragments where less than the given fraction of bases align are excluded.
- motifFilterlist
Only alignments with reads containing the given motif at 5'-end and genome 5'-end are counted.
- GCcontentFilterlist
Only alignments with given min and max GC content are counted.
- genome2bitstr
2 bit file for the genome (required if motifFilter is specified).
- out_file_for_raw_datastr
File name to save the raw counts computed.
- statsListlist
For each BAM file in bamFilesList, the associated per-chromosome statistics returned by openBam.
- mappedListlist
For each BAM file in bamFilesList, the number of mapped reads in the file.
- bed_and_binboolean
If true AND a bedFile is given, compute coverage of each bin of the given size in each region of bedFile.
- sumCoveragePerBinboolean
If true return cumulative coverage per bin, instead of total read counts (for plotFingerPrint).
- genomeChunkSizeint
If not None, the length of the genome used for multiprocessing.
Examples#
The test data contains reads for 200 bp.
>>> test = Tester()
The transpose function is used to get a nicer looking output. The first line corresponds to the number of reads per bin in bam file 1.
>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 50, 4) >>> np.transpose(c.run()) array([[0., 0., 1., 1.], [0., 1., 1., 2.]])
- run(allArgs=None)[source]#
Run the read counting according to the parameters specified when CountReadsPerBin is initialized.
Returns#
- tuple [np.ndarray, np.ndarray]
A tuple containing a numpy array with the resulting count matrix in feature x cell format and another numpy array containing the names of the features.
- count_reads_in_region(chrom, start, end, bed_regions_list=None)[source]#
Counts the reads in each bam file at each 'stepSize' position within the interval (start, end) for a window or bin of size binLength.
The stepSize controls the distance between bins. For example, a step size of 20 and a bin size of 20 will create bins next to each other. If the step size is smaller than the bin size the bins will overlap.
If a list of bedRegions is given, then the number of reads that overlaps with each region is counted.
Parameters#
- chromstr
Chrom name.
- startint
start coordinate.
- endint
end coordinate.
- barcodes: list
List of barcodes to count (currently set for BAM tag 'BC').
- bed_regions_list: list
List of list of tuples of the form (start, end) corresponding to bed regions to be processed. If not bed file was passed to the object constructor then this list is empty.
Returns#
- np.ndarray
The result is a numpy array that as rows each bin and as columns each bam file.
Examples#
Initialize some useful values
>>> test = Tester() >>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 25, 0, stepSize=50)
The transpose is used to get better looking numbers. The first line corresponds to the number of reads per bin in the first bamfile.
>>> _array, __ = c.count_reads_in_region(test.chrom, 0, 200) >>> _array array([[0., 0.], [0., 1.], [1., 1.], [1., 2.]])
- get_coverage_of_region(bamHandle, chrom, regions, fragmentFromRead_func=None)[source]#
Returns a numpy array that corresponds to the number of reads that overlap with each tile.
Parameters#
- bamHandlepysam.AlignmentFile
An open pysam.AlignmentFile object representing the BAM file to read from.
- chromstr
The name of the chromosome to analyze.
- regionslist of tuples
A list of tuples specifying the regions to analyze. Each tuple should contain start and end positions.
- fragmentFromRead_funcfunction, optional
A function to extract fragment information from a read. If not provided, the default method
get_fragment_from_readis used.
Returns#
- np.ndarray
Array of coverage values for each tile.
Examples#
Initialize some useful values
>>> test = Tester() >>> import pysam >>> c = CountReadsPerBin([], stepSize=1, extendReads=300)
For this case the reads are length 36. The number of overlapping read fragments is 4 and 5 for the positions tested.
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2', ... [(5000833, 5000834), (5000834, 5000835)]) array([4., 5.])
In the following example a paired read is extended to the fragment length which is 100 The first mate starts at 5000000 and the second at 5000064. Each mate is extended to the fragment length independently At position 500090-500100 one fragment of length 100 overlap, and after position 5000101 there should be zero reads.
>>> c.zerosToNans = True >>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2', ... [(5000090, 5000100), (5000100, 5000110)]) array([ 1., nan])
In the following case the reads length is 50. Reads are not extended.
>>> c.extendReads=False >>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile2), '3R', [(148, 150), (150, 152), (152, 154)]) array([1., 2., 2.])
- getReadLength(read)[source]#
Returns the length of the read.
Parameters#
read : pysam read object
Returns#
- int
Length of the read.
- get_fragment_from_read(read)[source]#
Get read start and end position of a read. If given, the reads are extended as follows: If reads are paired end, each read mate is extended to match the fragment length, otherwise, a default fragment length is used. If reads are split (give by the CIGAR string) then the multiple positions of the read are returned. When reads are extended the cigar information is skipped.
The following values are defined (for forward reads):
|-- -- read.tlen -- --| |-- read.alen --| -----|===============>------------<==============|---- | | | read.reference_start read.reference_end read.pnext and for reverse reads |-- -- read.tlen -- --| |-- read.alen --| -----|===============>-----------<===============|---- | | | read.pnext read.reference_start read.reference_end
this is a sketch of a pair-end reads
The function returns the fragment start and end, either using the paired end information (if available) or extending the read in the appropriate direction if this is single-end.
Parameters#
read : pysam read object
Returns#
- list of tuples
[(fragment start, fragment end)]
Examples#
>>> test = Tester() >>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True) >>> c.defaultFragmentLength=100 >>> c.get_fragment_from_read(test.getRead("paired-forward")) [(5000000, 5000100)] >>> c.get_fragment_from_read(test.getRead("paired-reverse")) [(5000000, 5000100)] >>> c.defaultFragmentLength = 200 >>> c.get_fragment_from_read(test.getRead("single-forward")) [(5001491, 5001691)] >>> c.get_fragment_from_read(test.getRead("single-reverse")) [(5001536, 5001736)] >>> c.defaultFragmentLength = 'read length' >>> c.get_fragment_from_read(test.getRead("single-forward")) [(5001491, 5001527)] >>> c.defaultFragmentLength = 'read length' >>> c.extendReads = False >>> c.get_fragment_from_read(test.getRead("paired-forward")) [(5000000, 5000036)]
Tests for read centering.
>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True, center_read=True) >>> c.defaultFragmentLength = 100 >>> assert(c.get_fragment_from_read(test.getRead("paired-forward")) == [(5000032, 5000068)]) >>> c.defaultFragmentLength = 200 >>> assert(c.get_fragment_from_read(test.getRead("single-reverse")) == [(5001618, 5001654)])
- getSmoothRange(tileIndex, tileSize, smoothRange, maxPosition)[source]#
Given a tile index position and a tile size (length), return the a new indices over a larger range, called the smoothRange. This region is centered in the tileIndex an spans on both sizes to cover the smoothRange. The smoothRange is trimmed in case it is less than zero or greater than maxPosition
---------------|==================|------------------ tileIndex |--------------------------------------| | <-- smoothRange --> | | tileIndex - (smoothRange-tileSize)/2
Test for a smooth range that spans 3 tiles.
Parameters#
- tileIndexint
Start index.
- tileSizeint
Length of the tile.
- smoothRangeint
Range over which to smooth.
- maxPositionint
Maximum index allowed.
Returns#
- tuple
(startIndex, endIndex)
Examples#
>>> c = CountReadsPerBin([], 1, 1, 1, 0) >>> c.getSmoothRange(5, 1, 3, 10) (4, 7)
Test smooth range truncated on start.
>>> c.getSmoothRange(0, 10, 30, 200) (0, 2)
Test smooth range truncated on start.
>>> c.getSmoothRange(1, 10, 30, 4) (0, 3)
Test smooth range truncated on end.
>>> c.getSmoothRange(5, 1, 3, 5) (4, 5)
Test smooth range not multiple of tileSize.
>>> c.getSmoothRange(5, 10, 24, 10) (4, 6)
- sincei.ReadCounter.remove_row_of_zeros(matrix)[source]#
Remove all-zeros or all-nan rows from matrix.
- sincei.ReadCounter.estimateSizeFactors(m)[source]#
Compute size factors in the same way as DESeq2. The inverse of that is returned, as it's then compatible with bamCoverage.
Parameters#
- mnp.ndarray
Dataset with cells in rows and features in columns.
Examples#
>>> m = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 10, 0], [10, 5, 100]]) >>> sf = estimateSizeFactors(m) >>> assert(np.all(np.abs(sf - [1.305, 0.9932, 0.783]) < 1e-4)) >>> m = np.array([[0, 0], [0, 1], [1, 1], [1, 2]]) >>> sf = estimateSizeFactors(m) >>> assert(np.all(np.abs(sf - [1.1892, 0.8409]) < 1e-4))