sincei package
Submodules
sincei.ExponentialFamily module
- class sincei.ExponentialFamily.Bernoulli(family_params=None, **kwargs)[source]
Bases:
ExponentialFamily
Bernoulli distribution
- family_params of interest:
“max_val” (int) corresponding to the max value (replaces infinity).
Empirically, values above 10 yield similar results.
- class sincei.ExponentialFamily.Beta(family_params=None, **kwargs)[source]
Bases:
ExponentialFamily
Beta distribution, using a standard formulation.
Original formulation presented in [Mourragui et al, 2023].
- family_params of interest:
“min_val” (int): min data value (replaces 0 and 1).
“n_jobs” (int): number of jobs, specifically
for computing the “nu” parameter. - “method” (str): method use to compute the “nu” parameter per feature. Two possibles: “MLE” and “MM”. Default to “MLE”. - “eps” (float): minimum difference used for inverting the g function. Default to 1e-4 - “maxiter” (int): maximum number of iterations for the inversion of the g function. Default to 100.
- class sincei.ExponentialFamily.ExponentialFamily(family_params=None, **kwargs)[source]
Bases:
object
Encodes an exponential family distribution using PyTorch autodiff structures.
ExponentialFamily corresponds to the superclass providing a backbone for all exponential family. Each subclass should contain the following methods, defined based on the distribution of choice (same notation as [Mourragui et al, 2023]):
sufficient_statistics ($T$),
natural_parametrization ($eta$),
log_partition ($A$).
invert_g ($g^{-1}$).
initialize_family_parameters: computes parameters used in other
methods, e.g., gene-level dispersion for Negative Binomial.
We added a “base_measure” for sake of completeness, but this method is not necessary for running GLM-PCA. The log-likelihood and exponential term are defined directly from the aforementionned methods.
Parameters
- family_nameint
Name of the family.
- initialize_family_parameters(X: Optional[Tensor] = None)[source]
General method to initialize certain parameters (e.g. for Beta or Negative Binomial
- class sincei.ExponentialFamily.Gamma(family_params=None, **kwargs)[source]
Bases:
ExponentialFamily
Gamma distribution using a standard formulation.
Original formulation presented in [Mourragui et al, 2023].
- family_params of interest:
“min_val” (int): min data value, default to 1e-5.
“max_val” (int): max data value, default to 10e6.
“n_jobs” (int): number of jobs, specifically
for computing the “nu” parameter. - “method” (str): method use to compute the “nu” parameter per feature. Two possibles: “MLE” and “MM”. Default to “MLE”. - “eps” (float): minimum difference used for inverting the g function. Default to 1e-4 - “maxiter” (int): maximum number of iterations for the inversion of the g function. Default to 100.
- class sincei.ExponentialFamily.Gaussian(family_params=None, **kwargs)[source]
Bases:
ExponentialFamily
Gaussian with standard deviation one.
GLMPCA with Gaussian as family is equivalent to the standard PCA.
- class sincei.ExponentialFamily.LogNormal(family_params=None, **kwargs)[source]
Bases:
ExponentialFamily
Log-normal distribution using a standard formulation.
Original formulation presented in [Mourragui et al, 2023].
- family_params of interest:
“min_val” (int): min data value, default to 1e-5.
“max_val” (int): max data value, default to 10e6.
“n_jobs” (int): number of jobs, specifically
for computing the “nu” parameter. - “method” (str): method use to compute the “nu” parameter per feature. Two possibles: “MLE” and “MM”. Default to “MLE”. - “eps” (float): minimum difference used for inverting the g function. Default to 1e-4 - “maxiter” (int): maximum number of iterations for the inversion of the g function. Default to 100.
- class sincei.ExponentialFamily.Poisson(family_params=None, **kwargs)[source]
Bases:
ExponentialFamily
Poisson distribution
- family_params of interest:
“min_val” (int) corresponding to the min value (replaces 0).
- class sincei.ExponentialFamily.SigmoidBeta(family_params=None, **kwargs)[source]
Bases:
Beta
Beta distribution re-parametrized using a Sigmoid.
This distribution is similar to the previous Beta (which it inherits from) but the natural parameter is re-parametrized using a Sigmoid. This is shown expeerimentally to stabilize the optimisation by removing the ]0,1[ constraint.
- family_params of interest:
“min_val” (int): min data value (replaces 0 and 1).
“n_jobs” (int): number of jobs, specifically
for computing the “nu” parameter. - “method” (str): method use to compute the “nu” parameter per feature. Two possibles: “MLE” and “MM”. Default to “MLE”. - “eps” (float): minimum difference used for inverting the g function. Default to 1e-4 - “maxiter” (int): maximum number of iterations for the inversion of the g function. Default to 100.
sincei.FragmentFFT module
- sincei.FragmentFFT.fftplot(outdict, plot)[source]
Computes the periodicity of the fragment distribution
Parameters
- outdictdict
Dictionary containing the fragment distribution for each barcode
- plotstr
File name to save the plot
Returns
- dict
Dictionary containing the periodicity of the fragment distribution for each barcode
Examples
>>> test = Tester() >>> d = test.get_fragment_distribution() >>> fftplot(d, 'test.png')
- sincei.FragmentFFT.ffttable(selected)[source]
Computes the FFT of the fragment length distribution
Parameters
- selectednumpy array
Array of fragment lengths
Returns
- pandas dataframe
Dataframe with two columns. The first column contains the frequencies and the second column the power spectrum.
Examples
>>> test = Tester() >>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 50, 4) >>> num_reads_per_bin, regionList = c.run() >>> selected = num_reads_per_bin[:,0] >>> ffttable(selected) freq value 0 0.0 7.812500e+06 1 1.0 1.812500e+06 2 2.0 1.812500e+06 3 3.0 1.812500e+06 4 4.0 1.812500e+06 5 5.0 1.812500
- sincei.FragmentFFT.fragment_distribution(fragment_len_dict, length_plot)[source]
Plot fragment length distribution
Parameters
- fragment_len_dictdict
Dictionary containing the fragment length for each barcode.
- length_plotstr
File name to save the plot.
Returns
- dict
Dictionary containing the top 20 frequencies for each barcode.
Examples
>>> test = Tester() >>> fragment_len_dict = {'AAACCTGAGAGGTTCT': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000], ... 'AAACCTGAGAGGTTCT': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]} >>> fragment_distribution(fragment_len_dict, test.length_plot)
sincei.GLMPCA module
- class sincei.GLMPCA.GLMPCA(n_pc, family='gaussian', family_params=None, max_iter=100, learning_rate=0.2, batch_size=256, step_size=20, gamma=0.5, n_init=1, init='spectral', n_jobs=1)[source]
Bases:
object
Performs GLM-PCA on a data matrix to reduce dimensionality
This class computes the generalized-linear model principal components (GLM-PC) of a dataset by exploiting the framework of saturated parameters. Specifically, given an exponential family distribution choosen following some prior knowledge, GLM-PCA will find a collection of directions which maximise the reconstruction error, computed as the negative log-likelihood of the chosen exponential family.
By making use of an alternative formulation, our implementation can exploit automatic differentiation and can therefore rely on mini-batch Stochastic Gradient Descent. As a consequence, it scales to very large dataset.
Another interesting feature of our implementation is that it does not require cumbersome Lagrangian derivations. If you wish to test an exponential family distribution not present in our implementation, adding a class in ExponentialFamily with the different density functionals defined there would suffice to use GLM-PCA.
Parameters
- n_pcint
Number of principal components.
- family: str
Name of the exponential family distribution. Possible families: “gaussian”, “poisson”, “bernoulli”, “beta”, “gamma”, “log_normal”, “log_beta”:, “sigmoid_beta”. Default to “gaussian”
- family_paramsdict
Dictionary with family parameters to be added. List of parameters depend on the specific ExponentialFamily class chosen. Examples:
“n_jobs” (int) for parallelization, specifically for “beta” and “gamma”.
“min_val” (float) for truncating in “poisson” or “beta”.
“eps” (float) for convergence in inverse computation in “beta”.
Default to None.
- max_iterint
Maximum number of epochs in the GLM-PCA optimisation. Default to 100.
- learning_rate: float
Learning rate to be used in the GLM-PCA optimisation. If learning_rate is too high and lead to NaN, our implementation automatically restarts the optimisation with a smaller value. Default to 0.2.
- batch_sizeint
Size of the batch in the SGD optimisation step. Default to 256.
- step_size: int
Step size in optimiser scheduler. See more: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html Default to 20.
- gamma: int
Reduction parameter for optimiser scheduler. See more: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html Default to 0.5
- n_init: int
Number of GLM-PCA initializations. Useful if you want to explore different random seeds and starting points. Default to 1.
- init: str
Method to initialize loadings. “spectral” performs SVD on the saturated parameters from a small random batch of the dataset, “random” performs a random initialization on the Stiefel manifold. Default to “spectral”.
- n_jobs: int
Number of jobs used in parallel operations. Default to 1.
sincei.GetStats module
- sincei.GetStats.getStats_worker(arglist)[source]
Computes statistics for each read in a bam file
This function computes statistics for each read in a bam file.
Parameters
- bamfileslist
List containing the names of indexed bam files. E.g. [‘file1.bam’, ‘file2.bam’]
- binSizeint
Length of the window/bin. This value is overruled by
bedFile
if present.- barcodeslist
list of barcodes to count the reads from.
- numberOfSamplesint
Total number of samples. The genome is divided into
numberOfSamples
, each with a window/bin length equal tobinLength
. This value is overruled bystepSize
in case such value is present and bybedFile
in which case the number of samples and bins are defined in the bed file- numberOfProcessorsint
Number of processors to use. Default is 4
- verbosebool
Output messages. Default: False
- regionstr
Region to limit the computation in the form chrom:start
sincei.ParserCommon module
- sincei.ParserCommon.bamOptions(args=None, suppress_args=None, default_opts=None)[source]
Arguments for tools that operate on BAM files
- sincei.ParserCommon.inputOutputOptions(args=None, opts=None, requiredOpts=[], suppress_args=None)[source]
- sincei.ParserCommon.readOptions(args=None, suppress_args=None)[source]
Common arguments related to BAM files and the interpretation of the read coverage
sincei.ReadCounter module
- class sincei.ReadCounter.CountReadsPerBin(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]
Bases:
object
Collects coverage over multiple bam files using multiprocessing
This function collects read counts (coverage) from several bam files and returns an numpy array with the results. This class uses multiprocessing to compute the coverage.
Parameters
- bamFilesListlist
List containing the names of indexed bam files. E.g. [‘file1.bam’, ‘file2.bam’]
- binLengthint
Length of the window/bin. This value is overruled by
bedFile
if present.- barcodeslist
list of barcodes to count the reads from.
- numberOfSamplesint
Total number of samples. The genome is divided into
numberOfSamples
, each with a window/bin length equal tobinLength
. This value is overruled bystepSize
in case such value is present and bybedFile
in which case the number of samples and bins are defined in the bed file- numberOfProcessorsint
Number of processors to use. Default is 4
- verbosebool
Output messages. Default: False
- regionstr
Region to limit the computation in the form chrom:start:end.
- bedFilelist of file_handles.
Each file handle corresponds to a bed file containing the regions for which to compute the coverage. This option overrules
binLength
,numberOfSamples
andstepSize
.- blackListFileNamestr
A string containing a BED file with blacklist regions.
extendReads : bool, int
Whether coverage should be computed for the extended read length (i.e. the region covered by the two mates or the regions expected to be covered by single-reads). If the value is ‘int’, then then this is interpreted as the fragment length to extend reads that are not paired. For Illumina reads, usual values are around 300. This value can be determined using the peak caller MACS2 or can be approximated by the fragment lengths computed when preparing the library for sequencing. If the value is of the variable is true and not value is given, the fragment size is sampled from the library but only if the library is paired-end. Default: False
- minMappingQualityint
Reads of a mapping quality less than the give value are not considered. Default: None
- duplicateFilterstr
Type of duplicate filter to use (same start, end position, umi and barcodes. If paired-end, same start-end for mates) are to be excluded. Default: None
- chrToSkip: list
List with names of chromosomes that do not want to be included in the coverage computation. This is useful to remove unwanted chromosomes (e.g. ‘random’ or ‘Het’).
- stepSizeint
the positions for which the coverage is computed are defined as follows:
range(start, end, stepSize)
. Thus, a stepSize of 1, will compute the coverage at each base pair. If the stepSize is equal to the binLength then the coverage is computed for consecutive bins. If seepSize is smaller than the binLength, then teh bins will overlap.- center_readbool
Determines if reads should be centered with respect to the fragment length.
- samFlag_includeint
Extracts only those reads having the SAM flag. For example, to get only reads that are the first mates a samFlag of 64 could be used. Similarly, the samFlag_include can be used to select only reads mapping on the reverse strand or to get only properly paired reads.
- samFlag_excludeint
Removes reads that match the SAM flag. For example to get all reads that map to the forward strand a samFlag_exlude 16 should be used. Which translates into exclude all reads that map to the reverse strand.
- zerosToNansbool
If true, zero values encountered are transformed to Nans. Default false.
- skipZeroOverZerobool
If true, skip bins where all input BAM files have no coverage (only applicable to bamCompare).
- minFragmentLengthint
If greater than 0, fragments below this size are excluded.
- maxFragmentLengthint
If greater than 0, fragments above this size are excluded.
- minAlignedFractionfloat
fragments where less than the given fraction of bases align are excluded.
- motifFilterlist
Only alignments with reads containing the given motif at 5’-end and genome 5’-end are counted.
- GCcontentFilterlist
Only alignments with given min and max GC content are counted.
- genome2bitstr
2 bit file for the genome (if motifFilter is specified)
- out_file_for_raw_datastr
File name to save the raw counts computed
- statsListlist
For each BAM file in bamFilesList, the associated per-chromosome statistics returned by openBam
- mappedListlist
For each BAM file in bamFilesList, the number of mapped reads in the file.
- bed_and_binboolean
If true AND a bedFile is given, compute coverage of each bin of the given size in each region of bedFile
- sumCoveragePerBinboolean
If true return cumulative coverage per bin, instead of total read counts (for plotFingerPrint)
- genomeChunkSizeint
If not None, the length of the genome used for multiprocessing.
Returns
numpy array
Each row correspond to each bin/bed region and each column correspond to each of the bamFiles.
Examples
The test data contains reads for 200 bp.
>>> test = Tester()
The transpose function is used to get a nicer looking output. The first line corresponds to the number of reads per bin in bam file 1
>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 50, 4) >>> np.transpose(c.run()) array([[0., 0., 1., 1.], [0., 1., 1., 2.]])
- count_reads_in_region(chrom, start, end, bed_regions_list=None)[source]
Counts the reads in each bam file at each ‘stepSize’ position within the interval (start, end) for a window or bin of size binLength.
The stepSize controls the distance between bins. For example, a step size of 20 and a bin size of 20 will create bins next to each other. If the step size is smaller than the bin size the bins will overlap.
If a list of bedRegions is given, then the number of reads that overlaps with each region is counted.
Parameters
- chromstr
Chrom name
- startint
start coordinate
- endint
end coordinate
- barcodes: list
List of barcodes to count (currently set for tag ‘BC’ in the BAM)
- bed_regions_list: list
List of list of tuples of the form (start, end) corresponding to bed regions to be processed. If not bed file was passed to the object constructor then this list is empty.
Returns
- numpy array
The result is a numpy array that as rows each bin and as columns each bam file.
Examples
Initialize some useful values
>>> test = Tester() >>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 25, 0, stepSize=50)
The transpose is used to get better looking numbers. The first line corresponds to the number of reads per bin in the first bamfile.
>>> _array, __ = c.count_reads_in_region(test.chrom, 0, 200) >>> _array array([[0., 0.], [0., 1.], [1., 1.], [1., 2.]])
- getSmoothRange(tileIndex, tileSize, smoothRange, maxPosition)[source]
Given a tile index position and a tile size (length), return the a new indices over a larger range, called the smoothRange. This region is centered in the tileIndex an spans on both sizes to cover the smoothRange. The smoothRange is trimmed in case it is less than zero or greater than maxPosition
---------------|==================|------------------ tileStart |--------------------------------------| | <-- smoothRange --> | | tileStart - (smoothRange-tileSize)/2
Test for a smooth range that spans 3 tiles.
Examples
>>> c = CountReadsPerBin([], 1, 1, 1, 0) >>> c.getSmoothRange(5, 1, 3, 10) (4, 7)
Test smooth range truncated on start.
>>> c.getSmoothRange(0, 10, 30, 200) (0, 2)
Test smooth range truncated on start.
>>> c.getSmoothRange(1, 10, 30, 4) (0, 3)
Test smooth range truncated on end.
>>> c.getSmoothRange(5, 1, 3, 5) (4, 5)
Test smooth range not multiple of tileSize.
>>> c.getSmoothRange(5, 10, 24, 10) (4, 6)
- get_coverage_of_region(bamHandle, chrom, regions, fragmentFromRead_func=None)[source]
Returns a numpy array that corresponds to the number of reads that overlap with each tile.
>>> test = Tester() >>> import pysam >>> c = CountReadsPerBin([], stepSize=1, extendReads=300)
For this case the reads are length 36. The number of overlapping read fragments is 4 and 5 for the positions tested.
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2', ... [(5000833, 5000834), (5000834, 5000835)]) array([4., 5.])
In the following example a paired read is extended to the fragment length which is 100 The first mate starts at 5000000 and the second at 5000064. Each mate is extended to the fragment length independently At position 500090-500100 one fragment of length 100 overlap, and after position 5000101 there should be zero reads.
>>> c.zerosToNans = True >>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2', ... [(5000090, 5000100), (5000100, 5000110)]) array([ 1., nan])
In the following case the reads length is 50. Reads are not extended.
>>> c.extendReads=False >>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile2), '3R', [(148, 150), (150, 152), (152, 154)]) array([1., 2., 2.])
- get_fragment_from_read(read)[source]
Get read start and end position of a read. If given, the reads are extended as follows: If reads are paired end, each read mate is extended to match the fragment length, otherwise, a default fragment length is used. If reads are split (give by the CIGAR string) then the multiple positions of the read are returned. When reads are extended the cigar information is skipped.
Parameters
read: pysam object.
The following values are defined (for forward reads):
|-- -- read.tlen -- --| |-- read.alen --| -----|===============>------------<==============|---- | | | read.reference_start read.reference_end read.pnext and for reverse reads |-- -- read.tlen -- --| |-- read.alen --| -----|===============>-----------<===============|---- | | | read.pnext read.reference_start read.reference_end
this is a sketch of a pair-end reads
The function returns the fragment start and end, either using the paired end information (if available) or extending the read in the appropriate direction if this is single-end.
Parameters
read : pysam read object
Returns
- list of tuples
[(fragment start, fragment end)]
>>> test = Tester() >>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True) >>> c.defaultFragmentLength=100 >>> c.get_fragment_from_read(test.getRead("paired-forward")) [(5000000, 5000100)] >>> c.get_fragment_from_read(test.getRead("paired-reverse")) [(5000000, 5000100)] >>> c.defaultFragmentLength = 200 >>> c.get_fragment_from_read(test.getRead("single-forward")) [(5001491, 5001691)] >>> c.get_fragment_from_read(test.getRead("single-reverse")) [(5001536, 5001736)] >>> c.defaultFragmentLength = 'read length' >>> c.get_fragment_from_read(test.getRead("single-forward")) [(5001491, 5001527)] >>> c.defaultFragmentLength = 'read length' >>> c.extendReads = False >>> c.get_fragment_from_read(test.getRead("paired-forward")) [(5000000, 5000036)]
Tests for read centering.
>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True, center_read=True) >>> c.defaultFragmentLength = 100 >>> assert(c.get_fragment_from_read(test.getRead("paired-forward")) == [(5000032, 5000068)]) >>> c.defaultFragmentLength = 200 >>> assert(c.get_fragment_from_read(test.getRead("single-reverse")) == [(5001618, 5001654)])
- static is_proper_pair(read, maxPairedFragmentLength)[source]
Checks if a read is proper pair meaning that both mates are facing each other and are in the same chromosome and are not to far away. The sam flag for proper pair can not always be trusted. Note that if the fragment size is > maxPairedFragmentLength (~2kb usually) that False will be returned. :return: bool
>>> import pysam >>> import os >>> from deeptools.countReadsPerBin import CountReadsPerBin as cr >>> root = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/" >>> bam = pysam.AlignmentFile("{}/test_proper_pair_filtering.bam".format(root)) >>> iter = bam.fetch() >>> read = next(iter) >>> cr.is_proper_pair(read, 1000) # "keep" read True >>> cr.is_proper_pair(read, 200) # "keep" read, but maxPairedFragmentLength is too short False >>> read = next(iter) >>> cr.is_proper_pair(read, 1000) # "improper pair" False >>> read = next(iter) >>> cr.is_proper_pair(read, 1000) # "mismatch chr" False >>> read = next(iter) >>> cr.is_proper_pair(read, 1000) # "same orientation1" False >>> read = next(iter) >>> cr.is_proper_pair(read, 1000) # "same orientation2" False >>> read = next(iter) >>> cr.is_proper_pair(read, 1000) # "rev first" False >>> read = next(iter) >>> cr.is_proper_pair(read, 1000) # "rev first OK" True >>> read = next(iter) >>> cr.is_proper_pair(read, 1000) # "for first" False >>> read = next(iter) >>> cr.is_proper_pair(read, 1000) # "for first" True
- sincei.ReadCounter.countReadsInRegions_wrapper(args)[source]
Passes the arguments to countReadsInRegions_worker. This is a step required given the constrains from the multiprocessing module. The args var, contains as first element the ‘self’ value from the countReadsPerBin object
- sincei.ReadCounter.estimateSizeFactors(m)[source]
Compute size factors in the same way as DESeq2. The inverse of that is returned, as it’s then compatible with bamCoverage.
m : a numpy ndarray
>>> m = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 10, 0], [10, 5, 100]]) >>> sf = estimateSizeFactors(m) >>> assert(np.all(np.abs(sf - [1.305, 0.9932, 0.783]) < 1e-4)) >>> m = np.array([[0, 0], [0, 1], [1, 1], [1, 2]]) >>> sf = estimateSizeFactors(m) >>> assert(np.all(np.abs(sf - [1.1892, 0.8409]) < 1e-4))
sincei.RegionQuery module
- sincei.RegionQuery.get_bins_by_gene(dict, gene, firstBin=False)[source]
Returns the bins for a given gene.
Parameters
- dictdict
Dictionary of bins and genes.
- genestr
Gene name.
- firstBinbool
If true, return only the first bin of the gene.
Returns
- list
List of bins.
Examples
>>> dict = {'chr1_1': [('gene1', '+'), ('gene2', '-')], 'chr1_2': [('gene1', '+')]} >>> get_bins_by_gene(dict, 'gene1') ['chr1_1', 'chr1_2'] >>> get_bins_by_gene(dict, 'gene1', firstBin=True) 'chr1_1'
- sincei.RegionQuery.get_gtf_adata_olaps(adata, gtf)[source]
Get overlaps between adata and gtf
Parameters
- adataAnnData
Annotated data matrix.
- gtfGTF
GTF object
Returns
- dict
Dictionary with overlaps for each gene in adata.
Examples
>>> test = Tester() >>> gtf = GTF(test.gtfFile) >>> adata = sc.read_10x_mtx(test.input_matrix_dir, var_names='gene_symbols', cache=True) >>> olaps=get_gtf_adata_olaps(adata, gtf) >>> olaps['Gm37381'] [('ENSMUSG00000064372', '+'), ('ENSMUSG00000064372', '-')]
sincei.TopicModels module
- class sincei.TopicModels.TOPICMODEL(mat, cells, regions, n_topics, smart_code='lfu', n_passes=1, n_workers=1)[source]
Bases:
object
Computes LSA or LDA for a given matrix and returns the cell-topic matrix
Parameters
- matscipy.sparse.csr_matrix
Sparse matrix of shape (cells, regions)
- cellslist
List of Cell IDs (corresponding to the input matrix rows)
- regionslist
List of Regions (corresponding to the input matrix columns)
- n_topicsint
Number of Topics / Principal Components
- smart_codestr
SMART (System for the Mechanical Analysis and Retrieval of Text) code for weighting of input matrix for TFIDF. Only valid for the LSA model. The default (“lfu”) corresponds to “log”TF * IDF, and “pivoted unique” normalization of document length. For more information, see: https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
Returns
An object of class TOPICMODELS.
- get_cell_topic(pop_sparse_cells=False)[source]
Get cell-topic matrix from the updated object
Returns
- cell_topicpandas dataframe
Cell-topic matrix
sincei.Utilities module
- sincei.Utilities.checkAlignedFraction(read, lowFilter)[source]
Check whether the fraction of read length that aligns to the reference is higher than the given threshold. Aligned fraction includes the max allowed mismatches tolerated by the aligner, and excludes InDels and Clippings.
Return: Bool
- sincei.Utilities.checkGCcontent(read, lowFilter, highFilter, returnGC=False)[source]
Checks if the GC content of the read is within the given range
Parameters
- readpysam.AlignedSegment
A pysam AlignedSegment object
- lowFilterfloat
Minimum GC content
- highFilterfloat
Maximum GC content
- returnGCbool
If true, return the GC content of the read
Returns
- bool
True if the GC content of the read is within the given range
Examples
>>> test = Tester() >>> read = test.bamFile1.fetch().next() >>> checkGCcontent(read, 0.3, 0.7) True
- sincei.Utilities.checkMotifs(read, chrom, genome, readMotif, refMotif)[source]
Check whether a given motif is present in the read and the corresponding reference genome. For example, in MNAse (scChIC-seq) data, we expect the reads to have an ‘A’ at the 5’-end, while the genome has a ‘TA’ over hang (where the ‘A’ aligns with ‘A’ in the forward read), like this below.
Forwards aligned read: read has ‘A’, upstream has T R1 ……..A——-> ———-TA————Ref (+)
Rev aligned read: read has ‘T’, downstream has A
<——-T……. R1 ——–TA————Ref (+)
This function can look for any arbitrary motif in read and corresponding genome, but in the same orientation as described above.
- Returns:
bool
>>> import pysam >>> import os >>> from scDeepTools.scReadCounter import CountReadsPerBin as cr >>> root = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/" >>> bam = pysam.AlignmentFile("{}/test_TA_filtering.bam".format(root)) >>> iter = bam.fetch() >>> read = next(iter) >>> cr.checkMotifs(read, 'A', 'TA') # for valid scChIC read True >>> read = next(iter) >>> cr.checkMotifs(read, 'A', 'TA') # for invalid scChIC read False
- sincei.Utilities.colorPicker(name)[source]
This function returns a list of colors for plotting.
Parameters
- namestr
The name of the color palette to use.
Returns
- list
A list of colors.
Examples
>>> colorPicker('twentyfive') ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00', '#ffff33', '#a65628', '#f781bf', '#999999']
>>> colorPicker('colorblind') ['#0072B2', '#009E73', '#D55E00', '#CC79A7', '#F0E442', '#56B4E9', '#E69F00', '#000000']
- sincei.Utilities.getDupFilterTuple(read, bc, filterArg)[source]
Returns a tuple with the information needed to filter duplicates, based on read and filter type. The tuple is composed of the barcode, the umi, the start and end positions and the chromosome name.
Parameters
- readpysam.AlignedSegment
A pysam.AlignedSegment object
- bcstr
The barcode
- filterstr
A string with the type of filter to use.
Returns
tuple
A tuple with the information needed to filter duplicates. The tuple is composed of the barcode, the umi, the start and end positions and the chromosome name.
Examples
>>> test = Tester() >>> read = test.bamFile1.fetch().next() >>> getDupFilterTuple(read, 'ATCG', 'end_umi') ('ATCG', 'ATCG', None, None, 0, False)
- sincei.Utilities.gini(i, X)[source]
Computes the Gini coefficient for each row of a sparse matrix (Obs*Var).
Parameters
- iint
row index
- Xnumpy array
matrix
Returns
- float
Gini coefficient for the given row
Examples
>>> X = np.matrix([[1,2,3,4],[5,6,7,8],[9,10,11,12]]) >>> gini(0, X) 0.0 >>> gini(1, X) 0.0 >>> gini(2, X) 0.0
sincei.WriteBedGraph module
- class sincei.WriteBedGraph.WriteBedGraph(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]
Bases:
CountReadsPerBin
Reads bam files coverages and writes a bedgraph or bigwig file
Extends the CountReadsPerBin object such that the coverage of bam files is writen to multiple bedgraph files at once.
The bedgraph files are later merge into one and converted into a bigwig file if necessary.
The constructor arguments are the same as for CountReadsPerBin. However, when calling the run method, the following parameters have to be passed
Examples
Given the following distribution of reads that cover 200 on a chromosome named ‘3R’:
0 100 200 |------------------------------------------------------------| A =============== =============== B =============== =============== =============== ===============
>>> import tempfile >>> test_path = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> outFile = tempfile.NamedTemporaryFile() >>> bam_file = test_path + "testA.bam"
For the example a simple scaling function is going to be used. This function takes the coverage found at each region and multiplies it to the scaling factor. In this case the scaling factor is 1.5
>>> function_to_call = scaleCoverage >>> funcArgs = {'scaleFactor': 1.5}
Restrict process to a region between positions 0 and 200 of chromosome 3R
>>> region = '3R:0:200'
Set up such that coverage is computed for consecutive bins of length 25 bp >>> bin_length = 25 >>> step_size = 25
>>> num_sample_sites = 0 #overruled by step_size >>> c = WriteBedGraph([bam_file], binLength=bin_length, region=region, stepSize=step_size) >>> c.run(function_to_call, funcArgs, outFile.name) >>> f = open(outFile.name, 'r') >>> f.readlines() ['3R\t0\t100\t0\n', '3R\t100\t200\t1.5\n'] >>> f.close() >>> outFile.close()
- run(func_to_call, func_args, out_file_prefix, blackListFileName=None, format='bedgraph', smoothLength=0, normUsing=None)[source]
Given a list of bamfiles, a function and a function arguments, this method writes a bedgraph file (or bigwig) file for a partition of the genome into tiles of given size and a value for each tile that corresponds to the given function and that is related to the coverage underlying the tile.
Parameters
- func_to_callstr
function name to be called to convert the list of coverages computed for each bam file at each position into a single value.
- func_argsdict
dict of arguments to pass to func. E.g. {‘scaleFactor’:1.0}
- out_file_prefixstr
name of the file to save the resulting data.
- writeBedGraph_worker(chrom, start, end, func_to_call, func_args, bed_regions_list=None)[source]
Writes a bedgraph based on the read coverage per group of cells, indicated by cluster_info data frame.
The given func is called to compute the desired bedgraph value using the funcArgs
Parameters
- chromstr
Chrom name
- startint
start coordinate
- endint
end coordinate
- func_to_callstr
function name to be called to convert the list of coverages computed for each bam file at each position into a single value. An example is a function that takes the ratio between the coverage of two bam files.
- func_argsdict
dict of arguments to pass to func.
- smoothLengthint
Distance in bp for smoothing the coverage per tile.
- bed_regions_list: list
List of tuples of the form (chrom, start, end) corresponding to bed regions to be processed. If not bed file was passed to the object constructor then this list is empty.
Returns
A list of [chromosome, start, end, temporary file], where the temporary file contains the bedgraph results for the region queried.
Examples
>>> test_path = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/" >>> bamFile1 = test_path + "testA.bam" >>> bin_length = 50 >>> number_of_samples = 0 # overruled by step_size >>> func_to_call = scaleCoverage >>> funcArgs = {'scaleFactor': 1.0}
>>> c = WriteBedGraph([bamFile1], bin_length, number_of_samples, stepSize=50) >>> tempFile = c.writeBedGraph_worker( '3R', 0, 200, func_to_call, funcArgs) >>> f = open(tempFile[3], 'r') >>> f.readlines() ['3R\t0\t100\t0\n', '3R\t100\t200\t1\n'] >>> f.close() >>> os.remove(tempFile[3])
sincei.multimodalClustering module
- sincei.multimodalClustering.multiModal_clustering(mode1_adata, mode2_adata, column_key='barcode_nla', nK=20)[source]
Performs multi-graph clustering on matched keys(barcodes) bw two anndata objects.
Parameters
- mode1_adataAnnData
AnnData object for mode 1
- mode2_adataAnnData
AnnData object for mode 2
- column_keystr
Column name for the barcode in mode 1 and 2
- nKint
Number of clusters to use for clustering
Returns
- multi_umapDataFrame
DataFrame with UMAP coordinates and cluster labels
- mode1_adataAnnData
AnnData object for mode 1
- mode2_adataAnnData
AnnData object for mode 2
- sincei.multimodalClustering.umap_aligned(pca_mode1, pca_mode2, nK=15, distance_metric='eucledian')[source]
Aligns two UMAP embeddings using the UMAP AlignedUMAP class
Parameters
- pca_mode1pandas.DataFrame
UMAP embedding of RNA data
- pca_mode2pandas.DataFrame
UMAP embedding of CHiC data
- nKint
Number of nearest neighbors to use for UMAP
- distance_metricstr
Distance metric to use for UMAP
Returns
- pandas.DataFrame
Aligned UMAP embedding of RNA data
- pandas.DataFrame
Aligned UMAP embedding of CHiC data
Examples
>>> test = Tester() >>> pca_mode1 = test.pca_mode1 >>> pca_mode2 = test.pca_mode2 >>> umap_aligned(pca_mode1, pca_mode2) ( aligned_UMAP1 aligned_UMAP2 0 -0.012995 0.001206 1 0.
sincei.scBAMops module
sincei.scBulkCoverage module
- class sincei.scBulkCoverage.CenterFragment(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]
Bases:
WriteBedGraph
Class to redefine the get_fragment_from_read for the –MNase case
The coverage of the fragment is defined as the 2 or 3 basepairs at the center of the fragment length.
- class sincei.scBulkCoverage.OffsetFragment(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]
Bases:
WriteBedGraph
Class to redefine the get_fragment_from_read for the –Offset case
- filterStrand(read, rv)[source]
A generic read filtering function that gets used by everything in this class.
rv is returned if the strand is correct, otherwise [(None, None)]