sincei package

Submodules

sincei.ExponentialFamily module

class sincei.ExponentialFamily.Bernoulli(family_params=None, **kwargs)[source]

Bases: ExponentialFamily

Bernoulli distribution

family_params of interest:

“max_val” (int) corresponding to the max value (replaces infinity).

Empirically, values above 10 yield similar results.

base_measure(X: Tensor)[source]

invert_g(X: Tensor)[source]

log_likelihood(X: Tensor, theta: Tensor)[source]: Computes negative log-likelihood between dataset X and parameters theta

log_partition(theta: Tensor)[source]

natural_parametrization(theta: Tensor)[source]

sufficient_statistics(X: Tensor)[source]

class sincei.ExponentialFamily.Beta(family_params=None, **kwargs)[source]

Bases: ExponentialFamily

Beta distribution, using a standard formulation.

Original formulation presented in [Mourragui et al, 2023].

family_params of interest:

“min_val” (int): min data value (replaces 0 and 1).
“n_jobs” (int): number of jobs, specifically

for computing the “nu” parameter. - “method” (str): method use to compute the “nu” parameter per feature. Two possibles: “MLE” and “MM”. Default to “MLE”. - “eps” (float): minimum difference used for inverting the g function. Default to 1e-4 - “maxiter” (int): maximum number of iterations for the inversion of the g function. Default to 100.

base_measure(X: Tensor)[source]

exponential_term(X: Tensor, theta: Tensor)[source]

initialize_family_parameters(X: Optional[Tensor] = None)[source]: General method to initialize certain parameters (e.g. for Beta or Negative Binomial

invert_g(X: Optional[Tensor] = None)[source]: Dichotomy to find where derivative maxes out

log_partition(theta: Tensor)[source]

natural_parametrization(theta: Tensor)[source]

sufficient_statistics(X: Tensor)[source]

class sincei.ExponentialFamily.ExponentialFamily(family_params=None, **kwargs)[source]

Bases: object

Encodes an exponential family distribution using PyTorch autodiff structures.

ExponentialFamily corresponds to the superclass providing a backbone for all exponential family. Each subclass should contain the following methods, defined based on the distribution of choice (same notation as [Mourragui et al, 2023]):

sufficient_statistics ($T$),

natural_parametrization ($eta$),

log_partition ($A$).

invert_g ($g^{-1}$).

initialize_family_parameters: computes parameters used in other

methods, e.g., gene-level dispersion for Negative Binomial.

We added a “base_measure” for sake of completeness, but this method is not necessary for running GLM-PCA. The log-likelihood and exponential term are defined directly from the aforementionned methods.

Parameters

family_nameint: Name of the family.

base_measure(X: Tensor)[source]

distribution(X: Tensor, theta: Tensor)[source]

exponential_term(X: Tensor, theta: Tensor)[source]

initialize_family_parameters(X: Optional[Tensor] = None)[source]: General method to initialize certain parameters (e.g. for Beta or Negative Binomial

invert_g(X: Tensor)[source]

load_family_params_to_gpu(device)[source]

log_distribution(X: Tensor, theta: Tensor)[source]

log_likelihood(X: Tensor, theta: Tensor)[source]: Computes negative log-likelihood between dataset X and parameters theta

log_partition(theta: Tensor)[source]

natural_parametrization(theta: Tensor)[source]

sufficient_statistics(X: Tensor)[source]

class sincei.ExponentialFamily.Gamma(family_params=None, **kwargs)[source]

Bases: ExponentialFamily

Gamma distribution using a standard formulation.

Original formulation presented in [Mourragui et al, 2023].

family_params of interest:

“min_val” (int): min data value, default to 1e-5.
“max_val” (int): max data value, default to 10e6.
“n_jobs” (int): number of jobs, specifically

for computing the “nu” parameter. - “method” (str): method use to compute the “nu” parameter per feature. Two possibles: “MLE” and “MM”. Default to “MLE”. - “eps” (float): minimum difference used for inverting the g function. Default to 1e-4 - “maxiter” (int): maximum number of iterations for the inversion of the g function. Default to 100.

exponential_term(X: Tensor, theta: Tensor)[source]

initialize_family_parameters(X: Optional[Tensor] = None)[source]: General method to initialize certain parameters (e.g. for Beta or Negative Binomial

invert_g(X: Optional[Tensor] = None)[source]: Dichotomy to compute inverse of digamma function.

log_partition(theta: Tensor)[source]

natural_parametrization(theta: Tensor)[source]

sufficient_statistics(X: Tensor)[source]

class sincei.ExponentialFamily.Gaussian(family_params=None, **kwargs)[source]

Bases: ExponentialFamily

Gaussian with standard deviation one.

GLMPCA with Gaussian as family is equivalent to the standard PCA.

base_measure(X: Tensor)[source]

invert_g(X: Tensor)[source]

log_partition(theta: Tensor)[source]

natural_parametrization(theta: Tensor)[source]

sufficient_statistics(X: Tensor)[source]

class sincei.ExponentialFamily.LogNormal(family_params=None, **kwargs)[source]

Bases: ExponentialFamily

Log-normal distribution using a standard formulation.

Original formulation presented in [Mourragui et al, 2023].

family_params of interest:

“min_val” (int): min data value, default to 1e-5.
“max_val” (int): max data value, default to 10e6.
“n_jobs” (int): number of jobs, specifically

for computing the “nu” parameter. - “method” (str): method use to compute the “nu” parameter per feature. Two possibles: “MLE” and “MM”. Default to “MLE”. - “eps” (float): minimum difference used for inverting the g function. Default to 1e-4 - “maxiter” (int): maximum number of iterations for the inversion of the g function. Default to 100.

base_measure(X: Tensor)[source]

exponential_term(X: Tensor, theta: Tensor)[source]

initialize_family_parameters(X: Optional[Tensor] = None)[source]: General method to initialize certain parameters (e.g. for Beta or Negative Binomial

invert_g(X: Optional[Tensor] = None)[source]

log_partition(theta: Tensor)[source]

natural_parametrization(theta: Tensor)[source]

sufficient_statistics(X: Tensor)[source]

class sincei.ExponentialFamily.Poisson(family_params=None, **kwargs)[source]

Bases: ExponentialFamily

Poisson distribution

family_params of interest:

“min_val” (int) corresponding to the min value (replaces 0).

base_measure(X: Tensor)[source]

invert_g(X: Tensor)[source]

log_distribution(X: Tensor, theta: Tensor)[source]: The computation of gamma function for the base measure (h) would lead to inf, hence a re-design of the method.

log_partition(theta: Tensor)[source]

natural_parametrization(theta: Tensor)[source]

sufficient_statistics(X: Tensor)[source]

class sincei.ExponentialFamily.SigmoidBeta(family_params=None, **kwargs)[source]

Bases: Beta

Beta distribution re-parametrized using a Sigmoid.

This distribution is similar to the previous Beta (which it inherits from) but the natural parameter is re-parametrized using a Sigmoid. This is shown expeerimentally to stabilize the optimisation by removing the ]0,1[ constraint.

family_params of interest:

“min_val” (int): min data value (replaces 0 and 1).
“n_jobs” (int): number of jobs, specifically

for computing the “nu” parameter. - “method” (str): method use to compute the “nu” parameter per feature. Two possibles: “MLE” and “MM”. Default to “MLE”. - “eps” (float): minimum difference used for inverting the g function. Default to 1e-4 - “maxiter” (int): maximum number of iterations for the inversion of the g function. Default to 100.

invert_g(X: Optional[Tensor] = None)[source]: Dichotomy to find where derivative maxes out

natural_parametrization(theta: Tensor)[source]

sincei.FragmentFFT module

sincei.FragmentFFT.fftplot(outdict, plot)[source]

Computes the periodicity of the fragment distribution

Parameters

outdictdict: Dictionary containing the fragment distribution for each barcode
plotstr: File name to save the plot

Returns

dict: Dictionary containing the periodicity of the fragment distribution for each barcode

Examples

>>> test = Tester()
>>> d = test.get_fragment_distribution()
>>> fftplot(d, 'test.png')

sincei.FragmentFFT.ffttable(selected)[source]

Computes the FFT of the fragment length distribution

Parameters

selectednumpy array: Array of fragment lengths

Returns

pandas dataframe: Dataframe with two columns. The first column contains the frequencies and the second column the power spectrum.

Examples

>>> test = Tester()
>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 50, 4)
>>> num_reads_per_bin, regionList = c.run()
>>> selected = num_reads_per_bin[:,0]
>>> ffttable(selected)
   freq         value
0   0.0  7.812500e+06
1   1.0  1.812500e+06
2   2.0  1.812500e+06
3   3.0  1.812500e+06
4   4.0  1.812500e+06
5   5.0  1.812500

sincei.FragmentFFT.fragment_distribution(fragment_len_dict, length_plot)[source]

Plot fragment length distribution

Parameters

fragment_len_dictdict: Dictionary containing the fragment length for each barcode.
length_plotstr: File name to save the plot.

Returns

dict: Dictionary containing the top 20 frequencies for each barcode.

Examples

>>> test = Tester()
>>> fragment_len_dict = {'AAACCTGAGAGGTTCT': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
...                      'AAACCTGAGAGGTTCT': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]}
>>> fragment_distribution(fragment_len_dict, test.length_plot)

sincei.GLMPCA module

class sincei.GLMPCA.GLMPCA(n_pc, family='gaussian', family_params=None, max_iter=100, learning_rate=0.2, batch_size=256, step_size=20, gamma=0.5, n_init=1, init='spectral', n_jobs=1)[source]

Bases: object

Performs GLM-PCA on a data matrix to reduce dimensionality

This class computes the generalized-linear model principal components (GLM-PC) of a dataset by exploiting the framework of saturated parameters. Specifically, given an exponential family distribution choosen following some prior knowledge, GLM-PCA will find a collection of directions which maximise the reconstruction error, computed as the negative log-likelihood of the chosen exponential family.

By making use of an alternative formulation, our implementation can exploit automatic differentiation and can therefore rely on mini-batch Stochastic Gradient Descent. As a consequence, it scales to very large dataset.

Another interesting feature of our implementation is that it does not require cumbersome Lagrangian derivations. If you wish to test an exponential family distribution not present in our implementation, adding a class in ExponentialFamily with the different density functionals defined there would suffice to use GLM-PCA.

Parameters

n_pcint

Number of principal components.

family: str

Name of the exponential family distribution. Possible families: “gaussian”, “poisson”, “bernoulli”, “beta”, “gamma”, “log_normal”, “log_beta”:, “sigmoid_beta”. Default to “gaussian”

family_paramsdict

Dictionary with family parameters to be added. List of parameters depend on the specific ExponentialFamily class chosen. Examples:

“n_jobs” (int) for parallelization, specifically for “beta” and “gamma”.

“min_val” (float) for truncating in “poisson” or “beta”.

“eps” (float) for convergence in inverse computation in “beta”.

Default to None.

max_iterint

Maximum number of epochs in the GLM-PCA optimisation. Default to 100.

learning_rate: float

Learning rate to be used in the GLM-PCA optimisation. If learning_rate is too high and lead to NaN, our implementation automatically restarts the optimisation with a smaller value. Default to 0.2.

batch_sizeint

Size of the batch in the SGD optimisation step. Default to 256.

step_size: int

Step size in optimiser scheduler. See more: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html Default to 20.

gamma: int

Reduction parameter for optimiser scheduler. See more: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html Default to 0.5

n_init: int

Number of GLM-PCA initializations. Useful if you want to explore different random seeds and starting points. Default to 1.

init: str

Method to initialize loadings. “spectral” performs SVD on the saturated parameters from a small random batch of the dataset, “random” performs a random initialization on the Stiefel manifold. Default to “spectral”.

n_jobs: int

Number of jobs used in parallel operations. Default to 1.

fit(X)[source]

Fits a GLM-PCA to a specific dataset.

Parameters

Xtorch.Tensor or np.array: Dataset with cells in the rows and features in the columns.

Returns

bool: Returns True if the fitting procedure has been successful.

transform(X)[source]

Transforms and project dataset X onto the principal components.

Parameters

Xtorch.Tensor or np.array: Dataset with cells in the rows and features in the columns.

Returns

torch.Tensor: Projected saturated parameters.

sincei.GetStats module

sincei.GetStats.getStats_worker(arglist)[source]

Computes statistics for each read in a bam file

This function computes statistics for each read in a bam file.

Parameters

bamfileslist: List containing the names of indexed bam files. E.g. [‘file1.bam’, ‘file2.bam’]
binSizeint: Length of the window/bin. This value is overruled by bedFile if present.
barcodeslist: list of barcodes to count the reads from.
numberOfSamplesint: Total number of samples. The genome is divided into numberOfSamples, each with a window/bin length equal to binLength. This value is overruled by stepSize in case such value is present and by bedFile in which case the number of samples and bins are defined in the bed file
numberOfProcessorsint: Number of processors to use. Default is 4
verbosebool: Output messages. Default: False
regionstr: Region to limit the computation in the form chrom:start

sincei.ParserCommon module

sincei.ParserCommon.bamOptions(args=None, suppress_args=None, default_opts=None)[source]: Arguments for tools that operate on BAM files

sincei.ParserCommon.filterOptions(args=None, suppress_args=None)[source]

sincei.ParserCommon.genomicRegion(string)[source]

sincei.ParserCommon.get_default(name, argdict)[source]

sincei.ParserCommon.inputOutputOptions(args=None, opts=None, requiredOpts=[], suppress_args=None)[source]

sincei.ParserCommon.numberOfProcessors(string)[source]

sincei.ParserCommon.otherOptions(args=None)[source]

sincei.ParserCommon.plotOptions(args=None)[source]

sincei.ParserCommon.process_args(args=None)[source]

sincei.ParserCommon.readOptions(args=None, suppress_args=None)[source]: Common arguments related to BAM files and the interpretation of the read coverage

sincei.ParserCommon.show_or_hide(help, name, arglist)[source]

sincei.ParserCommon.smartLabel(label)[source]: Remove the path name and the last extension from the file name /pth/to/file.name.bam -> file.name

sincei.ParserCommon.smartLabels(labels)[source]

sincei.ParserCommon.validateInputs(args, process_barcodes=True)[source]: Ensure that right input is provided from argparse

sincei.ReadCounter module

class sincei.ReadCounter.CountReadsPerBin(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]

Bases: object

Collects coverage over multiple bam files using multiprocessing

This function collects read counts (coverage) from several bam files and returns an numpy array with the results. This class uses multiprocessing to compute the coverage.

Parameters

bamFilesListlist: List containing the names of indexed bam files. E.g. [‘file1.bam’, ‘file2.bam’]
binLengthint: Length of the window/bin. This value is overruled by bedFile if present.
barcodeslist: list of barcodes to count the reads from.
numberOfSamplesint: Total number of samples. The genome is divided into numberOfSamples, each with a window/bin length equal to binLength. This value is overruled by stepSize in case such value is present and by bedFile in which case the number of samples and bins are defined in the bed file
numberOfProcessorsint: Number of processors to use. Default is 4
verbosebool: Output messages. Default: False
regionstr: Region to limit the computation in the form chrom:start:end.
bedFilelist of file_handles.: Each file handle corresponds to a bed file containing the regions for which to compute the coverage. This option overrules binLength, numberOfSamples and stepSize.
blackListFileNamestr: A string containing a BED file with blacklist regions.

extendReads : bool, int

Whether coverage should be computed for the extended read length (i.e. the region covered by the two mates or the regions expected to be covered by single-reads). If the value is ‘int’, then then this is interpreted as the fragment length to extend reads that are not paired. For Illumina reads, usual values are around 300. This value can be determined using the peak caller MACS2 or can be approximated by the fragment lengths computed when preparing the library for sequencing. If the value is of the variable is true and not value is given, the fragment size is sampled from the library but only if the library is paired-end. Default: False

minMappingQualityint: Reads of a mapping quality less than the give value are not considered. Default: None
duplicateFilterstr: Type of duplicate filter to use (same start, end position, umi and barcodes. If paired-end, same start-end for mates) are to be excluded. Default: None
chrToSkip: list: List with names of chromosomes that do not want to be included in the coverage computation. This is useful to remove unwanted chromosomes (e.g. ‘random’ or ‘Het’).
stepSizeint: the positions for which the coverage is computed are defined as follows: range(start, end, stepSize). Thus, a stepSize of 1, will compute the coverage at each base pair. If the stepSize is equal to the binLength then the coverage is computed for consecutive bins. If seepSize is smaller than the binLength, then teh bins will overlap.
center_readbool: Determines if reads should be centered with respect to the fragment length.
samFlag_includeint: Extracts only those reads having the SAM flag. For example, to get only reads that are the first mates a samFlag of 64 could be used. Similarly, the samFlag_include can be used to select only reads mapping on the reverse strand or to get only properly paired reads.
samFlag_excludeint: Removes reads that match the SAM flag. For example to get all reads that map to the forward strand a samFlag_exlude 16 should be used. Which translates into exclude all reads that map to the reverse strand.
zerosToNansbool: If true, zero values encountered are transformed to Nans. Default false.
skipZeroOverZerobool: If true, skip bins where all input BAM files have no coverage (only applicable to bamCompare).
minFragmentLengthint: If greater than 0, fragments below this size are excluded.
maxFragmentLengthint: If greater than 0, fragments above this size are excluded.
minAlignedFractionfloat: fragments where less than the given fraction of bases align are excluded.
motifFilterlist: Only alignments with reads containing the given motif at 5’-end and genome 5’-end are counted.
GCcontentFilterlist: Only alignments with given min and max GC content are counted.
genome2bitstr: 2 bit file for the genome (if motifFilter is specified)
out_file_for_raw_datastr: File name to save the raw counts computed
statsListlist: For each BAM file in bamFilesList, the associated per-chromosome statistics returned by openBam
mappedListlist: For each BAM file in bamFilesList, the number of mapped reads in the file.
bed_and_binboolean: If true AND a bedFile is given, compute coverage of each bin of the given size in each region of bedFile
sumCoveragePerBinboolean: If true return cumulative coverage per bin, instead of total read counts (for plotFingerPrint)
genomeChunkSizeint: If not None, the length of the genome used for multiprocessing.

Returns

numpy array

Each row correspond to each bin/bed region and each column correspond to each of the bamFiles.

Examples

The test data contains reads for 200 bp.

>>> test = Tester()

The transpose function is used to get a nicer looking output. The first line corresponds to the number of reads per bin in bam file 1

>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 50, 4)
>>> np.transpose(c.run())
array([[0., 0., 1., 1.],
       [0., 1., 1., 2.]])

count_reads_in_region(chrom, start, end, bed_regions_list=None)[source]

Counts the reads in each bam file at each ‘stepSize’ position within the interval (start, end) for a window or bin of size binLength.

The stepSize controls the distance between bins. For example, a step size of 20 and a bin size of 20 will create bins next to each other. If the step size is smaller than the bin size the bins will overlap.

If a list of bedRegions is given, then the number of reads that overlaps with each region is counted.

Parameters

chromstr: Chrom name
startint: start coordinate
endint: end coordinate
barcodes: list: List of barcodes to count (currently set for tag ‘BC’ in the BAM)
bed_regions_list: list: List of list of tuples of the form (start, end) corresponding to bed regions to be processed. If not bed file was passed to the object constructor then this list is empty.

Returns

numpy array: The result is a numpy array that as rows each bin and as columns each bam file.

Examples

Initialize some useful values

>>> test = Tester()
>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 25, 0, stepSize=50)

The transpose is used to get better looking numbers. The first line corresponds to the number of reads per bin in the first bamfile.

>>> _array, __ = c.count_reads_in_region(test.chrom, 0, 200)
>>> _array
array([[0., 0.],
       [0., 1.],
       [1., 1.],
       [1., 2.]])

getReadLength(read)[source]

getSmoothRange(tileIndex, tileSize, smoothRange, maxPosition)[source]

Given a tile index position and a tile size (length), return the a new indices over a larger range, called the smoothRange. This region is centered in the tileIndex an spans on both sizes to cover the smoothRange. The smoothRange is trimmed in case it is less than zero or greater than maxPosition

---------------|==================|------------------
           tileStart
      |--------------------------------------|
      |    <--      smoothRange     -->      |
      |
tileStart - (smoothRange-tileSize)/2

Test for a smooth range that spans 3 tiles.

Examples

>>> c = CountReadsPerBin([], 1, 1, 1, 0)
>>> c.getSmoothRange(5, 1, 3, 10)
(4, 7)

Test smooth range truncated on start.

>>> c.getSmoothRange(0, 10, 30, 200)
(0, 2)

Test smooth range truncated on start.

>>> c.getSmoothRange(1, 10, 30, 4)
(0, 3)

Test smooth range truncated on end.

>>> c.getSmoothRange(5, 1, 3, 5)
(4, 5)

Test smooth range not multiple of tileSize.

>>> c.getSmoothRange(5, 10, 24, 10)
(4, 6)

get_chunk_length(bamFilesHandles, genomeSize, chromSizes, chrLengths)[source]

get_coverage_of_region(bamHandle, chrom, regions, fragmentFromRead_func=None)[source]

Returns a numpy array that corresponds to the number of reads that overlap with each tile.

>>> test = Tester()
>>> import pysam
>>> c = CountReadsPerBin([], stepSize=1, extendReads=300)

For this case the reads are length 36. The number of overlapping read fragments is 4 and 5 for the positions tested.

>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000833, 5000834), (5000834, 5000835)])
array([4., 5.])

In the following example a paired read is extended to the fragment length which is 100 The first mate starts at 5000000 and the second at 5000064. Each mate is extended to the fragment length independently At position 500090-500100 one fragment of length 100 overlap, and after position 5000101 there should be zero reads.

>>> c.zerosToNans = True
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000090, 5000100), (5000100, 5000110)])
array([ 1., nan])

In the following case the reads length is 50. Reads are not extended.

>>> c.extendReads=False
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile2), '3R', [(148, 150), (150, 152), (152, 154)])
array([1., 2., 2.])

get_fragment_from_read(read)[source]

Get read start and end position of a read. If given, the reads are extended as follows: If reads are paired end, each read mate is extended to match the fragment length, otherwise, a default fragment length is used. If reads are split (give by the CIGAR string) then the multiple positions of the read are returned. When reads are extended the cigar information is skipped.

Parameters

read: pysam object.

The following values are defined (for forward reads):

     |--          -- read.tlen --              --|
     |-- read.alen --|
-----|===============>------------<==============|----
     |               |            |
read.reference_start
            read.reference_end  read.pnext

  and for reverse reads


     |--             -- read.tlen --           --|
                                 |-- read.alen --|
-----|===============>-----------<===============|----
     |                           |               |
  read.pnext           read.reference_start  read.reference_end

this is a sketch of a pair-end reads

The function returns the fragment start and end, either using the paired end information (if available) or extending the read in the appropriate direction if this is single-end.

Parameters

read : pysam read object

Returns

list of tuples: [(fragment start, fragment end)]

>>> test = Tester()
>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True)
>>> c.defaultFragmentLength=100
>>> c.get_fragment_from_read(test.getRead("paired-forward"))
[(5000000, 5000100)]
>>> c.get_fragment_from_read(test.getRead("paired-reverse"))
[(5000000, 5000100)]
>>> c.defaultFragmentLength = 200
>>> c.get_fragment_from_read(test.getRead("single-forward"))
[(5001491, 5001691)]
>>> c.get_fragment_from_read(test.getRead("single-reverse"))
[(5001536, 5001736)]
>>> c.defaultFragmentLength = 'read length'
>>> c.get_fragment_from_read(test.getRead("single-forward"))
[(5001491, 5001527)]
>>> c.defaultFragmentLength = 'read length'
>>> c.extendReads = False
>>> c.get_fragment_from_read(test.getRead("paired-forward"))
[(5000000, 5000036)]

Tests for read centering.

>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True, center_read=True)
>>> c.defaultFragmentLength = 100
>>> assert(c.get_fragment_from_read(test.getRead("paired-forward")) == [(5000032, 5000068)])
>>> c.defaultFragmentLength = 200
>>> assert(c.get_fragment_from_read(test.getRead("single-reverse")) == [(5001618, 5001654)])

static is_proper_pair(read, maxPairedFragmentLength)[source]

Checks if a read is proper pair meaning that both mates are facing each other and are in the same chromosome and are not to far away. The sam flag for proper pair can not always be trusted. Note that if the fragment size is > maxPairedFragmentLength (~2kb usually) that False will be returned. :return: bool

>>> import pysam
>>> import os
>>> from deeptools.countReadsPerBin import CountReadsPerBin as cr
>>> root = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> bam = pysam.AlignmentFile("{}/test_proper_pair_filtering.bam".format(root))
>>> iter = bam.fetch()
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "keep" read
True
>>> cr.is_proper_pair(read, 200) # "keep" read, but maxPairedFragmentLength is too short
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "improper pair"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "mismatch chr"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "same orientation1"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "same orientation2"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "rev first"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "rev first OK"
True
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "for first"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "for first"
True

run(allArgs=None)[source]

class sincei.ReadCounter.Tester[source]

Bases: object

getRead(readType)[source]: prepare arguments for test

sincei.ReadCounter.countReadsInRegions_wrapper(args)[source]: Passes the arguments to countReadsInRegions_worker. This is a step required given the constrains from the multiprocessing module. The args var, contains as first element the ‘self’ value from the countReadsPerBin object

sincei.ReadCounter.estimateSizeFactors(m)[source]

Compute size factors in the same way as DESeq2. The inverse of that is returned, as it’s then compatible with bamCoverage.

m : a numpy ndarray

>>> m = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 10, 0], [10, 5, 100]])
>>> sf = estimateSizeFactors(m)
>>> assert(np.all(np.abs(sf - [1.305, 0.9932, 0.783]) < 1e-4))
>>> m = np.array([[0, 0], [0, 1], [1, 1], [1, 2]])
>>> sf = estimateSizeFactors(m)
>>> assert(np.all(np.abs(sf - [1.1892, 0.8409]) < 1e-4))

sincei.ReadCounter.remove_row_of_zeros(matrix)[source]: remove rows containing all zeros or all nans

sincei.RegionQuery module

sincei.RegionQuery.get_bins_by_gene(dict, gene, firstBin=False)[source]

Returns the bins for a given gene.

Parameters

dictdict: Dictionary of bins and genes.
genestr: Gene name.
firstBinbool: If true, return only the first bin of the gene.

Returns

list: List of bins.

Examples

>>> dict = {'chr1_1': [('gene1', '+'), ('gene2', '-')], 'chr1_2': [('gene1', '+')]}
>>> get_bins_by_gene(dict, 'gene1')
['chr1_1', 'chr1_2']
>>> get_bins_by_gene(dict, 'gene1', firstBin=True)
'chr1_1'

sincei.RegionQuery.get_gtf_adata_olaps(adata, gtf)[source]

Get overlaps between adata and gtf

Parameters

adataAnnData: Annotated data matrix.
gtfGTF: GTF object

Returns

dict: Dictionary with overlaps for each gene in adata.

Examples

>>> test = Tester()
>>> gtf = GTF(test.gtfFile)
>>> adata = sc.read_10x_mtx(test.input_matrix_dir, var_names='gene_symbols', cache=True)
>>> olaps=get_gtf_adata_olaps(adata, gtf)
>>> olaps['Gm37381']
[('ENSMUSG00000064372', '+'), ('ENSMUSG00000064372', '-')]

sincei.TopicModels module

class sincei.TopicModels.TOPICMODEL(mat, cells, regions, n_topics, smart_code='lfu', n_passes=1, n_workers=1)[source]

Bases: object

Computes LSA or LDA for a given matrix and returns the cell-topic matrix

Parameters

matscipy.sparse.csr_matrix: Sparse matrix of shape (cells, regions)
cellslist: List of Cell IDs (corresponding to the input matrix rows)
regionslist: List of Regions (corresponding to the input matrix columns)
n_topicsint: Number of Topics / Principal Components
smart_codestr: SMART (System for the Mechanical Analysis and Retrieval of Text) code for weighting of input matrix for TFIDF. Only valid for the LSA model. The default (“lfu”) corresponds to “log”TF * IDF, and “pivoted unique” normalization of document length. For more information, see: https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System

Returns

An object of class TOPICMODELS.

get_cell_topic(pop_sparse_cells=False)[source]

Get cell-topic matrix from the updated object

Returns

cell_topicpandas dataframe: Cell-topic matrix

runLDA()[source]

Computes LDA model for a given matrix and returns the updated object

Returns

cell_topicpandas dataframe: Cell-topic matrix

runLSA()[source]

Computes LSA for a given matrix and returns the updated object

Returns

corpus_tfidfgensim corpus: TFIDF normalized corpus
corpus_lsigensim corpus: LSA corpus

sincei.Utilities module

sincei.Utilities.checkAlignedFraction(read, lowFilter)[source]

Check whether the fraction of read length that aligns to the reference is higher than the given threshold. Aligned fraction includes the max allowed mismatches tolerated by the aligner, and excludes InDels and Clippings.

Return: Bool

sincei.Utilities.checkBAMtag(bam, name, tag)[source]

sincei.Utilities.checkGCcontent(read, lowFilter, highFilter, returnGC=False)[source]

Checks if the GC content of the read is within the given range

Parameters

readpysam.AlignedSegment: A pysam AlignedSegment object
lowFilterfloat: Minimum GC content
highFilterfloat: Maximum GC content
returnGCbool: If true, return the GC content of the read

Returns

bool: True if the GC content of the read is within the given range

Examples

>>> test = Tester()
>>> read = test.bamFile1.fetch().next()
>>> checkGCcontent(read, 0.3, 0.7)
True

sincei.Utilities.checkMotifs(read, chrom, genome, readMotif, refMotif)[source]

Check whether a given motif is present in the read and the corresponding reference genome. For example, in MNAse (scChIC-seq) data, we expect the reads to have an ‘A’ at the 5’-end, while the genome has a ‘TA’ over hang (where the ‘A’ aligns with ‘A’ in the forward read), like this below.

Forwards aligned read: read has ‘A’, upstream has T R1 ……..A——-> ———-TA————Ref (+)

Rev aligned read: read has ‘T’, downstream has A

<——-T……. R1 ——–TA————Ref (+)

This function can look for any arbitrary motif in read and corresponding genome, but in the same orientation as described above.

Returns:: bool

>>> import pysam
>>> import os
>>> from scDeepTools.scReadCounter import CountReadsPerBin as cr
>>> root = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> bam = pysam.AlignmentFile("{}/test_TA_filtering.bam".format(root))
>>> iter = bam.fetch()
>>> read = next(iter)
>>> cr.checkMotifs(read, 'A', 'TA') # for valid scChIC read
True
>>> read = next(iter)
>>> cr.checkMotifs(read, 'A', 'TA') # for invalid scChIC read
False

sincei.Utilities.colorPicker(name)[source]

This function returns a list of colors for plotting.

Parameters

namestr: The name of the color palette to use.

Returns

list: A list of colors.

Examples

>>> colorPicker('twentyfive')
['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00', '#ffff33', '#a65628', '#f781bf', '#999999']

>>> colorPicker('colorblind')
['#0072B2', '#009E73', '#D55E00', '#CC79A7', '#F0E442', '#56B4E9', '#E69F00', '#000000']

sincei.Utilities.getDupFilterTuple(read, bc, filterArg)[source]

Returns a tuple with the information needed to filter duplicates, based on read and filter type. The tuple is composed of the barcode, the umi, the start and end positions and the chromosome name.

Parameters

readpysam.AlignedSegment: A pysam.AlignedSegment object
bcstr: The barcode
filterstr: A string with the type of filter to use.

Returns

tuple

A tuple with the information needed to filter duplicates. The tuple is composed of the barcode, the umi, the start and end positions and the chromosome name.

Examples

>>> test = Tester()
>>> read = test.bamFile1.fetch().next()
>>> getDupFilterTuple(read, 'ATCG', 'end_umi')
('ATCG', 'ATCG', None, None, 0, False)

sincei.Utilities.gini(i, X)[source]

Computes the Gini coefficient for each row of a sparse matrix (Obs*Var).

Parameters

iint: row index
Xnumpy array: matrix

Returns

float: Gini coefficient for the given row

Examples

>>> X = np.matrix([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
>>> gini(0, X)
0.0
>>> gini(1, X)
0.0
>>> gini(2, X)
0.0

sincei.WriteBedGraph module

class sincei.WriteBedGraph.WriteBedGraph(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]

Bases: CountReadsPerBin

Reads bam files coverages and writes a bedgraph or bigwig file

Extends the CountReadsPerBin object such that the coverage of bam files is writen to multiple bedgraph files at once.

The bedgraph files are later merge into one and converted into a bigwig file if necessary.

The constructor arguments are the same as for CountReadsPerBin. However, when calling the run method, the following parameters have to be passed

Examples

Given the following distribution of reads that cover 200 on a chromosome named ‘3R’:

  0                              100                           200
  |------------------------------------------------------------|
A                                ===============
                                                ===============


B                 ===============               ===============
                                 ===============
                                                ===============

>>> import tempfile
>>> test_path = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"

>>> outFile = tempfile.NamedTemporaryFile()
>>> bam_file = test_path +  "testA.bam"

For the example a simple scaling function is going to be used. This function takes the coverage found at each region and multiplies it to the scaling factor. In this case the scaling factor is 1.5

>>> function_to_call = scaleCoverage
>>> funcArgs = {'scaleFactor': 1.5}

Restrict process to a region between positions 0 and 200 of chromosome 3R

>>> region = '3R:0:200'

Set up such that coverage is computed for consecutive bins of length 25 bp >>> bin_length = 25 >>> step_size = 25

>>> num_sample_sites = 0 #overruled by step_size
>>> c = WriteBedGraph([bam_file], binLength=bin_length, region=region, stepSize=step_size)
>>> c.run(function_to_call, funcArgs, outFile.name)
>>> f = open(outFile.name, 'r')
>>> f.readlines()
['3R\t0\t100\t0\n', '3R\t100\t200\t1.5\n']
>>> f.close()
>>> outFile.close()

run(func_to_call, func_args, out_file_prefix, blackListFileName=None, format='bedgraph', smoothLength=0, normUsing=None)[source]

Given a list of bamfiles, a function and a function arguments, this method writes a bedgraph file (or bigwig) file for a partition of the genome into tiles of given size and a value for each tile that corresponds to the given function and that is related to the coverage underlying the tile.

Parameters

func_to_callstr: function name to be called to convert the list of coverages computed for each bam file at each position into a single value.
func_argsdict: dict of arguments to pass to func. E.g. {‘scaleFactor’:1.0}
out_file_prefixstr: name of the file to save the resulting data.

writeBedGraph_worker(chrom, start, end, func_to_call, func_args, bed_regions_list=None)[source]

Writes a bedgraph based on the read coverage per group of cells, indicated by cluster_info data frame.

The given func is called to compute the desired bedgraph value using the funcArgs

Parameters

chromstr: Chrom name
startint: start coordinate
endint: end coordinate
func_to_callstr: function name to be called to convert the list of coverages computed for each bam file at each position into a single value. An example is a function that takes the ratio between the coverage of two bam files.
func_argsdict: dict of arguments to pass to func.
smoothLengthint: Distance in bp for smoothing the coverage per tile.
bed_regions_list: list: List of tuples of the form (chrom, start, end) corresponding to bed regions to be processed. If not bed file was passed to the object constructor then this list is empty.

Returns

A list of [chromosome, start, end, temporary file], where the temporary file contains the bedgraph results for the region queried.

Examples

>>> test_path = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> bamFile1 = test_path +  "testA.bam"
>>> bin_length = 50
>>> number_of_samples = 0 # overruled by step_size
>>> func_to_call = scaleCoverage
>>> funcArgs = {'scaleFactor': 1.0}

>>> c = WriteBedGraph([bamFile1], bin_length, number_of_samples, stepSize=50)
>>> tempFile = c.writeBedGraph_worker( '3R', 0, 200, func_to_call, funcArgs)
>>> f = open(tempFile[3], 'r')
>>> f.readlines()
['3R\t0\t100\t0\n', '3R\t100\t200\t1\n']
>>> f.close()
>>> os.remove(tempFile[3])

sincei.WriteBedGraph.scaleCoverage(tile_coverage, args)[source]: Return coverage per cluster as sum of cells. tileCoverage should be an list with only one element

sincei.WriteBedGraph.writeBedGraph_wrapper(args)[source]: Passes the arguments to writeBedGraph_worker. This is a step required given the constrains from the multiprocessing module. The args var, contains as first element the ‘self’ value from the WriteBedGraph object

sincei.multimodalClustering module

sincei.multimodalClustering.multiModal_clustering(mode1_adata, mode2_adata, column_key='barcode_nla', nK=20)[source]

Performs multi-graph clustering on matched keys(barcodes) bw two anndata objects.

Parameters

mode1_adataAnnData: AnnData object for mode 1
mode2_adataAnnData: AnnData object for mode 2
column_keystr: Column name for the barcode in mode 1 and 2
nKint: Number of clusters to use for clustering

Returns

multi_umapDataFrame: DataFrame with UMAP coordinates and cluster labels
mode1_adataAnnData: AnnData object for mode 1
mode2_adataAnnData: AnnData object for mode 2

sincei.multimodalClustering.umap_aligned(pca_mode1, pca_mode2, nK=15, distance_metric='eucledian')[source]

Aligns two UMAP embeddings using the UMAP AlignedUMAP class

Parameters

pca_mode1pandas.DataFrame: UMAP embedding of RNA data
pca_mode2pandas.DataFrame: UMAP embedding of CHiC data
nKint: Number of nearest neighbors to use for UMAP
distance_metricstr: Distance metric to use for UMAP

Returns

pandas.DataFrame: Aligned UMAP embedding of RNA data
pandas.DataFrame: Aligned UMAP embedding of CHiC data

Examples

>>> test = Tester()
>>> pca_mode1 = test.pca_mode1
>>> pca_mode2 = test.pca_mode2
>>> umap_aligned(pca_mode1, pca_mode2)
(                 aligned_UMAP1  aligned_UMAP2
0  -0.012995  0.001206
1   0.

sincei.scBAMops module

sincei.scBAMops.convertBED(oname, tmpFiles, chromDict)[source]

Stores results in BEDPE format, which is: chromosome frag_leftend frag_rightend

The fragment ends can be shifted

sincei.scBAMops.filterWorker(arglist)[source]

sincei.scBAMops.get_args()[source]

sincei.scBAMops.main(args=None)[source]

sincei.scBAMops.parseArguments()[source]

sincei.scBAMops.shiftRead(b, chromDict, args)[source]

sincei.scBulkCoverage module

class sincei.scBulkCoverage.CenterFragment(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]

Bases: WriteBedGraph

Class to redefine the get_fragment_from_read for the –MNase case

The coverage of the fragment is defined as the 2 or 3 basepairs at the center of the fragment length.

get_fragment_from_read(read)[source]: Takes a proper pair fragment of high quality and limited to a certain length and outputs the center

class sincei.scBulkCoverage.OffsetFragment(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]

Bases: WriteBedGraph

Class to redefine the get_fragment_from_read for the –Offset case

filterStrand(read, rv)[source]

A generic read filtering function that gets used by everything in this class.

rv is returned if the strand is correct, otherwise [(None, None)]

get_fragment_from_read(read)[source]: This is mostly a wrapper for self.get_fragment_from_read_list(), which needs a list and for the offsets to be tweaked by 1.

get_fragment_from_read_list(read, offset)[source]: Return the range of exons from the 0th through 1st bases, inclusive. Positions are 1-based

sincei.scBulkCoverage.get_args()[source]

sincei.scBulkCoverage.main(args=None)[source]

sincei.scBulkCoverage.parseArguments()[source]

sincei.scClusterCells module

sincei.scClusterCells.get_args()[source]

sincei.scClusterCells.main(args=None)[source]

sincei.scClusterCells.parseArguments()[source]

sincei.scCombineCounts module

sincei.scCombineCounts.get_args()[source]

sincei.scCombineCounts.main(args=None)[source]

sincei.scCombineCounts.parseArguments()[source]

sincei.scCountQC module

sincei.scCountQC.filter_adata(adata, filter_region_dict=None, filter_cell_dict=None, bad_chrom=None, bad_cells=None)[source]

sincei.scCountQC.get_args()[source]

sincei.scCountQC.main(args=None)[source]

sincei.scCountQC.parseArguments()[source]

sincei.scCountReads module

sincei.scCountReads.get_args()[source]

sincei.scCountReads.main(args=None)[source]

1. get read counts at different positions either all of same length or from genomic regions from the BED file

save data for further plotting

sincei.scCountReads.parseArguments(args=None)[source]

sincei.scFilterBarcodes module

sincei.scFilterBarcodes.count_occurrences(res)[source]

sincei.scFilterBarcodes.getFiltered_worker(arglist)[source]

sincei.scFilterBarcodes.get_args()[source]

sincei.scFilterBarcodes.ham_dist(s1, s2)[source]

sincei.scFilterBarcodes.main(args=None)[source]

sincei.scFilterBarcodes.parseArguments()[source]

sincei.scFilterStats module

sincei.scFilterStats.getFiltered_worker(arglist)[source]

sincei.scFilterStats.main(args=None)[source]

sincei.scFilterStats.parseArguments()[source]

sincei.scJSD module

sincei.scJSD.get_args()[source]

sincei.scJSD.main(args=None)[source]

sincei.scJSD.parseArguments(args=None)[source]

sincei.sincei module

sincei.sincei.main(args=None)[source]

sincei.sincei.parse_arguments(args=None)[source]

sincei.sincei.process_args(args=None)[source]