sincei package

Submodules

sincei.ExponentialFamily module

class sincei.ExponentialFamily.Bernoulli(family_params=None, **kwargs)[source]

Bases: ExponentialFamily

Bernoulli distribution

family_params of interest:
  • “max_val” (int) corresponding to the max value (replaces infinity).

Empirically, values above 10 yield similar results.

base_measure(X: Tensor)[source]
invert_g(X: Tensor)[source]
log_likelihood(X: Tensor, theta: Tensor)[source]

Computes negative log-likelihood between dataset X and parameters theta

log_partition(theta: Tensor)[source]
natural_parametrization(theta: Tensor)[source]
sufficient_statistics(X: Tensor)[source]
class sincei.ExponentialFamily.Beta(family_params=None, **kwargs)[source]

Bases: ExponentialFamily

Beta distribution, using a standard formulation.

Original formulation presented in [Mourragui et al, 2023].

family_params of interest:
  • “min_val” (int): min data value (replaces 0 and 1).

  • “n_jobs” (int): number of jobs, specifically

for computing the “nu” parameter. - “method” (str): method use to compute the “nu” parameter per feature. Two possibles: “MLE” and “MM”. Default to “MLE”. - “eps” (float): minimum difference used for inverting the g function. Default to 1e-4 - “maxiter” (int): maximum number of iterations for the inversion of the g function. Default to 100.

base_measure(X: Tensor)[source]
exponential_term(X: Tensor, theta: Tensor)[source]
initialize_family_parameters(X: Optional[Tensor] = None)[source]

General method to initialize certain parameters (e.g. for Beta or Negative Binomial

invert_g(X: Optional[Tensor] = None)[source]

Dichotomy to find where derivative maxes out

log_partition(theta: Tensor)[source]
natural_parametrization(theta: Tensor)[source]
sufficient_statistics(X: Tensor)[source]
class sincei.ExponentialFamily.ExponentialFamily(family_params=None, **kwargs)[source]

Bases: object

Encodes an exponential family distribution using PyTorch autodiff structures.

ExponentialFamily corresponds to the superclass providing a backbone for all exponential family. Each subclass should contain the following methods, defined based on the distribution of choice (same notation as [Mourragui et al, 2023]):

  • sufficient_statistics ($T$),

  • natural_parametrization ($eta$),

  • log_partition ($A$).

  • invert_g ($g^{-1}$).

  • initialize_family_parameters: computes parameters used in other

methods, e.g., gene-level dispersion for Negative Binomial.

We added a “base_measure” for sake of completeness, but this method is not necessary for running GLM-PCA. The log-likelihood and exponential term are defined directly from the aforementionned methods.

Parameters

family_nameint

Name of the family.

base_measure(X: Tensor)[source]
distribution(X: Tensor, theta: Tensor)[source]
exponential_term(X: Tensor, theta: Tensor)[source]
initialize_family_parameters(X: Optional[Tensor] = None)[source]

General method to initialize certain parameters (e.g. for Beta or Negative Binomial

invert_g(X: Tensor)[source]
load_family_params_to_gpu(device)[source]
log_distribution(X: Tensor, theta: Tensor)[source]
log_likelihood(X: Tensor, theta: Tensor)[source]

Computes negative log-likelihood between dataset X and parameters theta

log_partition(theta: Tensor)[source]
natural_parametrization(theta: Tensor)[source]
sufficient_statistics(X: Tensor)[source]
class sincei.ExponentialFamily.Gamma(family_params=None, **kwargs)[source]

Bases: ExponentialFamily

Gamma distribution using a standard formulation.

Original formulation presented in [Mourragui et al, 2023].

family_params of interest:
  • “min_val” (int): min data value, default to 1e-5.

  • “max_val” (int): max data value, default to 10e6.

  • “n_jobs” (int): number of jobs, specifically

for computing the “nu” parameter. - “method” (str): method use to compute the “nu” parameter per feature. Two possibles: “MLE” and “MM”. Default to “MLE”. - “eps” (float): minimum difference used for inverting the g function. Default to 1e-4 - “maxiter” (int): maximum number of iterations for the inversion of the g function. Default to 100.

exponential_term(X: Tensor, theta: Tensor)[source]
initialize_family_parameters(X: Optional[Tensor] = None)[source]

General method to initialize certain parameters (e.g. for Beta or Negative Binomial

invert_g(X: Optional[Tensor] = None)[source]

Dichotomy to compute inverse of digamma function.

log_partition(theta: Tensor)[source]
natural_parametrization(theta: Tensor)[source]
sufficient_statistics(X: Tensor)[source]
class sincei.ExponentialFamily.Gaussian(family_params=None, **kwargs)[source]

Bases: ExponentialFamily

Gaussian with standard deviation one.

GLMPCA with Gaussian as family is equivalent to the standard PCA.

base_measure(X: Tensor)[source]
invert_g(X: Tensor)[source]
log_partition(theta: Tensor)[source]
natural_parametrization(theta: Tensor)[source]
sufficient_statistics(X: Tensor)[source]
class sincei.ExponentialFamily.LogNormal(family_params=None, **kwargs)[source]

Bases: ExponentialFamily

Log-normal distribution using a standard formulation.

Original formulation presented in [Mourragui et al, 2023].

family_params of interest:
  • “min_val” (int): min data value, default to 1e-5.

  • “max_val” (int): max data value, default to 10e6.

  • “n_jobs” (int): number of jobs, specifically

for computing the “nu” parameter. - “method” (str): method use to compute the “nu” parameter per feature. Two possibles: “MLE” and “MM”. Default to “MLE”. - “eps” (float): minimum difference used for inverting the g function. Default to 1e-4 - “maxiter” (int): maximum number of iterations for the inversion of the g function. Default to 100.

base_measure(X: Tensor)[source]
exponential_term(X: Tensor, theta: Tensor)[source]
initialize_family_parameters(X: Optional[Tensor] = None)[source]

General method to initialize certain parameters (e.g. for Beta or Negative Binomial

invert_g(X: Optional[Tensor] = None)[source]
log_partition(theta: Tensor)[source]
natural_parametrization(theta: Tensor)[source]
sufficient_statistics(X: Tensor)[source]
class sincei.ExponentialFamily.Poisson(family_params=None, **kwargs)[source]

Bases: ExponentialFamily

Poisson distribution

family_params of interest:
  • “min_val” (int) corresponding to the min value (replaces 0).

base_measure(X: Tensor)[source]
invert_g(X: Tensor)[source]
log_distribution(X: Tensor, theta: Tensor)[source]

The computation of gamma function for the base measure (h) would lead to inf, hence a re-design of the method.

log_partition(theta: Tensor)[source]
natural_parametrization(theta: Tensor)[source]
sufficient_statistics(X: Tensor)[source]
class sincei.ExponentialFamily.SigmoidBeta(family_params=None, **kwargs)[source]

Bases: Beta

Beta distribution re-parametrized using a Sigmoid.

This distribution is similar to the previous Beta (which it inherits from) but the natural parameter is re-parametrized using a Sigmoid. This is shown expeerimentally to stabilize the optimisation by removing the ]0,1[ constraint.

family_params of interest:
  • “min_val” (int): min data value (replaces 0 and 1).

  • “n_jobs” (int): number of jobs, specifically

for computing the “nu” parameter. - “method” (str): method use to compute the “nu” parameter per feature. Two possibles: “MLE” and “MM”. Default to “MLE”. - “eps” (float): minimum difference used for inverting the g function. Default to 1e-4 - “maxiter” (int): maximum number of iterations for the inversion of the g function. Default to 100.

invert_g(X: Optional[Tensor] = None)[source]

Dichotomy to find where derivative maxes out

natural_parametrization(theta: Tensor)[source]

sincei.FragmentFFT module

sincei.FragmentFFT.fftplot(outdict, plot)[source]

Computes the periodicity of the fragment distribution

Parameters

outdictdict

Dictionary containing the fragment distribution for each barcode

plotstr

File name to save the plot

Returns

dict

Dictionary containing the periodicity of the fragment distribution for each barcode

Examples

>>> test = Tester()
>>> d = test.get_fragment_distribution()
>>> fftplot(d, 'test.png')
sincei.FragmentFFT.ffttable(selected)[source]

Computes the FFT of the fragment length distribution

Parameters

selectednumpy array

Array of fragment lengths

Returns

pandas dataframe

Dataframe with two columns. The first column contains the frequencies and the second column the power spectrum.

Examples

>>> test = Tester()
>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 50, 4)
>>> num_reads_per_bin, regionList = c.run()
>>> selected = num_reads_per_bin[:,0]
>>> ffttable(selected)
   freq         value
0   0.0  7.812500e+06
1   1.0  1.812500e+06
2   2.0  1.812500e+06
3   3.0  1.812500e+06
4   4.0  1.812500e+06
5   5.0  1.812500
sincei.FragmentFFT.fragment_distribution(fragment_len_dict, length_plot)[source]

Plot fragment length distribution

Parameters

fragment_len_dictdict

Dictionary containing the fragment length for each barcode.

length_plotstr

File name to save the plot.

Returns

dict

Dictionary containing the top 20 frequencies for each barcode.

Examples

>>> test = Tester()
>>> fragment_len_dict = {'AAACCTGAGAGGTTCT': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000],
...                      'AAACCTGAGAGGTTCT': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]}
>>> fragment_distribution(fragment_len_dict, test.length_plot)

sincei.GLMPCA module

class sincei.GLMPCA.GLMPCA(n_pc, family='gaussian', family_params=None, max_iter=100, learning_rate=0.2, batch_size=256, step_size=20, gamma=0.5, n_init=1, init='spectral', n_jobs=1)[source]

Bases: object

Performs GLM-PCA on a data matrix to reduce dimensionality

This class computes the generalized-linear model principal components (GLM-PC) of a dataset by exploiting the framework of saturated parameters. Specifically, given an exponential family distribution choosen following some prior knowledge, GLM-PCA will find a collection of directions which maximise the reconstruction error, computed as the negative log-likelihood of the chosen exponential family.

By making use of an alternative formulation, our implementation can exploit automatic differentiation and can therefore rely on mini-batch Stochastic Gradient Descent. As a consequence, it scales to very large dataset.

Another interesting feature of our implementation is that it does not require cumbersome Lagrangian derivations. If you wish to test an exponential family distribution not present in our implementation, adding a class in ExponentialFamily with the different density functionals defined there would suffice to use GLM-PCA.

Parameters

n_pcint

Number of principal components.

family: str

Name of the exponential family distribution. Possible families: “gaussian”, “poisson”, “bernoulli”, “beta”, “gamma”, “log_normal”, “log_beta”:, “sigmoid_beta”. Default to “gaussian”

family_paramsdict

Dictionary with family parameters to be added. List of parameters depend on the specific ExponentialFamily class chosen. Examples:

  • “n_jobs” (int) for parallelization, specifically for “beta” and “gamma”.

  • “min_val” (float) for truncating in “poisson” or “beta”.

  • “eps” (float) for convergence in inverse computation in “beta”.

Default to None.

max_iterint

Maximum number of epochs in the GLM-PCA optimisation. Default to 100.

learning_rate: float

Learning rate to be used in the GLM-PCA optimisation. If learning_rate is too high and lead to NaN, our implementation automatically restarts the optimisation with a smaller value. Default to 0.2.

batch_sizeint

Size of the batch in the SGD optimisation step. Default to 256.

step_size: int

Step size in optimiser scheduler. See more: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html Default to 20.

gamma: int

Reduction parameter for optimiser scheduler. See more: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html Default to 0.5

n_init: int

Number of GLM-PCA initializations. Useful if you want to explore different random seeds and starting points. Default to 1.

init: str

Method to initialize loadings. “spectral” performs SVD on the saturated parameters from a small random batch of the dataset, “random” performs a random initialization on the Stiefel manifold. Default to “spectral”.

n_jobs: int

Number of jobs used in parallel operations. Default to 1.

fit(X)[source]

Fits a GLM-PCA to a specific dataset.

Parameters

Xtorch.Tensor or np.array

Dataset with cells in the rows and features in the columns.

Returns

bool

Returns True if the fitting procedure has been successful.

transform(X)[source]

Transforms and project dataset X onto the principal components.

Parameters

Xtorch.Tensor or np.array

Dataset with cells in the rows and features in the columns.

Returns

torch.Tensor

Projected saturated parameters.

sincei.GetStats module

sincei.GetStats.getStats_worker(arglist)[source]

Computes statistics for each read in a bam file

This function computes statistics for each read in a bam file.

Parameters

bamfileslist

List containing the names of indexed bam files. E.g. [‘file1.bam’, ‘file2.bam’]

binSizeint

Length of the window/bin. This value is overruled by bedFile if present.

barcodeslist

list of barcodes to count the reads from.

numberOfSamplesint

Total number of samples. The genome is divided into numberOfSamples, each with a window/bin length equal to binLength. This value is overruled by stepSize in case such value is present and by bedFile in which case the number of samples and bins are defined in the bed file

numberOfProcessorsint

Number of processors to use. Default is 4

verbosebool

Output messages. Default: False

regionstr

Region to limit the computation in the form chrom:start

sincei.ParserCommon module

sincei.ParserCommon.bamOptions(args=None, suppress_args=None, default_opts=None)[source]

Arguments for tools that operate on BAM files

sincei.ParserCommon.filterOptions(args=None, suppress_args=None)[source]
sincei.ParserCommon.genomicRegion(string)[source]
sincei.ParserCommon.get_default(name, argdict)[source]
sincei.ParserCommon.inputOutputOptions(args=None, opts=None, requiredOpts=[], suppress_args=None)[source]
sincei.ParserCommon.numberOfProcessors(string)[source]
sincei.ParserCommon.otherOptions(args=None)[source]
sincei.ParserCommon.plotOptions(args=None)[source]
sincei.ParserCommon.process_args(args=None)[source]
sincei.ParserCommon.readOptions(args=None, suppress_args=None)[source]

Common arguments related to BAM files and the interpretation of the read coverage

sincei.ParserCommon.show_or_hide(help, name, arglist)[source]
sincei.ParserCommon.smartLabel(label)[source]

Remove the path name and the last extension from the file name /pth/to/file.name.bam -> file.name

sincei.ParserCommon.smartLabels(labels)[source]
sincei.ParserCommon.validateInputs(args, process_barcodes=True)[source]

Ensure that right input is provided from argparse

sincei.ReadCounter module

class sincei.ReadCounter.CountReadsPerBin(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]

Bases: object

Collects coverage over multiple bam files using multiprocessing

This function collects read counts (coverage) from several bam files and returns an numpy array with the results. This class uses multiprocessing to compute the coverage.

Parameters

bamFilesListlist

List containing the names of indexed bam files. E.g. [‘file1.bam’, ‘file2.bam’]

binLengthint

Length of the window/bin. This value is overruled by bedFile if present.

barcodeslist

list of barcodes to count the reads from.

numberOfSamplesint

Total number of samples. The genome is divided into numberOfSamples, each with a window/bin length equal to binLength. This value is overruled by stepSize in case such value is present and by bedFile in which case the number of samples and bins are defined in the bed file

numberOfProcessorsint

Number of processors to use. Default is 4

verbosebool

Output messages. Default: False

regionstr

Region to limit the computation in the form chrom:start:end.

bedFilelist of file_handles.

Each file handle corresponds to a bed file containing the regions for which to compute the coverage. This option overrules binLength, numberOfSamples and stepSize.

blackListFileNamestr

A string containing a BED file with blacklist regions.

extendReads : bool, int

Whether coverage should be computed for the extended read length (i.e. the region covered by the two mates or the regions expected to be covered by single-reads). If the value is ‘int’, then then this is interpreted as the fragment length to extend reads that are not paired. For Illumina reads, usual values are around 300. This value can be determined using the peak caller MACS2 or can be approximated by the fragment lengths computed when preparing the library for sequencing. If the value is of the variable is true and not value is given, the fragment size is sampled from the library but only if the library is paired-end. Default: False

minMappingQualityint

Reads of a mapping quality less than the give value are not considered. Default: None

duplicateFilterstr

Type of duplicate filter to use (same start, end position, umi and barcodes. If paired-end, same start-end for mates) are to be excluded. Default: None

chrToSkip: list

List with names of chromosomes that do not want to be included in the coverage computation. This is useful to remove unwanted chromosomes (e.g. ‘random’ or ‘Het’).

stepSizeint

the positions for which the coverage is computed are defined as follows: range(start, end, stepSize). Thus, a stepSize of 1, will compute the coverage at each base pair. If the stepSize is equal to the binLength then the coverage is computed for consecutive bins. If seepSize is smaller than the binLength, then teh bins will overlap.

center_readbool

Determines if reads should be centered with respect to the fragment length.

samFlag_includeint

Extracts only those reads having the SAM flag. For example, to get only reads that are the first mates a samFlag of 64 could be used. Similarly, the samFlag_include can be used to select only reads mapping on the reverse strand or to get only properly paired reads.

samFlag_excludeint

Removes reads that match the SAM flag. For example to get all reads that map to the forward strand a samFlag_exlude 16 should be used. Which translates into exclude all reads that map to the reverse strand.

zerosToNansbool

If true, zero values encountered are transformed to Nans. Default false.

skipZeroOverZerobool

If true, skip bins where all input BAM files have no coverage (only applicable to bamCompare).

minFragmentLengthint

If greater than 0, fragments below this size are excluded.

maxFragmentLengthint

If greater than 0, fragments above this size are excluded.

minAlignedFractionfloat

fragments where less than the given fraction of bases align are excluded.

motifFilterlist

Only alignments with reads containing the given motif at 5’-end and genome 5’-end are counted.

GCcontentFilterlist

Only alignments with given min and max GC content are counted.

genome2bitstr

2 bit file for the genome (if motifFilter is specified)

out_file_for_raw_datastr

File name to save the raw counts computed

statsListlist

For each BAM file in bamFilesList, the associated per-chromosome statistics returned by openBam

mappedListlist

For each BAM file in bamFilesList, the number of mapped reads in the file.

bed_and_binboolean

If true AND a bedFile is given, compute coverage of each bin of the given size in each region of bedFile

sumCoveragePerBinboolean

If true return cumulative coverage per bin, instead of total read counts (for plotFingerPrint)

genomeChunkSizeint

If not None, the length of the genome used for multiprocessing.

Returns

numpy array

Each row correspond to each bin/bed region and each column correspond to each of the bamFiles.

Examples

The test data contains reads for 200 bp.

>>> test = Tester()

The transpose function is used to get a nicer looking output. The first line corresponds to the number of reads per bin in bam file 1

>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 50, 4)
>>> np.transpose(c.run())
array([[0., 0., 1., 1.],
       [0., 1., 1., 2.]])
count_reads_in_region(chrom, start, end, bed_regions_list=None)[source]

Counts the reads in each bam file at each ‘stepSize’ position within the interval (start, end) for a window or bin of size binLength.

The stepSize controls the distance between bins. For example, a step size of 20 and a bin size of 20 will create bins next to each other. If the step size is smaller than the bin size the bins will overlap.

If a list of bedRegions is given, then the number of reads that overlaps with each region is counted.

Parameters

chromstr

Chrom name

startint

start coordinate

endint

end coordinate

barcodes: list

List of barcodes to count (currently set for tag ‘BC’ in the BAM)

bed_regions_list: list

List of list of tuples of the form (start, end) corresponding to bed regions to be processed. If not bed file was passed to the object constructor then this list is empty.

Returns

numpy array

The result is a numpy array that as rows each bin and as columns each bam file.

Examples

Initialize some useful values

>>> test = Tester()
>>> c = CountReadsPerBin([test.bamFile1, test.bamFile2], 25, 0, stepSize=50)

The transpose is used to get better looking numbers. The first line corresponds to the number of reads per bin in the first bamfile.

>>> _array, __ = c.count_reads_in_region(test.chrom, 0, 200)
>>> _array
array([[0., 0.],
       [0., 1.],
       [1., 1.],
       [1., 2.]])
getReadLength(read)[source]
getSmoothRange(tileIndex, tileSize, smoothRange, maxPosition)[source]

Given a tile index position and a tile size (length), return the a new indices over a larger range, called the smoothRange. This region is centered in the tileIndex an spans on both sizes to cover the smoothRange. The smoothRange is trimmed in case it is less than zero or greater than maxPosition

---------------|==================|------------------
           tileStart
      |--------------------------------------|
      |    <--      smoothRange     -->      |
      |
tileStart - (smoothRange-tileSize)/2

Test for a smooth range that spans 3 tiles.

Examples

>>> c = CountReadsPerBin([], 1, 1, 1, 0)
>>> c.getSmoothRange(5, 1, 3, 10)
(4, 7)

Test smooth range truncated on start.

>>> c.getSmoothRange(0, 10, 30, 200)
(0, 2)

Test smooth range truncated on start.

>>> c.getSmoothRange(1, 10, 30, 4)
(0, 3)

Test smooth range truncated on end.

>>> c.getSmoothRange(5, 1, 3, 5)
(4, 5)

Test smooth range not multiple of tileSize.

>>> c.getSmoothRange(5, 10, 24, 10)
(4, 6)
get_chunk_length(bamFilesHandles, genomeSize, chromSizes, chrLengths)[source]
get_coverage_of_region(bamHandle, chrom, regions, fragmentFromRead_func=None)[source]

Returns a numpy array that corresponds to the number of reads that overlap with each tile.

>>> test = Tester()
>>> import pysam
>>> c = CountReadsPerBin([], stepSize=1, extendReads=300)

For this case the reads are length 36. The number of overlapping read fragments is 4 and 5 for the positions tested.

>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000833, 5000834), (5000834, 5000835)])
array([4., 5.])

In the following example a paired read is extended to the fragment length which is 100 The first mate starts at 5000000 and the second at 5000064. Each mate is extended to the fragment length independently At position 500090-500100 one fragment of length 100 overlap, and after position 5000101 there should be zero reads.

>>> c.zerosToNans = True
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile_PE), 'chr2',
... [(5000090, 5000100), (5000100, 5000110)])
array([ 1., nan])

In the following case the reads length is 50. Reads are not extended.

>>> c.extendReads=False
>>> c.get_coverage_of_region(pysam.AlignmentFile(test.bamFile2), '3R', [(148, 150), (150, 152), (152, 154)])
array([1., 2., 2.])
get_fragment_from_read(read)[source]

Get read start and end position of a read. If given, the reads are extended as follows: If reads are paired end, each read mate is extended to match the fragment length, otherwise, a default fragment length is used. If reads are split (give by the CIGAR string) then the multiple positions of the read are returned. When reads are extended the cigar information is skipped.

Parameters

read: pysam object.

The following values are defined (for forward reads):

     |--          -- read.tlen --              --|
     |-- read.alen --|
-----|===============>------------<==============|----
     |               |            |
read.reference_start
            read.reference_end  read.pnext

  and for reverse reads


     |--             -- read.tlen --           --|
                                 |-- read.alen --|
-----|===============>-----------<===============|----
     |                           |               |
  read.pnext           read.reference_start  read.reference_end

this is a sketch of a pair-end reads

The function returns the fragment start and end, either using the paired end information (if available) or extending the read in the appropriate direction if this is single-end.

Parameters

read : pysam read object

Returns

list of tuples

[(fragment start, fragment end)]

>>> test = Tester()
>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True)
>>> c.defaultFragmentLength=100
>>> c.get_fragment_from_read(test.getRead("paired-forward"))
[(5000000, 5000100)]
>>> c.get_fragment_from_read(test.getRead("paired-reverse"))
[(5000000, 5000100)]
>>> c.defaultFragmentLength = 200
>>> c.get_fragment_from_read(test.getRead("single-forward"))
[(5001491, 5001691)]
>>> c.get_fragment_from_read(test.getRead("single-reverse"))
[(5001536, 5001736)]
>>> c.defaultFragmentLength = 'read length'
>>> c.get_fragment_from_read(test.getRead("single-forward"))
[(5001491, 5001527)]
>>> c.defaultFragmentLength = 'read length'
>>> c.extendReads = False
>>> c.get_fragment_from_read(test.getRead("paired-forward"))
[(5000000, 5000036)]

Tests for read centering.

>>> c = CountReadsPerBin([], 1, 1, 200, extendReads=True, center_read=True)
>>> c.defaultFragmentLength = 100
>>> assert(c.get_fragment_from_read(test.getRead("paired-forward")) == [(5000032, 5000068)])
>>> c.defaultFragmentLength = 200
>>> assert(c.get_fragment_from_read(test.getRead("single-reverse")) == [(5001618, 5001654)])
static is_proper_pair(read, maxPairedFragmentLength)[source]

Checks if a read is proper pair meaning that both mates are facing each other and are in the same chromosome and are not to far away. The sam flag for proper pair can not always be trusted. Note that if the fragment size is > maxPairedFragmentLength (~2kb usually) that False will be returned. :return: bool

>>> import pysam
>>> import os
>>> from deeptools.countReadsPerBin import CountReadsPerBin as cr
>>> root = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> bam = pysam.AlignmentFile("{}/test_proper_pair_filtering.bam".format(root))
>>> iter = bam.fetch()
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "keep" read
True
>>> cr.is_proper_pair(read, 200) # "keep" read, but maxPairedFragmentLength is too short
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "improper pair"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "mismatch chr"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "same orientation1"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "same orientation2"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "rev first"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "rev first OK"
True
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "for first"
False
>>> read = next(iter)
>>> cr.is_proper_pair(read, 1000) # "for first"
True
run(allArgs=None)[source]
class sincei.ReadCounter.Tester[source]

Bases: object

getRead(readType)[source]

prepare arguments for test

sincei.ReadCounter.countReadsInRegions_wrapper(args)[source]

Passes the arguments to countReadsInRegions_worker. This is a step required given the constrains from the multiprocessing module. The args var, contains as first element the ‘self’ value from the countReadsPerBin object

sincei.ReadCounter.estimateSizeFactors(m)[source]

Compute size factors in the same way as DESeq2. The inverse of that is returned, as it’s then compatible with bamCoverage.

m : a numpy ndarray

>>> m = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [0, 10, 0], [10, 5, 100]])
>>> sf = estimateSizeFactors(m)
>>> assert(np.all(np.abs(sf - [1.305, 0.9932, 0.783]) < 1e-4))
>>> m = np.array([[0, 0], [0, 1], [1, 1], [1, 2]])
>>> sf = estimateSizeFactors(m)
>>> assert(np.all(np.abs(sf - [1.1892, 0.8409]) < 1e-4))
sincei.ReadCounter.remove_row_of_zeros(matrix)[source]

remove rows containing all zeros or all nans

sincei.RegionQuery module

sincei.RegionQuery.get_bins_by_gene(dict, gene, firstBin=False)[source]

Returns the bins for a given gene.

Parameters

dictdict

Dictionary of bins and genes.

genestr

Gene name.

firstBinbool

If true, return only the first bin of the gene.

Returns

list

List of bins.

Examples

>>> dict = {'chr1_1': [('gene1', '+'), ('gene2', '-')], 'chr1_2': [('gene1', '+')]}
>>> get_bins_by_gene(dict, 'gene1')
['chr1_1', 'chr1_2']
>>> get_bins_by_gene(dict, 'gene1', firstBin=True)
'chr1_1'
sincei.RegionQuery.get_gtf_adata_olaps(adata, gtf)[source]

Get overlaps between adata and gtf

Parameters

adataAnnData

Annotated data matrix.

gtfGTF

GTF object

Returns

dict

Dictionary with overlaps for each gene in adata.

Examples

>>> test = Tester()
>>> gtf = GTF(test.gtfFile)
>>> adata = sc.read_10x_mtx(test.input_matrix_dir, var_names='gene_symbols', cache=True)
>>> olaps=get_gtf_adata_olaps(adata, gtf)
>>> olaps['Gm37381']
[('ENSMUSG00000064372', '+'), ('ENSMUSG00000064372', '-')]

sincei.TopicModels module

class sincei.TopicModels.TOPICMODEL(mat, cells, regions, n_topics, smart_code='lfu', n_passes=1, n_workers=1)[source]

Bases: object

Computes LSA or LDA for a given matrix and returns the cell-topic matrix

Parameters

matscipy.sparse.csr_matrix

Sparse matrix of shape (cells, regions)

cellslist

List of Cell IDs (corresponding to the input matrix rows)

regionslist

List of Regions (corresponding to the input matrix columns)

n_topicsint

Number of Topics / Principal Components

smart_codestr

SMART (System for the Mechanical Analysis and Retrieval of Text) code for weighting of input matrix for TFIDF. Only valid for the LSA model. The default (“lfu”) corresponds to “log”TF * IDF, and “pivoted unique” normalization of document length. For more information, see: https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System

Returns

An object of class TOPICMODELS.

get_cell_topic(pop_sparse_cells=False)[source]

Get cell-topic matrix from the updated object

Returns

cell_topicpandas dataframe

Cell-topic matrix

runLDA()[source]

Computes LDA model for a given matrix and returns the updated object

Returns

cell_topicpandas dataframe

Cell-topic matrix

runLSA()[source]

Computes LSA for a given matrix and returns the updated object

Returns

corpus_tfidfgensim corpus

TFIDF normalized corpus

corpus_lsigensim corpus

LSA corpus

sincei.Utilities module

sincei.Utilities.checkAlignedFraction(read, lowFilter)[source]

Check whether the fraction of read length that aligns to the reference is higher than the given threshold. Aligned fraction includes the max allowed mismatches tolerated by the aligner, and excludes InDels and Clippings.

Return: Bool

sincei.Utilities.checkBAMtag(bam, name, tag)[source]
sincei.Utilities.checkGCcontent(read, lowFilter, highFilter, returnGC=False)[source]

Checks if the GC content of the read is within the given range

Parameters

readpysam.AlignedSegment

A pysam AlignedSegment object

lowFilterfloat

Minimum GC content

highFilterfloat

Maximum GC content

returnGCbool

If true, return the GC content of the read

Returns

bool

True if the GC content of the read is within the given range

Examples

>>> test = Tester()
>>> read = test.bamFile1.fetch().next()
>>> checkGCcontent(read, 0.3, 0.7)
True
sincei.Utilities.checkMotifs(read, chrom, genome, readMotif, refMotif)[source]

Check whether a given motif is present in the read and the corresponding reference genome. For example, in MNAse (scChIC-seq) data, we expect the reads to have an ‘A’ at the 5’-end, while the genome has a ‘TA’ over hang (where the ‘A’ aligns with ‘A’ in the forward read), like this below.

Forwards aligned read: read has ‘A’, upstream has T R1 ……..A——-> ———-TA————Ref (+)

Rev aligned read: read has ‘T’, downstream has A

<——-T……. R1 ——–TA————Ref (+)

This function can look for any arbitrary motif in read and corresponding genome, but in the same orientation as described above.

Returns:

bool

>>> import pysam
>>> import os
>>> from scDeepTools.scReadCounter import CountReadsPerBin as cr
>>> root = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> bam = pysam.AlignmentFile("{}/test_TA_filtering.bam".format(root))
>>> iter = bam.fetch()
>>> read = next(iter)
>>> cr.checkMotifs(read, 'A', 'TA') # for valid scChIC read
True
>>> read = next(iter)
>>> cr.checkMotifs(read, 'A', 'TA') # for invalid scChIC read
False
sincei.Utilities.colorPicker(name)[source]

This function returns a list of colors for plotting.

Parameters

namestr

The name of the color palette to use.

Returns

list

A list of colors.

Examples

>>> colorPicker('twentyfive')
['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00', '#ffff33', '#a65628', '#f781bf', '#999999']
>>> colorPicker('colorblind')
['#0072B2', '#009E73', '#D55E00', '#CC79A7', '#F0E442', '#56B4E9', '#E69F00', '#000000']
sincei.Utilities.getDupFilterTuple(read, bc, filterArg)[source]

Returns a tuple with the information needed to filter duplicates, based on read and filter type. The tuple is composed of the barcode, the umi, the start and end positions and the chromosome name.

Parameters

readpysam.AlignedSegment

A pysam.AlignedSegment object

bcstr

The barcode

filterstr

A string with the type of filter to use.

Returns

tuple

A tuple with the information needed to filter duplicates. The tuple is composed of the barcode, the umi, the start and end positions and the chromosome name.

Examples

>>> test = Tester()
>>> read = test.bamFile1.fetch().next()
>>> getDupFilterTuple(read, 'ATCG', 'end_umi')
('ATCG', 'ATCG', None, None, 0, False)
sincei.Utilities.gini(i, X)[source]

Computes the Gini coefficient for each row of a sparse matrix (Obs*Var).

Parameters

iint

row index

Xnumpy array

matrix

Returns

float

Gini coefficient for the given row

Examples

>>> X = np.matrix([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
>>> gini(0, X)
0.0
>>> gini(1, X)
0.0
>>> gini(2, X)
0.0

sincei.WriteBedGraph module

class sincei.WriteBedGraph.WriteBedGraph(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]

Bases: CountReadsPerBin

Reads bam files coverages and writes a bedgraph or bigwig file

Extends the CountReadsPerBin object such that the coverage of bam files is writen to multiple bedgraph files at once.

The bedgraph files are later merge into one and converted into a bigwig file if necessary.

The constructor arguments are the same as for CountReadsPerBin. However, when calling the run method, the following parameters have to be passed

Examples

Given the following distribution of reads that cover 200 on a chromosome named ‘3R’:

  0                              100                           200
  |------------------------------------------------------------|
A                                ===============
                                                ===============


B                 ===============               ===============
                                 ===============
                                                ===============
>>> import tempfile
>>> test_path = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> outFile = tempfile.NamedTemporaryFile()
>>> bam_file = test_path +  "testA.bam"

For the example a simple scaling function is going to be used. This function takes the coverage found at each region and multiplies it to the scaling factor. In this case the scaling factor is 1.5

>>> function_to_call = scaleCoverage
>>> funcArgs = {'scaleFactor': 1.5}

Restrict process to a region between positions 0 and 200 of chromosome 3R

>>> region = '3R:0:200'

Set up such that coverage is computed for consecutive bins of length 25 bp >>> bin_length = 25 >>> step_size = 25

>>> num_sample_sites = 0 #overruled by step_size
>>> c = WriteBedGraph([bam_file], binLength=bin_length, region=region, stepSize=step_size)
>>> c.run(function_to_call, funcArgs, outFile.name)
>>> f = open(outFile.name, 'r')
>>> f.readlines()
['3R\t0\t100\t0\n', '3R\t100\t200\t1.5\n']
>>> f.close()
>>> outFile.close()
run(func_to_call, func_args, out_file_prefix, blackListFileName=None, format='bedgraph', smoothLength=0, normUsing=None)[source]

Given a list of bamfiles, a function and a function arguments, this method writes a bedgraph file (or bigwig) file for a partition of the genome into tiles of given size and a value for each tile that corresponds to the given function and that is related to the coverage underlying the tile.

Parameters

func_to_callstr

function name to be called to convert the list of coverages computed for each bam file at each position into a single value.

func_argsdict

dict of arguments to pass to func. E.g. {‘scaleFactor’:1.0}

out_file_prefixstr

name of the file to save the resulting data.

writeBedGraph_worker(chrom, start, end, func_to_call, func_args, bed_regions_list=None)[source]

Writes a bedgraph based on the read coverage per group of cells, indicated by cluster_info data frame.

The given func is called to compute the desired bedgraph value using the funcArgs

Parameters

chromstr

Chrom name

startint

start coordinate

endint

end coordinate

func_to_callstr

function name to be called to convert the list of coverages computed for each bam file at each position into a single value. An example is a function that takes the ratio between the coverage of two bam files.

func_argsdict

dict of arguments to pass to func.

smoothLengthint

Distance in bp for smoothing the coverage per tile.

bed_regions_list: list

List of tuples of the form (chrom, start, end) corresponding to bed regions to be processed. If not bed file was passed to the object constructor then this list is empty.

Returns

A list of [chromosome, start, end, temporary file], where the temporary file contains the bedgraph results for the region queried.

Examples

>>> test_path = os.path.dirname(os.path.abspath(__file__)) + "/test/test_data/"
>>> bamFile1 = test_path +  "testA.bam"
>>> bin_length = 50
>>> number_of_samples = 0 # overruled by step_size
>>> func_to_call = scaleCoverage
>>> funcArgs = {'scaleFactor': 1.0}
>>> c = WriteBedGraph([bamFile1], bin_length, number_of_samples, stepSize=50)
>>> tempFile = c.writeBedGraph_worker( '3R', 0, 200, func_to_call, funcArgs)
>>> f = open(tempFile[3], 'r')
>>> f.readlines()
['3R\t0\t100\t0\n', '3R\t100\t200\t1\n']
>>> f.close()
>>> os.remove(tempFile[3])
sincei.WriteBedGraph.scaleCoverage(tile_coverage, args)[source]

Return coverage per cluster as sum of cells. tileCoverage should be an list with only one element

sincei.WriteBedGraph.writeBedGraph_wrapper(args)[source]

Passes the arguments to writeBedGraph_worker. This is a step required given the constrains from the multiprocessing module. The args var, contains as first element the ‘self’ value from the WriteBedGraph object

sincei.multimodalClustering module

sincei.multimodalClustering.multiModal_clustering(mode1_adata, mode2_adata, column_key='barcode_nla', nK=20)[source]

Performs multi-graph clustering on matched keys(barcodes) bw two anndata objects.

Parameters

mode1_adataAnnData

AnnData object for mode 1

mode2_adataAnnData

AnnData object for mode 2

column_keystr

Column name for the barcode in mode 1 and 2

nKint

Number of clusters to use for clustering

Returns

multi_umapDataFrame

DataFrame with UMAP coordinates and cluster labels

mode1_adataAnnData

AnnData object for mode 1

mode2_adataAnnData

AnnData object for mode 2

sincei.multimodalClustering.umap_aligned(pca_mode1, pca_mode2, nK=15, distance_metric='eucledian')[source]

Aligns two UMAP embeddings using the UMAP AlignedUMAP class

Parameters

pca_mode1pandas.DataFrame

UMAP embedding of RNA data

pca_mode2pandas.DataFrame

UMAP embedding of CHiC data

nKint

Number of nearest neighbors to use for UMAP

distance_metricstr

Distance metric to use for UMAP

Returns

pandas.DataFrame

Aligned UMAP embedding of RNA data

pandas.DataFrame

Aligned UMAP embedding of CHiC data

Examples

>>> test = Tester()
>>> pca_mode1 = test.pca_mode1
>>> pca_mode2 = test.pca_mode2
>>> umap_aligned(pca_mode1, pca_mode2)
(                 aligned_UMAP1  aligned_UMAP2
0  -0.012995  0.001206
1   0.

sincei.scBAMops module

sincei.scBAMops.convertBED(oname, tmpFiles, chromDict)[source]

Stores results in BEDPE format, which is: chromosome frag_leftend frag_rightend

The fragment ends can be shifted

sincei.scBAMops.filterWorker(arglist)[source]
sincei.scBAMops.get_args()[source]
sincei.scBAMops.main(args=None)[source]
sincei.scBAMops.parseArguments()[source]
sincei.scBAMops.shiftRead(b, chromDict, args)[source]

sincei.scBulkCoverage module

class sincei.scBulkCoverage.CenterFragment(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]

Bases: WriteBedGraph

Class to redefine the get_fragment_from_read for the –MNase case

The coverage of the fragment is defined as the 2 or 3 basepairs at the center of the fragment length.

get_fragment_from_read(read)[source]

Takes a proper pair fragment of high quality and limited to a certain length and outputs the center

class sincei.scBulkCoverage.OffsetFragment(bamFilesList, binLength=50, barcodes=None, cellTag=None, groupTag=None, groupLabels=None, clusterInfo=None, motifFilter=None, genome2bit=None, GCcontentFilter=None, numberOfSamples=None, numberOfProcessors=1, verbose=False, region=None, bedFile=None, extendReads=False, genomeChunkSize=None, blackListFileName=None, minMappingQuality=None, duplicateFilter=None, chrsToSkip=[], stepSize=None, center_read=False, samFlag_include=None, samFlag_exclude=None, zerosToNans=False, skipZeroOverZero=False, smoothLength=0, minFragmentLength=0, maxFragmentLength=0, minAlignedFraction=0, out_file_for_raw_data=None, bed_and_bin=False, sumCoveragePerBin=False, binarizeCoverage=False, statsList=[], mappedList=[])[source]

Bases: WriteBedGraph

Class to redefine the get_fragment_from_read for the –Offset case

filterStrand(read, rv)[source]

A generic read filtering function that gets used by everything in this class.

rv is returned if the strand is correct, otherwise [(None, None)]

get_fragment_from_read(read)[source]

This is mostly a wrapper for self.get_fragment_from_read_list(), which needs a list and for the offsets to be tweaked by 1.

get_fragment_from_read_list(read, offset)[source]

Return the range of exons from the 0th through 1st bases, inclusive. Positions are 1-based

sincei.scBulkCoverage.get_args()[source]
sincei.scBulkCoverage.main(args=None)[source]
sincei.scBulkCoverage.parseArguments()[source]

sincei.scClusterCells module

sincei.scClusterCells.get_args()[source]
sincei.scClusterCells.main(args=None)[source]
sincei.scClusterCells.parseArguments()[source]

sincei.scCombineCounts module

sincei.scCombineCounts.get_args()[source]
sincei.scCombineCounts.main(args=None)[source]
sincei.scCombineCounts.parseArguments()[source]

sincei.scCountQC module

sincei.scCountQC.filter_adata(adata, filter_region_dict=None, filter_cell_dict=None, bad_chrom=None, bad_cells=None)[source]
sincei.scCountQC.get_args()[source]
sincei.scCountQC.main(args=None)[source]
sincei.scCountQC.parseArguments()[source]

sincei.scCountReads module

sincei.scCountReads.get_args()[source]
sincei.scCountReads.main(args=None)[source]

1. get read counts at different positions either all of same length or from genomic regions from the BED file

  1. save data for further plotting

sincei.scCountReads.parseArguments(args=None)[source]

sincei.scFilterBarcodes module

sincei.scFilterBarcodes.count_occurrences(res)[source]
sincei.scFilterBarcodes.getFiltered_worker(arglist)[source]
sincei.scFilterBarcodes.get_args()[source]
sincei.scFilterBarcodes.ham_dist(s1, s2)[source]
sincei.scFilterBarcodes.main(args=None)[source]
sincei.scFilterBarcodes.parseArguments()[source]

sincei.scFilterStats module

sincei.scFilterStats.getFiltered_worker(arglist)[source]
sincei.scFilterStats.main(args=None)[source]
sincei.scFilterStats.parseArguments()[source]

sincei.scJSD module

sincei.scJSD.get_args()[source]
sincei.scJSD.main(args=None)[source]
sincei.scJSD.parseArguments(args=None)[source]

sincei.sincei module

sincei.sincei.main(args=None)[source]
sincei.sincei.parse_arguments(args=None)[source]
sincei.sincei.process_args(args=None)[source]

Module contents