sincei.VCRfinder module

sincei.VCRfinder.sparse_band_corr(X, k, chrom=None, verbose=True)

Compute only the first k diagonals of the correlation matrix of X, stored in banded format. Works directly on sparse matrices.

Parameters

X : scipy.sparse matrix or np.ndarray

Input data matrix of shape (n_samples, n_features).

k : int

Number of diagonals to compute (bandwidth).

chrom : str or None, optional

Chromosome name for progress bar description.

verbose : bool, optional

Whether to display progress bars.

Returns

band_corr : np.ndarray, shape (2*k+1, n_features)

Banded correlation matrix.
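
To illustrate what a banded correlation matrix contains, here is a minimal dense reference sketch. It is not the sincei implementation: the helper name `dense_band_corr` is hypothetical, it builds the full correlation matrix first (which the sparse version avoids), and the row layout (`band[k + i - j, j] = corr[i, j]`, as in LAPACK-style banded storage) is an assumption about how the `(2*k+1, n_features)` array is organised.

```python
import numpy as np

def dense_band_corr(X, k):
    """Hypothetical dense reference: keep only the k nearest diagonals of corr(X)."""
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    # Full feature-by-feature Pearson correlation (columns are features).
    corr = np.corrcoef(X, rowvar=False)
    band = np.zeros((2 * k + 1, n_features))
    for i in range(n_features):
        for j in range(max(0, i - k), min(n_features, i + k + 1)):
            # Assumed layout: row k holds the main diagonal,
            # rows above and below hold the off-diagonals.
            band[k + i - j, j] = corr[i, j]
    return band

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))   # 50 samples, 8 features (toy data)
band = dense_band_corr(X, k=2)
```

With this layout the middle row holds the main diagonal, so every entry in `band[k]` equals 1.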

sincei.VCRfinder.distance_kernel(sigma, truncate=4.0, radius=None)

Create a square Gaussian distance kernel.

Parameters

sigma : float

Standard deviation of the Gaussian.

truncate : float, optional

Truncate the kernel at this many standard deviations. Default is 4.0.

radius : int, optional

Radius of the kernel. If None, it is set to int(truncate * sigma).

Returns

kernel : 2D numpy array

The Gaussian distance kernel.
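
As a rough illustration of the parameters above, the following sketch builds one plausible square Gaussian kernel. The function name `gaussian_square_kernel` is hypothetical, and two details are assumptions rather than documented behaviour: the kernel is taken to be an isotropic Gaussian of the distance from the centre cell, and no normalisation is applied. Only the `radius = int(truncate * sigma)` rule comes from the documentation.

```python
import numpy as np

def gaussian_square_kernel(sigma, truncate=4.0, radius=None):
    """Hypothetical sketch of a square Gaussian kernel of side 2*radius + 1."""
    if radius is None:
        radius = int(truncate * sigma)  # documented default rule
    offsets = np.arange(-radius, radius + 1)
    dx, dy = np.meshgrid(offsets, offsets, indexing="ij")
    # Assumption: Gaussian of the Euclidean distance from the centre, unnormalised.
    return np.exp(-(dx**2 + dy**2) / (2.0 * sigma**2))

kernel = gaussian_square_kernel(sigma=1.5)  # radius = int(4.0 * 1.5) = 6
```

Here the kernel has side length `2 * 6 + 1 = 13`, and its centre value is 1 because it is left unnormalised.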

sincei.VCRfinder.VCRfinder(adata, binsize, max_region, n_kernels=20, penalties=[1], region=None, verbose=False, n_threads=1)

Detects variable chromatin regions (VCRs) from an AnnData object containing genomic signal data in equally sized bins (see scCountReads).

First, a bin-to-bin correlation matrix is computed for each chromosome.

Then, the correlation matrix is turned into a score map by convolving a number of square Gaussian kernels along its main diagonal. Each kernel has its own sigma and produces a 1-D score per bin; these scores are stacked into a matrix in which each row corresponds to a kernel scale and each column to a bin.

Finally, the PELT change-point detection algorithm is applied to the score map to identify regions with distinct correlation patterns. This step depends on a penalty parameter that controls the number of detected regions.

The function returns a pandas DataFrame containing the detected variable chromatin regions at each penalty. The DataFrame has columns: 'penalty', 'chrom', 'start', 'end'.
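
The score-map step described above can be sketched on a dense toy matrix as follows. This is an illustration under stated assumptions, not the sincei code: the helper `diagonal_score_map` is hypothetical, the sigma values are chosen arbitrarily (the rule sincei uses to derive them is not given here), each kernel is normalised to sum to 1, and the matrix is zero-padded at the edges.

```python
import numpy as np

def diagonal_score_map(corr, sigmas, truncate=4.0):
    """Hypothetical sketch: slide square Gaussian kernels along the diagonal."""
    n_bins = corr.shape[0]
    scores = np.zeros((len(sigmas), n_bins))  # rows: kernel scales, columns: bins
    for row, sigma in enumerate(sigmas):
        radius = int(truncate * sigma)
        offsets = np.arange(-radius, radius + 1)
        dx, dy = np.meshgrid(offsets, offsets, indexing="ij")
        kernel = np.exp(-(dx**2 + dy**2) / (2.0 * sigma**2))
        kernel /= kernel.sum()                # assumption: unit-sum kernel
        padded = np.pad(corr, radius, mode="constant")  # assumption: zero padding
        for i in range(n_bins):
            # Square window centred on diagonal element (i, i).
            window = padded[i:i + 2 * radius + 1, i:i + 2 * radius + 1]
            scores[row, i] = np.sum(window * kernel)
    return scores

# Toy correlation matrix: two perfectly correlated blocks of 10 bins each.
corr = np.zeros((20, 20))
corr[:10, :10] = 1.0
corr[10:, 10:] = 1.0
scores = diagonal_score_map(corr, sigmas=[0.5, 1.0])
```

In this toy case the score stays at 1 in the interior of a block and drops near the block boundary, which is the kind of change a change-point detector can pick up.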

Parameters

adata : anndata.AnnData

Input anndata object with binned chromatin data. adata.var must contain 'chrom', 'start', and 'end' columns.

binsize : int

Size of the bins in base pairs.

max_region : int

Size of the largest kernel in base pairs.

n_kernels : int, optional

Number of Gaussian kernels to use for convolution. Default is 20.

penalties : list of float, optional

List of penalty values for the change-point detection algorithm. Default is [1].

region : str, optional

Genomic region to limit the analysis to (e.g., 'chr1:100000:200000'). Default is None.

verbose : bool, optional

Print progress messages and warnings. Default is False.

n_threads : int, optional

Number of threads to use for parallel processing. Default is 1.

Returns

output : pd.DataFrame

Output DataFrame with detected variable chromatin regions at each penalty.
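
Because the output stacks the regions found at every penalty into one table, a typical next step is to compare penalties and select one. The sketch below works on a hand-made DataFrame with the documented columns ('penalty', 'chrom', 'start', 'end'); the values are invented for illustration and are not real VCRfinder output.

```python
import pandas as pd

# Toy stand-in for the returned DataFrame (values are made up).
vcrs = pd.DataFrame({
    "penalty": [1, 1, 1, 5, 5],
    "chrom": ["chr1", "chr1", "chr2", "chr1", "chr2"],
    "start": [0, 50000, 0, 0, 0],
    "end": [50000, 120000, 80000, 120000, 80000],
})

# Region length in base pairs.
vcrs["length"] = vcrs["end"] - vcrs["start"]

# How many regions were called at each penalty.
regions_per_penalty = vcrs.groupby("penalty").size()

# Keep only the regions detected at one chosen penalty.
selected = vcrs[vcrs["penalty"] == 5].reset_index(drop=True)
```

In this toy table the higher penalty yields fewer, longer regions, consistent with the penalty controlling the number of detected regions.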