sincei.FeatureScorer module

sincei.FeatureScorer.get_indices_overlapping(adata, chrom, start, end)

This function takes an AnnData object and a genomic region defined by its chromosome, start, and end positions. It returns the indices of the features that overlap the region.

Parameters

adata : AnnData

The input AnnData object containing the data.

chrom : str

The chromosome of the region.

start : int

The start position of the region.

end : int

The end position of the region.

Returns

overlap_indices : np.ndarray or None

Array of global feature indices that overlap with the region, or None if no overlaps.
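The overlap test described above can be sketched with plain NumPy. This is a minimal illustration, not the sincei implementation: the arrays `chroms`, `starts`, and `ends` are hypothetical stand-ins for per-feature coordinates (how sincei stores them inside the AnnData object is not specified here), and half-open intervals are assumed.

```python
import numpy as np

def overlapping_indices(chroms, starts, ends, chrom, start, end):
    """Return indices of features overlapping [start, end) on chrom, or None."""
    chroms = np.asarray(chroms)
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    # Two half-open intervals overlap when each one starts before the other ends.
    mask = (chroms == chrom) & (starts < end) & (ends > start)
    idx = np.flatnonzero(mask)
    return idx if idx.size > 0 else None

# Hypothetical feature coordinates (assumption, for illustration only).
chroms = ["chr1", "chr1", "chr1", "chr2"]
starts = [0, 1000, 5000, 1000]
ends = [500, 2000, 6000, 2000]

hits = overlapping_indices(chroms, starts, ends, "chr1", 1500, 5500)
misses = overlapping_indices(chroms, starts, ends, "chr3", 0, 10)
```

Returning None rather than an empty array mirrors the documented return value, so callers should check for None before indexing.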

sincei.FeatureScorer.get_decay_weights(gene_start, gene_end, feature_starts, feature_ends, strand='+', decay=0.75, gene_body=True, excluded_regions=[])

This function computes a vector of weights for calculating the gene activity of a particular gene in a given region. The weights are the average exponential decay weight across each feature body, assuming uniform count distribution within features. Features in excluded_regions are assigned a weight of 0.

The weight of each feature is the average of np.exp(-decay * distance / 10000) across the feature body, where distance is measured in base pairs.
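As a worked example of the formula above, the sketch below averages the exponential decay over a single feature. It is an illustration under stated assumptions, not the sincei implementation: `mean_decay_weight` is a hypothetical helper, `d0` and `d1` are the distances (in base pairs) from the reference point to the two ends of the feature, and the average uses the closed-form integral of the exponential over a uniform distribution.

```python
import numpy as np

def mean_decay_weight(d0, d1, decay=0.75):
    """Average of exp(-decay * d / 10000) for d uniform on [d0, d1] (in bp)."""
    k = decay / 10000.0
    if d1 == d0:
        # Zero-length feature: the average is just the point weight.
        return float(np.exp(-k * d0))
    # Closed-form mean of an exponential over an interval.
    return float((np.exp(-k * d0) - np.exp(-k * d1)) / (k * (d1 - d0)))

# Point weight exactly 10 kb from the reference position.
point = float(np.exp(-0.75 * 10000 / 10000))

# Average weight of a feature spanning 9 kb to 11 kb from the reference position.
avg = mean_decay_weight(9000, 11000)

# Numerical check: average the point weights on a dense grid over the feature.
grid = np.linspace(9000, 11000, 100001)
numeric = float(np.mean(np.exp(-0.75 * grid / 10000)))
```

Because the decay curve is convex, the feature average is slightly larger than the point weight at the feature midpoint.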

Parameters

gene_start : int

The start position of the gene of interest.

gene_end : int

The end position of the gene of interest.

feature_starts : np.ndarray

Array of feature start positions.

feature_ends : np.ndarray

Array of feature end positions.

strand : str, optional

The strand of the gene ('+' or '-'), by default '+'.

decay : float, optional

Decay parameter for weighting, by default 0.75. Higher values lead to faster decay.

gene_body : bool, optional

Whether the weight of the gene body is considered as 1 like the TSS, by default True. If True, the decay starts beyond the gene body.

excluded_regions : list of tuples, optional

List of (start, end) tuples defining regions to exclude from contributing to the activity score (weight 0).

Returns

weights : np.ndarray

Array of average weights for each feature.

sincei.FeatureScorer.FeatureScorer(adata, gtf, mode, overlap_policy='partial', penalty=None, decay=0.75, max_region=100, gene_body=True, gene_size_factor=True, exclude_in_range=None, center_scores=False, verbose=False, n_threads=1)

This function calculates a cell × gene matrix of gene activity scores. It parses the input BED/GTF file to obtain gene/feature annotations; identifies the relevant genomic region for each gene (including upstream/downstream regions, if specified); retrieves the counts of the features overlapping that region; applies decay weights, if specified; computes the weighted sum of counts to obtain the gene activity score for each cell; and finally L1-normalizes the scores row-wise (per cell).
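The last two steps (weighted sum, then row-wise L1 normalization) can be sketched with small dense arrays. This is only an illustration: the `counts` and `weights` matrices are made-up values, and sincei may well operate on sparse matrices and per-gene weight vectors rather than a single dense product.

```python
import numpy as np

# Hypothetical counts: 2 cells x 3 features.
counts = np.array([[2.0, 0.0, 4.0],
                   [1.0, 3.0, 0.0]])

# Hypothetical weights: 3 features x 2 genes.
weights = np.array([[1.0, 0.0],
                    [0.5, 0.5],
                    [0.0, 1.0]])

# Weighted sum of feature counts gives one score per cell and gene.
scores = counts @ weights

# L1-normalize each row (cell) so that its absolute values sum to 1.
row_sums = np.abs(scores).sum(axis=1, keepdims=True)
row_sums[row_sums == 0] = 1.0  # leave all-zero rows unchanged
normalized = scores / row_sums
```

Guarding against zero row sums avoids division-by-zero for cells with no counts in any scored region.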

Parameters

adata : AnnData

The input AnnData object containing the data.

gtf : str

Path to the BED/GTF file with region annotations.

mode : str

Scoring mode. Options are 'aggregate' or 'activities'. In 'aggregate' mode, the function calculates the total counts in the input anndata for each genomic feature in the input BED/GTF file. In 'activities' mode, it calculates a weighted sum of counts based on the distance to the TSS of each gene in the input GTF file, with weights given by an exponential decay function.

overlap_policy : str, optional

Policy for handling adata features that only partially overlap regions in the BED/GTF provided. Options are:

  • partial: count reads in anndata feature proportionally to the overlap fraction. counts_considered = feature_counts * overlap_length / region_length.

  • all: count all reads in the partially overlapping anndata feature.

  • none: exclude reads from partially overlapping anndata features, in other words, only count reads in anndata features fully contained within BED/GTF regions.

Default is 'partial'.
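The three policies above can be compared on a single partially overlapping feature. The helper below is a hypothetical sketch, not part of sincei; it takes `overlap_length` and `region_length` as plain numbers and applies the formula exactly as written in the 'partial' option.

```python
def counts_considered(feature_counts, overlap_length, region_length, policy="partial"):
    """Counts contributed by one partially overlapping anndata feature."""
    if policy == "partial":
        # Scale the counts by the overlap fraction, as in the formula above.
        return feature_counts * overlap_length / region_length
    if policy == "all":
        # Keep every read in the feature, regardless of overlap size.
        return feature_counts
    if policy == "none":
        # Partially overlapping features contribute nothing.
        return 0.0
    raise ValueError(f"unknown overlap policy: {policy!r}")

# A feature with 10 reads, overlapping by 250 bp, with a region length of 1000 bp.
results = {p: counts_considered(10, 250, 1000, p) for p in ("partial", "all", "none")}
```

Here 'partial' keeps a quarter of the reads, 'all' keeps all ten, and 'none' drops the feature entirely.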

penalty : float, optional

Optional parameter to select VCRs of a particular penalty value from a BED file with VCRs calculated using multiple penalties.

decay : float, optional

Decay parameter for calculating the decay weights, by default 0.75. Higher values lead to faster decay. Weights are calculated as exp(-decay * distance_in_kb / 10). This parameter is ignored in aggregate mode.

max_region : int, optional

Maximum region size around the gene (upstream and downstream) to consider, in kilobases, by default 100 kb.

gene_body : bool, optional

Whether the weight of the gene body is considered as 1 like the TSS, by default True. If True, the decay starts beyond the gene body.

gene_size_factor : bool, optional

Whether to divide scores by gene length to account for gene length bias, by default True.

exclude_in_range : str, optional

Whether to exclude regions of other genes from contributing to this gene's activity score. Options are:

  • None: No exclusion (default)

  • "TSS": Exclude features overlapping the TSS of other genes

  • "genes": Exclude features overlapping the bodies of other genes

Invalid values default to None.

center_scores : bool, optional

Whether to scale the scores to unit variance and center them around zero, by default False. This destroys the sparsity of the output matrix and can lead to increased memory usage. Use with caution for large datasets.
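The loss of sparsity mentioned above is easy to see on a toy matrix. The sketch below assumes centering and scaling are done per gene (column), which is the usual convention but is an assumption here; the values are made up for illustration.

```python
import numpy as np

# Hypothetical scores: 3 cells x 2 genes, with several exact zeros.
scores = np.array([[0.0, 2.0],
                   [0.0, 5.0],
                   [3.0, 0.0]])

# Center each gene around zero and scale it to unit variance.
mean = scores.mean(axis=0)
std = scores.std(axis=0)
std[std == 0] = 1.0  # avoid dividing by zero for constant genes
centered = (scores - mean) / std

# Zeros before and after: subtracting the mean turns zeros into non-zero values.
zeros_before = int((scores == 0).sum())
zeros_after = int((centered == 0).sum())
```

Every zero entry becomes a non-zero value after centering, so a sparse matrix of the same shape would have to store all of its entries.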

verbose : bool, optional

Print progress messages and warnings. Default is False.

n_threads : int, optional

Number of threads to use for parallel processing, by default 1.

Returns

adata_out : AnnData

AnnData object with cells as obs and genes as var, containing gene activity scores.
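A call using the signature documented above might look like the following sketch. The wrapper name `score_gene_activities` and its `gtf_path` argument are hypothetical, and the keyword values simply restate documented defaults apart from `n_threads`; the import is placed inside the function so that defining the wrapper does not require sincei to be installed.

```python
def score_gene_activities(adata, gtf_path):
    """Hypothetical wrapper: compute gene activity scores for an AnnData object."""
    # Import path follows the documented module and function name.
    from sincei.FeatureScorer import FeatureScorer

    return FeatureScorer(
        adata,
        gtf_path,
        mode="activities",   # weighted sum based on distance to the TSS
        decay=0.75,          # documented default
        max_region=100,      # kilobases upstream and downstream of the gene
        gene_body=True,      # gene body weighted as 1, decay starts beyond it
        n_threads=4,         # arbitrary choice for illustration
    )
```

The returned AnnData object has cells as obs and genes as var, as described in the Returns section.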