sincei.FeatureScorer module
- sincei.FeatureScorer.get_indices_overlapping(adata, chrom, start, end)
Given an AnnData object and a genomic region defined by chromosome, start, and end positions, this function returns the indices of the features that overlap the region.
Parameters
- adata: AnnData
The input AnnData object containing the data.
- chrom: str
The chromosome of the region.
- start: int
The start position of the region.
- end: int
The end position of the region.
Returns
- overlap_indices: np.ndarray or None
Array of global feature indices that overlap with the region, or None if no overlaps.
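As a rough illustration of the lookup described above, the sketch below finds overlapping feature indices in plain Python. It is not sincei's implementation: the helper name `overlapping_indices`, the list-of-tuples feature table, and the half-open interval convention are all assumptions made for the example, and it returns a list rather than an `np.ndarray`.

```python
def overlapping_indices(features, chrom, start, end):
    """Hypothetical helper: indices of features overlapping the query region.

    `features` is a list of (chrom, start, end) tuples (assumed layout).
    Intervals are treated as half-open, [start, end). Returns None when
    nothing overlaps, mirroring the documented return value.
    """
    hits = [
        i
        for i, (f_chrom, f_start, f_end) in enumerate(features)
        # Two intervals overlap when each one starts before the other ends.
        if f_chrom == chrom and f_start < end and f_end > start
    ]
    return hits or None


# Toy feature table; a real AnnData object would store coordinates differently.
features = [
    ("chr1", 0, 1000),
    ("chr1", 1000, 2000),
    ("chr1", 2000, 3000),
    ("chr2", 1000, 2000),
]

hits = overlapping_indices(features, "chr1", 1500, 2500)
print(hits)  # indices of the second and third chr1 bins
```

Only features on the queried chromosome are considered, so the `chr2` bin is skipped even though its coordinates fall inside the query range.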
- sincei.FeatureScorer.get_decay_weights(gene_start, gene_end, feature_starts, feature_ends, strand='+', decay=0.75, gene_body=True, excluded_regions=[])
This function computes a vector of weights for calculating the activity of a particular gene in a given region. Each weight is the average of np.exp(-decay * distance / 10000) across the feature body, assuming counts are uniformly distributed within the feature. Features in excluded_regions are assigned a weight of 0.
Parameters
- gene_start: int
The start position of the gene of interest.
- gene_end: int
The end position of the gene of interest.
- feature_starts: np.ndarray
Array of feature start positions.
- feature_ends: np.ndarray
Array of feature end positions.
- strand: str, optional
The strand of the gene ('+' or '-'), by default '+'.
- decay: float, optional
Decay parameter for weighting, by default 0.75. Higher values lead to faster decay.
- gene_body: bool, optional
Whether features within the gene body receive a weight of 1, like the TSS, by default True. If True, the decay starts beyond the gene body.
- excluded_regions: list of tuples, optional
List of (start, end) tuples defining regions to exclude from contributing to the activity score (weight 0), by default an empty list.
Returns
- weights: np.ndarray
Array of average weights for each feature.
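The "average across each feature" in the description above has a closed form when counts are assumed uniform within a feature. The sketch below works it out for a single feature. It is only an illustration under stated assumptions: `mean_decay_weight` is a hypothetical helper, and the way distances are measured (from the TSS or the gene-body edge, strand handling, excluded regions) is deliberately left out.

```python
import math


def mean_decay_weight(d_near, d_far, decay=0.75, scale=10000.0):
    """Hypothetical helper: mean of exp(-decay * d / scale) for d uniform on [d_near, d_far].

    d_near and d_far are the distances (in bp) of the feature's nearest and
    farthest edges from the reference point; choosing that reference point
    is outside the scope of this sketch.
    """
    k = decay / scale
    if d_far == d_near:
        # Zero-width feature: the average collapses to a point value.
        return math.exp(-k * d_near)
    # Closed form of (1 / (d_far - d_near)) * integral of exp(-k * d) over [d_near, d_far].
    return (math.exp(-k * d_near) - math.exp(-k * d_far)) / (k * (d_far - d_near))


# A 1 kb feature lying 5 to 6 kb away from the reference point.
w = mean_decay_weight(5000, 6000)

# Cross-check the closed form against a midpoint-rule numerical average.
n = 10000
w_numeric = sum(
    math.exp(-0.75 * (5000 + (i + 0.5) * 1000 / n) / 10000) for i in range(n)
) / n
print(w, w_numeric)
```

The averaged weight always lies between the point weights at the two edges of the feature, so wide features far from the gene are not over- or under-weighted by the choice of a single representative position.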
- sincei.FeatureScorer.FeatureScorer(adata, gtf, mode, overlap_policy='partial', penalty=None, decay=0.75, max_region=100, gene_body=True, gene_size_factor=True, exclude_in_range=None, center_scores=False, verbose=False, n_threads=1)
This function calculates a cell x gene matrix of gene activity scores. It first parses the input BED/GTF file to get gene/feature annotations. For each annotated gene/feature, it then identifies the relevant genomic region (including upstream/downstream regions if specified), retrieves the counts of the features overlapping that region, applies decay weights if specified, and computes the weighted sum of counts to obtain the gene activity score for each cell. Finally, it L1-normalizes the scores row-wise (per cell).
Parameters
- adata: AnnData
The input AnnData object containing the data.
- gtf: str
Path to the BED/GTF file with region annotations.
- mode: str
Scoring mode. Options are 'aggregate' or 'activities'.
aggregate: calculates the total counts of the genomic features in the input BED/GTF file from the input anndata.
activities: calculates the weighted sum of counts based on distance to the TSS of the genes in the input GTF file. The weights are calculated using an exponential decay function.
- overlap_policy: str, optional
Policy for handling adata features that only partially overlap regions in the BED/GTF provided. Options are:
partial: count reads in the anndata feature proportionally to the overlap fraction. counts_considered = feature_counts * overlap_length / region_length.
all: count all reads in the partially overlapping anndata feature.
none: exclude reads from partially overlapping anndata features; in other words, only count reads in anndata features fully contained within BED/GTF regions.
Default is 'partial'.
- penalty: float, optional
Penalty value used to select VCRs from a BED file that contains VCRs calculated using multiple penalties, by default None.
- decay: float, optional
Decay parameter for calculating the decay weights, by default 0.75. Higher values lead to faster decay. Weights are calculated as exp(-decay * distance_in_kb / 10). This parameter is ignored in aggregate mode.
- max_region: int, optional
Maximum region size around the gene (upstream and downstream) to consider, in kilobases, by default 100 kb.
- gene_body: bool, optional
Whether features within the gene body receive a weight of 1, like the TSS, by default True. If True, the decay starts beyond the gene body.
- gene_size_factor: bool, optional
Whether to divide scores by gene length to account for gene length bias, by default True.
- exclude_in_range: str, optional
Whether to exclude regions of other genes from contributing to this gene's activity score. Options are:
None: No exclusion (default)
"TSS": Exclude features overlapping the TSS of other genes
"genes": Exclude features overlapping the bodies of other genes
Invalid values default to None.
- center_scores: bool, optional
Whether to scale the scores to unit variance and center them around zero, by default False. This destroys the sparsity of the output matrix and can lead to increased memory usage. Use with caution for large datasets.
- verbose: bool, optional
Print progress messages and warnings, by default False.
- n_threads: int, optional
Number of threads to use for parallel processing, by default 1.
Returns
- adata_out: AnnData
AnnData object with cells as obs and genes as var, containing gene activity scores.
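The last two steps described for this function, the weighted sum of counts and the row-wise L1 normalization, can be sketched on a toy matrix as follows. This is a minimal illustration in plain Python, not sincei's code: `weighted_scores` is a hypothetical helper, the weights are made-up numbers rather than decay weights from a real annotation, and options such as gene_size_factor, overlap_policy, and center_scores are not modelled.

```python
def weighted_scores(counts, weights):
    """Hypothetical helper: cell x gene scores from cell x feature counts.

    counts:  list of rows, one per cell, with one count per feature.
    weights: list of rows, one per feature, with one weight per gene.
    Each output row is L1-normalized so that it sums to 1 (all-zero rows
    are left untouched to avoid dividing by zero).
    """
    n_genes = len(weights[0])
    scores = []
    for cell_counts in counts:
        # Weighted sum over features for every gene.
        row = [
            sum(c * w[g] for c, w in zip(cell_counts, weights))
            for g in range(n_genes)
        ]
        total = sum(abs(v) for v in row)
        scores.append([v / total for v in row] if total > 0 else row)
    return scores


# 2 cells x 3 features
counts = [[2, 0, 1], [0, 4, 4]]
# 3 features x 2 genes: the middle feature contributes half to each gene.
weights = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

scores = weighted_scores(counts, weights)
for row in scores:
    print(row)
```

Because each row is divided by its own total, the resulting scores are comparable across cells with different sequencing depths, at the cost of expressing each gene's activity as a fraction of the cell's total.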