sincei.GLMPCA module

class sincei.GLMPCA.GLMPCA(n_pc, family='gaussian', family_params=None, max_iter=100, learning_rate=0.2, batch_size=256, step_size=20, gamma=0.5, n_init=1, init='spectral', n_jobs=1)

Performs GLM-PCA on a data matrix to reduce its dimensionality.

This class computes the generalized linear model principal components (GLM-PCs) of a dataset by exploiting the framework of saturated parameters. Specifically, given an exponential family distribution chosen based on prior knowledge, GLM-PCA finds a collection of directions which minimize the reconstruction error, computed as the negative log-likelihood of the chosen distribution.
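
As a rough illustration of this reconstruction error, the sketch below evaluates the Poisson negative log-likelihood as a function of the natural parameter (the log-rate) and checks that the saturated parameters, here log(x), give a lower value than a shared rate. It uses the textbook Poisson formula and is not taken from sincei's code; `poisson_nll` and the toy counts are hypothetical names chosen for this example.

```python
import math


def poisson_nll(x, theta):
    """Negative log-likelihood of counts x under Poisson rates exp(theta)."""
    return sum(
        math.exp(t) - xi * t + math.lgamma(xi + 1)
        for xi, t in zip(x, theta)
    )


counts = [1, 3, 7, 2]

# Saturated parameters: one free parameter per observation, theta_i = log(x_i).
saturated = [math.log(c) for c in counts]

# A constrained alternative: every observation shares the mean log-rate.
shared = [math.log(sum(counts) / len(counts))] * len(counts)

nll_saturated = poisson_nll(counts, saturated)
nll_shared = poisson_nll(counts, shared)
print(nll_saturated < nll_shared)
```

GLM-PCA sits between these two extremes: it constrains the parameters to a low-dimensional subspace and looks for the subspace whose negative log-likelihood is smallest.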

By making use of an alternative formulation, our implementation can exploit automatic differentiation and can therefore rely on mini-batch Stochastic Gradient Descent. As a consequence, it scales to large datasets.

Another interesting feature of our implementation is that it does not require any cumbersome Lagrangian derivations. If you wish to test an exponential family distribution not present in our implementation, you may add it by creating a subclass of sincei.ExponentialFamily.ExponentialFamily, with its corresponding density function. This would suffice to use it for GLM-PCA.

Parameters

n_pc: int

Number of principal components to compute.

family: str

Name of the exponential family distribution to use. Possible families: "gaussian", "poisson", "bernoulli", "beta", "gamma", "log_normal", "log_beta", "sigmoid_beta". Defaults to "gaussian".

family_params: dict

Dictionary with additional exponential family parameters. The list of parameters depends on the specific ExponentialFamily class chosen.

"n_jobs" (int) for parallelization, specifically for "beta" and "gamma".
"min_val" (float) for truncating in "poisson" or "beta".
"eps" (float) for convergence in inverse computation in "beta".

Defaults to None.

max_iter: int

Maximum number of epochs in the GLM-PCA optimisation. Defaults to 100.

learning_rate: float

Learning rate to be used in the GLM-PCA optimisation. If learning_rate is too high and leads to NaN values, our implementation automatically restarts the optimisation with a smaller value. Defaults to 0.2.

batch_size: int

Size of the batch in the SGD optimisation step. Defaults to 256.

step_size: int

Period of learning-rate decay in the optimiser scheduler (the step_size argument of StepLR). Defaults to 20. See more: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html

gamma: float

Multiplicative factor of learning-rate decay in the optimiser scheduler. Defaults to 0.5. See more: https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html
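
To see how step_size and gamma interact, the sketch below reproduces the StepLR rule in plain Python: the learning rate is multiplied by gamma once every step_size scheduler steps. It assumes the scheduler is stepped once per epoch, which this page does not state; `step_lr` is a hypothetical helper, not part of sincei or PyTorch.

```python
def step_lr(initial_lr, step_size, gamma, n_epochs):
    """Learning rate at each epoch under a StepLR-style schedule."""
    return [
        initial_lr * gamma ** (epoch // step_size)
        for epoch in range(n_epochs)
    ]


# Values taken from the defaults listed above (assumed one step per epoch).
schedule = step_lr(initial_lr=0.2, step_size=20, gamma=0.5, n_epochs=100)

for epoch in (0, 19, 20, 40, 60, 80, 99):
    print(epoch, schedule[epoch])
```

Under that assumption, the default settings halve the learning rate four times over 100 epochs, ending at one sixteenth of the initial value.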

n_init: int

Number of GLM-PCA initializations. Useful if you want to explore different random seeds and starting points. Defaults to 1.

init: str

Method to initialize loadings. "spectral" performs SVD on the saturated parameters from a small random batch of the dataset, "random" performs a random initialization on the Stiefel manifold. Defaults to "spectral".
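
The two initialisation strategies can be sketched with NumPy as follows. This is an illustration of the general ideas (QR of a Gaussian matrix for a random point on the Stiefel manifold, SVD of a batch for a spectral start), not sincei's actual code; `random_stiefel`, `spectral_loadings` and the toy batch are hypothetical, and whether sincei centres the batch before the SVD is not stated here.

```python
import numpy as np


def random_stiefel(n_features, n_pc, seed=0):
    """Draw a random n_features x n_pc matrix with orthonormal columns."""
    rng = np.random.default_rng(seed)
    gaussian = rng.standard_normal((n_features, n_pc))
    # Reduced QR: q has the same shape as the input and orthonormal columns.
    q, _ = np.linalg.qr(gaussian)
    return q


def spectral_loadings(saturated_batch, n_pc):
    """Top right singular vectors of a batch, one column per component."""
    _, _, vt = np.linalg.svd(saturated_batch, full_matrices=False)
    return vt[:n_pc].T


rng = np.random.default_rng(1)
batch = rng.standard_normal((64, 30))  # stand-in for saturated parameters

random_init = random_stiefel(n_features=30, n_pc=5)
spectral_init = spectral_loadings(batch, n_pc=5)

# Both candidates satisfy the orthonormality constraint W^T W = I.
print(np.allclose(random_init.T @ random_init, np.eye(5)))
print(np.allclose(spectral_init.T @ spectral_init, np.eye(5)))
```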

n_jobs: int

Number of jobs used in parallel operations. Defaults to 1.

fit(X)

Fits a GLM-PCA to a specific dataset.

Parameters

X: torch.Tensor, np.ndarray or AnnData

Dataset with cells in rows and features in columns.

Returns

bool: Returns True if the fitting procedure was successful.

transform(X)

Transforms and projects dataset X onto the principal components.

Parameters

X: torch.Tensor or np.ndarray

Dataset with cells in rows and features in columns.

Returns

torch.Tensor: Projected saturated parameters.
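
A minimal usage sketch, built only from the signatures documented above, might look like the following. The import path is inferred from the class name sincei.GLMPCA.GLMPCA; the toy count matrix, the helper `glmpca_embed`, the chosen hyperparameters and the float32 cast are assumptions made for illustration.

```python
import numpy as np


def glmpca_embed(counts, n_pc=5):
    """Fit GLM-PCA on a cells x features matrix and return its projection."""
    # Imported here so the rest of the sketch runs without sincei installed.
    from sincei.GLMPCA import GLMPCA

    model = GLMPCA(n_pc=n_pc, family="poisson", max_iter=50, batch_size=128)
    model.fit(counts)  # documented to return True on success
    return model.transform(counts)  # documented to return a torch.Tensor


# Toy count matrix: 200 cells (rows) by 30 features (columns).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(200, 30)).astype(np.float32)
print(counts.shape)
```

With sincei installed, calling `glmpca_embed(counts)` should return the projected saturated parameters. The accepted dtypes and the shape of the result are not stated on this page, so check them against the installed version.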
