seak
The construct_background_kernel module
Computes a background kernel matrix \(K_0\) or respective genotype matrix \(G_0\) from
binary PLINK 1 files (GRMLoader) or
loads a precomputed matrix \(K_0\) from a NumPy file (GRM_from_file).
Needed, if a linear mixed model is used to correct for confounding. In low rank case (\(n>k\)): loads a genotype matrix \(G_0\). In full rank case (\(n<k\)): computes background kernel \(K_0\). With \(n:=\) number of individuals and \(k:=\) number of SNVs.
GRM: genetic relatedness matrix.
- class seak.construct_background_kernel.GRMLoaderSnpReader(path_or_bed, blocksize, LOCO_chrom_id=None, forcelowrank=False)[source]
Bases:
objectConstructs a background kernel \(K_0\) from given binary PLINK 1 genotype files using the leave-one-out-chromosome (LOCO) strategy.
Initially no background kernel is constructed, only the instance attributes are initialized. The kernel gets constructed when calling the method
compute_background_kernel()which should only be called after calling theupdate_ind()method manually or theseak.data_loaders.intersect_and_update_datasets(). This way, individuals which are neither contained in the test nor in the background kernel data set are excluded.In full rank case, loads the SNPs in blocks to construct the kernel. In low rank case, loads all SNPs into memory at once.
- Parameters:
path_to_plink_files_with_prefix (str) – path prefix to genotype PLINK files for background kernel construction
blocksize (int) – how many genotypes to load at once; should be chosen dependent on RAM available
LOCO_chrom_id (str/int) – identifier of the chromosome/region that is used in the respective test set and should be excluded from the background kernel or None if all variants should be included
forcelowrank (bool) – enforce low rank data loading behavior for testing purposes
Note
The leave-one-chromosome-out (LOCO) strategy can be disabled with
LOCO_chrom_id.- compute_background_kernel()[source]
Computes background kernel \(K_0\) for given set of genotypes (binary PLINK 1 files).
Overlap with data of set to be tested should have been carried out before, such that individuals in both data sets match. Does not return anything but sets instance attributes for either the background kernel \(K_0\) or the background kernel genotype matrix \(G_0\).
- class seak.construct_background_kernel.GRMLoader_from_file(path_to_GRM_npy, forcelowrank=False)[source]
Bases:
objectLoad a precomputed GRM from a NumPy file (.npy or .npz).
Must either be of low rank \(n>k\) or a square \(nxn\) matrix with \(n:=\) number of individuals and \(k:=\) number of SNVs.
The data_loaders module
Contains all data IO and processing functionalities to read and manipulate genotype, covariate and variant effect prediction data.
Implements interfaces for all data types such that you can write your custom data loading classes (VariantLoader, AnnotationLoader, CovariatesLoader).
Accepted input file format for genotype data is the binary PLINK 1 format.
Accepted input file format for covariate and phenotype data is the CSV format.
Accepted input file format for variant effect predictions is the HDF5 format, following the syntax of the output of
Janggu.predict_variant_effect() (Janggu documentation) which needs to be accompanied by a index file in TSV format.
Note
The program assumes that genotype files are 1-based and fully closed, and variant effect prediction files 0-based and half-open.
- class seak.data_loaders.AnnotationLoader[source]
Bases:
objectInterface for loading variant annotations that implements data preprocessing methods.
- anno_by_id(vids)[source]
Retrieves annotations for a set of variant ids (
vids).The order will match that of
vids.- Parameters:
vids (pandas.Index) – names of variants to retrieve annotations for
- Returns:
variant effect predictions for given variants
- Return type:
numpy.ndarray
- Raises:
NotImplementedError – Interface.
- get_vids()[source]
Returns all variant ids.
- Returns:
Returns all variant ids.
- Return type:
pandas.Index
- Raises:
NotImplementedError – Interface.
- update_variants(vids=None, coordinates=None, exclude=False)[source]
Set variants to include/exclude based on variant ids (
vids) or genetic coordinates (coordinates).When
exclude== False (default), this also changes the order to match the order ofvids.- Parameters:
vids (pandas.Index) – Variant ids to include/exclude
coordinates (dict) – a dictionary of genomic coordinates {“chr”: str, “start”: int, “end”: int, “name”: str}
- Raises:
NotImplementedError – Interface.
- class seak.data_loaders.BEDRegionLoader(path_to_regions_UCSC_BED, chrom_to_load=None, drop_non_numeric_chromosomes=False)[source]
Bases:
RegionLoaderReads all regions in a UCSC BED-formatted file into memory.
- Variables:
regions (pandas.DataFrame) – DataFrame storing regions in memory {“chr”: str, “start”: int, “end”: int, “name”: str}
- regions
- class seak.data_loaders.CovariatesLoader[source]
Bases:
objectStores covariates (fixed effects) and phenotypes.
- class seak.data_loaders.CovariatesLoaderCSV(phenotype_of_interest, path_to_covariates, covariate_column_names, sep=',', path_to_phenotypes=None)[source]
Bases:
CovariatesLoaderStores covariates (fixed effects) and phenotypes.
Initializes instance of a CovariatesLoader class from a CSV file.
- Parameters:
phenotype_of_interest (str) – column name of phenotype that should be loaded
path_to_covariates (str) – path to CSV file with phenotype (if path_to_phenotypes is None) and covariates
covariate_column_names (list(str)) – list of the covariate column names
path_to_phenotypes (str) – path to CSV file with phenotype (if stored separately from covariates)
- cov
- covariate_column_names
- get_one_hot_covariates_and_phenotype(test_type, pheno=None)[source]
Returns numpy.ndarray of the phenotype and one hot encoded covariates with invariant covariates removed and a bias column.
Make sure
update_cov()was called beforehand, such thatself.covhas the same order as the genotypes. Internally calls functionpandas.get_dummies()which converts categorical variable into indicator variables.- Parameters:
test_type (string) – Either ‘logit’, ‘2K’ or ‘noK’
- Returns:
data for phenotype and one hot encoded covariates with invariant covariates removed and bias column
- Return type:
numpy.ndarray
- phenotype_of_interest
- class seak.data_loaders.EnsemblVEPLoader(uploaded_variation, location, gene, allele=None, data=None)[source]
Bases:
AnnotationLoaderCreates an AnnotationLoader from outputs columns of the Ensembl Variant Effect Predictor (VEP).
Specifically the columns ‘Uploaded_variation’, ‘Location’, ‘Allele’ and ‘Gene’. Additionally will take a numpy array or pandas dataframe with effects for those variants.
The format of the columns are described on: https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#defaultout
Uploaded variation - as chromosome_start_alleles Location - in standard coordinate format (chr:start or chr:start-end) Allele - the variant allele used to calculate the consequence Gene - Ensembl stable ID of affected gene
Positions in “Location” are one-based. “Uploaded_variation” is used as the variant ID.
- Parameters:
uploaded_variation – Series containing “Uploaded variation”
location – Series containing “Location”
allele – Series containing “Allele”
gene – Series containing “Gene”
data – DataFrame or ndarray containing the variant effect predictions
- anno_by_id(vids=None, gene=None)[source]
Retrieves gene-specific variant effect predictions based on variant IDs (vids) and gene
- Parameters:
vids – variant ids
gene – gene ids
default – if not None, default value to return when no variant effect predictions are found for a requested variant.
- anno_by_interval(coordinates, gene=None)[source]
Retrieves gene-specific variant effect predictions based on position and gene
- Parameters:
coordinates (dict) – genomic coordinates {“chr”: str, “start”: int, “end”: int}
gene – gene id
- get_vids()[source]
Returns all variant ids.
- Returns:
Returns all variant ids.
- Return type:
pandas.Index
- Raises:
NotImplementedError – Interface.
- pos_df
- update_variants(vids=None, coordinates=None, exclude=False)[source]
Set variants to include/exclude based on variant ids (
vids) or genetic coordinates (coordinates).- Parameters:
vids (pandas.Index) – Variant ids to include/exclude
coordinates (dict) – genomic coordinates {“chr”: str, “start”: int, “end”: int} to include/exclude
- vep_df
- class seak.data_loaders.Hdf5Loader(path_to_vep_bed, path_to_vep_hdf5, hdf5_key, from_janggu=False)[source]
Bases:
AnnotationLoaderLoads variant annotations from HDF5 files accompanied by a UCSC BED file or TSV with variant information.
The HDF5 files need to follow the syntax of the output of
Janggu.predict_variant_effect()(Janggu documentation). You need to specify a key to the data. Instances of the class still needs to be overlapped with respective genotypes and vice versa. UCSC BED files/or TSV require first three fields: chrom, chromStart, chromEnd; we also require field name, that needs to match the names of the genotype data. Above that: if TSV and not USCS BED file is given make sure the genomic coordinates are 0-based an half-open.- Parameters:
path_to_vep_bed (str) – path to UCSC BED file containing meta info of variant effect predictions
path_to_vep_hdf5 (str) – path to HDF5 file containing the variant effect predictions
hdf5_key (str) – name of variant effect predict to load
from_janggu (bool) – whether janggu was used to generate the veps, if yes, naming convention is presupposed
- anno_by_id(vids, shuffle=None)[source]
Retrieves annotations for a set of variant ids (
vids).The order will match that of vids.
- Parameters:
vids (pandas.Index) – names of variants to retrieve annotations for
shuffle (int) – For testing purpose only; default None; shuffles loaded veps along axis 0 or 1
- Returns:
variant effect predictions for given variants
- Return type:
numpy.ndarray
- from_janggu
- set_mask(mask)[source]
Sets a boolean mask along the annotation-axis to be applied when loading data. Can be used to sub-set annotations.
- Parameters:
mask (np.ndarray) – boolean mask to be applied to the columns (second dimension) of the annotations loaded.
- update_variants(vids=None, coordinates=None, exclude=False)[source]
Set variants to include/exclude based on variant ids (
vids) or genetic coordinates (coordinates).When
exclude== False (default), this also changes the order to match the order invids. Make sure that the vids are in the order of the veps in the respective index file if working with HDF5 files. –> Reason: HDF5 files can only be accessed indexing elements in increasing order!- Parameters:
vids (pandas.Index) – Variant ids to include/exclude
coordinates (dict) – genomic coordinates {“chr”: str, “start”: int, “end”: int} to include/exclude
- veps
- veps_index_df
- class seak.data_loaders.RegionLoader[source]
Bases:
objectInterface for loading genomic regions whose subclasses can be used to group variants into sets to test jointly.
Subclasses need only implement __iter__, which yields 0-based, right-open (->left-closed) dictionaries of genomic intervals {“chr”: str, “start”: int, “end”: int, “name”: str}
- class seak.data_loaders.VariantLoader[source]
Bases:
objectInterface for loading genotype data for variants that implements data preprocessing methods.
- static center(X, inplace=False)[source]
Mean centers genotype values, excluding missing values from computation.
- Parameters:
X (numpy.ndarray) – 2D array with dimensions \(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs.
- Returns:
mean-centered array
- Return type:
numpy.ndarray
- genotypes_by_id(vids)[source]
Retrieves genotypes for a set of variant ids (
vids).The ordering of the individuals (rows, axis 0) matches that returned by
get_iids(). The ordering of the variants (columns, axis 1) matches that returned byget_vids(). To change the default ordering of variants useupdate_individuals().- Parameters:
vids (pandas.Index) – one or more variant ids to retrieve genotypes for
- Returns:
tuple of genotypes for region of interest (numpy.ndarray (\(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs)); and pandas.DataFrame with names/vids of respective genotypes as index
- Return type:
(numpy.ndarray, pandas.Index)
- Raises:
NotImplementedError: Interface.
- genotypes_by_region(coordinates)[source]
Given a region of interest returns the variants that lie in the respective region.
The order of the individuals (rows, axis 0) matches that returned by
get_iids(). The order of the variants (columns, axis 1) matches that returned byget_vids(). To change the default order of individuals or genotypes useupdate_individuals()orupdate_variants(), respectively.- Parameters:
coordinates (dict) – a dictionary of genomic coordinates {“chr”: str, “start”: int, “end”: int}
- Returns:
tuple of genotypes for region of interest (numpy.ndarray (\(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs)); and pandas.DataFrame with names/vids of respective genotypes as index
- Return type:
(numpy.ndarray, dict, pandas.Index)
- Raises:
NotImplementedError: Interface.
- get_iids()[source]
Returns all individual ids.
- Returns:
Returns all individual ids.
- Return type:
np.ndarray
- Raises:
NotImplementedError – Interface.
- get_vids()[source]
Returns all variant ids.
- Returns:
Returns all variant ids.
- Return type:
np.ndarray
- Raises:
NotImplementedError – Interface.
- static mean_imputation(X)[source]
Replaces missing genotype values (np.nan) by the respective mean value for that variant.
Excludes missing values from computation.
- Parameters:
X (numpy.ndarray) – 2D array with dimensions \(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs.
- Returns:
array with missing values mean imputed along specified axis
- Return type:
numpy.ndarray
- static preprocess_genotypes(genotypes, vids, impute_mean=True, center=False, scale=False, normalize=False, invert_encoding=True, recode_maf=False, remove_singletons=False, remove_doubletons=False, min_maf=0, max_maf=1)[source]
Filters input variants and respective
vidsaccording to selected parameters.MAF: minor allele frequency.
- Parameters:
genotypes (numpy.ndarray) – \(n*m\) genotype matrix (with \(n:=\) number of individuals and \(m:=\) number of SNVs)
vids (pandas.Series or pandas.DataFrame) – information on the variants to be passed along (typically the variant ids)
impute_mean (bool) – whether missing values should be replaced by the mean
center (bool) – whether to mean center the data
scale (bool) – whether to scale the data by the standard deviation
normalize (bool) – whether to mean center and standardize the data by standard deviation
invert_encoding (bool) – whether to invert the encoding of the genotypes (pandas_plink which is used for data loading counts major alleles by default, thus by default we invert the encodings, such that the minor allele is counted)
recode_maf (bool) – encode genotypes according to sample, minor allele is counted (additive 0, 1, 2)
remove_singletons (bool) – whether to remove singleton genotypes (minor allele count 1); by default also removes singletons in case of reverted genotypes for the sample
remove_doubletons (bool) – whether to remove doubleton genotypes (minor allele count 2); by default also removes doubletons in case of reverted genotypes for the sample
min_maf (float) – variants with MAF smaller than min_maf in the sample are excluded
max_maf (float) – variants with MAF larger than max_maf in the sample are excluded
- Returns:
filtered genotypes, index of variants that remain after filtering
- Return type:
numpy.ndarray, pandas.Index
- static scale(X, inplace=False)[source]
Scales the genotype data by standard deviation, excluding missing values from computation.
- Parameters:
X (numpy.ndarray) – 2D array wit dimensions \(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs.
- Returns:
normalized array
- Return type:
numpy.ndarray
- static standardize(X, inplace=False)[source]
- Z-score normalizes the genotype data, excluding missing values from computation. If inplace == True, performs
operations inplace and returns a reference to X itself.
- src/seak/data_loaders.p :param numpy.ndarray X: 2D array with dimensions \(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs.
:: :return: z-score normalized array :rtype: numpy.ndarray
- update_individuals(iids, exclude=False)[source]
Set individuals to include/exclude.
When
exclude== False (default), this also changes the order to match the order iniids.- Parameters:
iids (pandas.Index) – Individual ids to include/exclude
- Raises:
NotImplementedError – Interface.
- update_variants(vids=None, coordinates=None, exclude=False)[source]
Set variants to include/exclude based on variant ids (
vids) or genetic coordinates (coordinates).When
exclude== False (default), this also changes the order to match the order invids.- Parameters:
vids (pandas.Index) – Variant ids to include/exclude
coordinates (dict) – genomic coordinates {“chr”: str, “start”: int, “end”: int} to include/exclude
- Raises:
NotImplementedError – Interface.
- class seak.data_loaders.VariantLoaderSnpReader(path_or_bed)[source]
Bases:
VariantLoaderLoads variants (SNVs) by wrapping SnpReader objects.
Can be initialized from either an SnpReader, or a path to a Plink bed-file.
Has methods to subset the dataset based on variant ids and individual ids. Can retrieve genotypes of a specific region or by variant id.
- Parameters:
path_or_bed (str) – SnpReader or path to binary PLINK 1 bed-file
- bed
- genotypes_by_id(vids, return_pos=False, missing_ok=False)[source]
Retrieves genotypes for a set of variant ids (
vids).The ordering of the individuals (rows, axis 0) matches that of
vids. The ordering of the variants (columns, axis 1) matches that returned byget_vids(). To change the default ordering of variants useupdate_variants().- Parameters:
vids (pandas.Index) – one or more variant ids to retrieve genotypes for
- Returns:
tuple of genotypes for region of interest (numpy.ndarray (number of individuals x number of SNVs)); and pandas.DataFrame with names/vids of respective genotypes as index
- Return type:
(numpy.ndarray, pandas.Index)
- genotypes_by_region(coordinates, return_pos=False)[source]
Given a region of interest returns the variants that lie in the respective region.
The order of the individuals (rows, axis 0) matches that returned by
get_iids(). The order of the variants (columns, axis 1) matches that returned byget_vids(). To change the default order of individuals or genotypes useupdate_individuals()orupdate_variants(), respectively If no variants lie in the requested region the method returns (None, None).- Parameters:
coordinates (dict) – a dictionary of genomic coordinates {“chr”: str, “start”: int, “end”: int}
return_pos (bool) – a boolean indicating whether variant positions should be returned
- Returns:
tuple of genotypes for region of interest (numpy.ndarray (number of individuals x number of SNVs)); and pandas.Index with names/vids of respective variants; If return_pos is True, a pandas.Series with variant positions.
- Return type:
(numpy.ndarray, pandas.Index, pandas.Series)
- iid_fid
- update_individuals(iids, exclude=False)[source]
Set individuals to include/exclude.
When
exclude== False (default), this also changes the order to match the order iniids:.- Parameters:
iids – Individual ids to include/exclude
- update_variants(vids=None, coordinates=None, exclude=False)[source]
Set variants to include/exclude based on variant ids (
vids) or genetic coordinates (coordinates).When
exclude== False (default), this also changes the order to match the order invids. Make sure that the vids are in the order of the veps in the respective index file if working with HDF5 files. –> Reason: HDF5 files can only be accessed indexing elements in increasing order!repeated IDs are ignored.
- Parameters:
vids – Variant ids to include/exclude
coordinates (dict) – genomic coordsrc/seak/data_loaders.pinates {“chr”: str, “start”: int, “end”: int} to include/exclude
- seak.data_loaders.intersect_and_update_datasets(test, variantloader, covariateloader, annotationloader=None, grmloader=None)[source]
Creates smallest common subset of class instances provided with respect to variants and individuals.
Modifies instances inplace.
For 2K case also computed background kernel \(K_0\).
- Parameters:
test (string) – [‘2K’, ‘noK’, ‘logit’]
variantloader (VariantLoader) – instance of respective class to be modified
covariateloader (CovariatesLoader) – instance of respective class to be modified
annotationloader (AnnotationLoader) – instance of respective class to be modified
grmloader (GRMLoader) – instance of respective class to be modified
The kernels module
Functions that can incorporate prior knowledge into set-association tests.
Any function that takes a genotype matrix G, variant effects V, and optional additional arguments and returns a set of transformed genotypes/variables to be tested.
- class seak.kernels.LocalCollapsing(distance_threshold=100.0)[source]
Bases:
objectClass for local collapsing of GV, based on a set of positions
- Parameters:
distance_threshold (float) – maximum distance allowed between SNPs within the same cluster
- collapse(G, pos, weights=1.0)[source]
Collapses G locally, based on pos
- Parameters:
G (np.ndarray) – Matrix containing the values to be collapsed, typically shape = (n_individuals, n_SNP)
pos (np.ndarray) – array-like, passed to sklearn.cluster.AgglomerativeClustering typically shape (n_samples, 1) for base pair positional clustering, but supports (n_samples, n_features) or (n_samples, n_samples).
weights (np.array) – single variant weights, shape = (n_SNP, ), default: equal weights for all variants
- Returns:
Collapsed matrix with shape (n_individuals, n_clusters), and the cluster assignments (n_individuals, )
- Return type:
tuple (np.ndarray, np.ndarray)
- seak.kernels.diffscore_max(G, V, scale_sqrt=True)[source]
Uses the largest absolute variant effect predictions (veps) per SNV.
Linear weighted kernel.
- Parameters:
G (numpy.ndarray) – SNVs to be tested (genotypes), \(nxm\) (\(n:=\) number of individuals, \(m:=\) number of SNVs)
V (numpy.ndarray) – veps \(mxk\) (\(m:=\) number of SNVs, \(k:=\) number of veps); dimension \(m\) needs to be equal in G and V
- Returns:
linear weighted genotype matrix (\(GxV\))
- Return type:
numpy.ndarray
- seak.kernels.linear_weighted(G, V)[source]
Linear weighted kernel.
Scales the SNVs by the variant effect predictions (veps) corresponding of a 1-dimensional array which gets transformed into a diagonal weight matrix as linear weights.
- Parameters:
G (numpy.ndarray) – SNVs to be tested (genotypes), \(nxm\) (\(n:=\) number of individuals, \(m:=\) number of SNVs)
V (numpy.ndarray) – veps \(mx1\) (\(m:=\) number of SNVs) with dimension 1 and shape: \(m\)
- Returns:
linear weighted genotype matrix (\(GxV\))
- Return type:
numpy.ndarray
- seak.kernels.single_column_kernel(i, scale_sqrt=True)[source]
Returns a kernel function with weights that correspond to a single column of the veps numpy.ndarray.
- Parameters:
i (int) – column of which weights should be selected
- Returns:
kernel function that linearly weights the genotypes with a single column of veps (e.g. corresponding to a specific transcription factor), needs to be called with genotypes and veps array
- Return type:
function
The lrt module
Contains classes and functions for Likelihood ratio tests.
The class LRTnoK implements the set-based likelihood ratio test for continuous phenotypes. A background kernel is not supported.
Null model:
Alternative model:
With: \(X\): covariates, dimension \(nxc\) (\(n:=\) number of individuals, \(c:=\) number of covariates), \(G_1\): set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for).
- class seak.lrt.LRTnoK(X, Y, REML=True)[source]
Bases:
objectSingle kernel Likelihood Ratio Test (LRT) for continuous phenotypes.
Sets up null model for given phenotypes and covariates. Use
altmodel()to fit an alternative model, and calculate the LRT test statistic. Usepv_sim()to simulate test statistics and generate empirical p-values.- Parameters:
X (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)
Y (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)
REML (bool) – use restricted maximum likelihood? (only “True” has been tested)
- Variables:
X – X
Y – Y
Xdagger – Moore-Penrose pseudo-inverse of X
model0 – dictionary with nLL of the null model
model1 – dictionary with nLL and other parameters of the alternative model
- altmodel(G1)[source]
Fit alternative model. Returns a dictionary with model parameters and test statistic.
- Parameters:
G1 (numpy.ndarray) – \(G_1\): set of variants (i.e. variables) to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for)
- Returns:
Dictionary of fitted model parameters
- Return type:
dict
Output dictionary: ‘nLL’ : negative log-likelihood ‘sigma2’ : the model variance sigma^2 ‘stat’ : LRT test statistic ‘alteqnull’ : h2 is 0 or negative (i.e. p-value is 1.) ‘beta’ : [D*1] array of fixed effects weights beta ‘h2’ : mixture weight between Covariance and noise ‘REML’ : True: REML was computed, False: ML was computed ‘a2’ : mixture weight between K0 and K1 ‘scale’ : Scale parameter that multiplies the Covariance matrix (default 1.0) ————————————————————————–
- pv_5050(lik1=None)[source]
Return p-value(s) calculated assuming a 50/50 mixture of chi2 df0 and chi2 df1. Considered a conservative estimate. If lik1 is None, returns the p-value for the current alternative model.
- Parameters:
lik1 (Union[float, numpy.ndarray]) – likelihood ratio test statistic
- Return type:
Union[float, numpy.ndarray]
- pv_sim(nsim=100000, seed=420)[source]
Runs “nsim” simulations of the test statistic IF the likelihood of the alternative model is greater than that of the null model, i.e. self.model_lik[‘alteqnull’] is False. returns a dictionary with the simulations and empirical p-value. The empirical p-value is 1. if self.model_lik[‘alteqnull’] is True, and the array of simulated test statistics will be empty.
- Parameters:
nsim (int) – Number of simulated test statistics to return
seed (int) – numpy random seed to use for simulations
- Returns:
Dictionary with containing the simulated test statistics, and empirical p-value
- Return type:
dict
- Output dictionary:
“res”: array of simulated test statistics “pv”: empirical p-value of current alternative model
- pv_sim_chi2(nsim=300, nsim0=100000, simzero=True, method='QR', seed=420, qmax=0.1)[source]
Calculate a gene-specific p-value by simulating nsim test statistics and estimating the mixture distribution as in Zhou 2016.
- Parameters:
nsim (int) – Number of simulations for the (R)LRT test statistic
nsim0 (int) – Number of simulations to estimate point-mass at 0
method (str) – Method used to fit continuous part of the the chi2 mixure distribution. ‘MM’ -> moment matching, ‘ML’ -> maximum likelihood or ‘QR’ -> Quantile regression
qmax (float) – Use only the values in the top qmax quantile to fit the chi2 distribution. This only affects the QR-method
simzero (bool) – Whether to separately run simulations for the estimation of the mixture parameter (default: True). This makes sense if nsim0 >> nsim
- Returns:
dictionary similar to that returned by pv_sim(), but with additional slots for estimated dof, scale and mixture parameters for the chi2 distribution
- seak.lrt.RLRTSim(X, Z, Xdagger, sqrtSigma=None, lambda0=nan, seed=2020, nsim=10000, use_approx=0, log_grid_hi=8, log_grid_lo=-10, gridlength=200)[source]
Python port of the LRTSim function using scipy & numpy see https://cran.r-project.org/web/packages/RLRsim/RLRsim.pdf for details, also Crainiceanu, C. and Ruppert, D. (2004) Likelihood ratio tests in linear mixed models with one variance component
- Parameters:
X (numpy.ndarray) – The fixed effects design matrix of the model under the alternative
Z (numpy.ndarray) – The random effects design matrix of the model under the alternative (i.e. G_1)
Xdagger (numpy.ndarray) – Moore-Penrose pseudo-inverse of X
sqrtSigma (numpy.ndarray) – The upper triangular Cholesky factor of the correlation matrix of the random effect
lambda0 (numpy.ndarray) –
seed – numpy random seed to use
nsim (int) – Number of values to simulate
use_approx (int) – Not implemented (yet)
log_grid_hi – Higher value of the grid on the log scale.
log_grid_lo – Lower value of the grid on the log scale.
gridlength (int) – Length of the grid for the grid search over lambda
- Returns:
A dictionary containing the simulations and other parameters
- Return type:
dict
- Output dictionary:
- “res”: np.array of simulated test statistics
…: parameters and intermediate variables used to generate simulated test statistics
- class seak.lrt.chi2mixture(lrt, mixture=None, tol=0.0, scalemin=0.1, scalemax=5.0, dofmin=0.1, dofmax=5.0, qmax=None, alteqnull=None, abserr=None, fitdof=None, dof=None)[source]
Bases:
objectmixture here denotes the weight on the non-zero dof compnent
- abserr
- alteqnull
- dof
- dofmax
- dofmin
- fit_params_Qreg()[source]
Fit the scale and dof parameters of the model by minimizing the squared error between the model log quantiles and the log P-values obtained on the lrt values. Only the top qmax quantile is being used for the fit (self.qmax is used in fit_scale_logP).
- fitdof
- imax
- isortlrt
- lrt
- lrtsort
- mixture
- qmax
- qnulllrtsort
- scale
- scalemax
- scalemin
- tol
- seak.lrt.estimate_pointmass0_null(mu, n, p, nsim=100000, xi=None, seed=None, REML=True)[source]
The point mass of the null distribution of the (R)LRT at 0 can be approximated by the probability of the (R)LRT test statistic having a local maximum at 0. See Crainiceanu, Ruppert: Likelihood ratio tests in linear mixed models
with one variance component (2003), Equations (8) & (9)
- and Zhou: Boosting Gene Mapping Power and Efficiency with Efficient Exact Variance
Component Tests of Single Nucleotide Polymorphism Sets (2016)
- Parameters:
mu (numpy.ndarray) – Eigenvalues of the Genotype matrix with covariates projected out
n (float) – Number of individuals
p (float) – Number of fixed effect parameters (covariates)
nsim (float) – Number of simulations to perform (default: 100,000)
xi (numpy.ndarray) – Eigenvalues of the Genotype matrix, if REML == False
int – numpy seed
REML – use REML? (default: True)
- Returns:
The probability of having point-mass at null
- Return type:
float
- seak.lrt.fit_chi2_maximum_likelihood(sims)[source]
Fits a chi2 distribution using Maximum Likelihood. (wraps the sci)
This has not been tested and is not recommended.
- Parameters:
sims – array of LRT test statistics (continuous part of the distribution)
- Returns:
Dictionary of distribution parameters estimated with Maximum Likelihood (ML)
- seak.lrt.fit_chi2_moment_matching(sims)[source]
Fits a chi2 distribution using moment matching (as in Zhou 2016)
This has not been tested and is not recommended.
- Parameters:
sims – array of LRT test statistics (continuous part of the distribution)
- Returns:
Dictionary of distribution parameters estimated with Moment Matching (MM)
- seak.lrt.fit_chi2mixture(sims, mixture=None, qmax=0.1)[source]
Takes simulated test statistics as input. Fits a chi2 mixture to them using “quantile regression” (Listgarten 2013). returns a dictionary.When mixture parameter is given it only fits scale and dof.
This is the preferred method.
- Parameters:
sims (numpy.ndarray) – array of tests statistics
mixture (float) – mixture weight
qmax (float) – fraction of largest test statistics to fit the distribution with
- Returns:
Dictionary with chi2 mixture parameters
- Return type:
dict
- Output dictionary:
“mse” : mean squared error “dof” : degrees of freedom “scale” : scale “imax” : number of values used to fit the ditribution
- seak.lrt.make_lambdagrid(lambda0, gridlength, log_grid_lo, log_grid_hi)[source]
Port of the R-function to create the lambda grid (used inside RLRTsim).
Creates a grid of lambda values to optimize over.
- seak.lrt.pv_chi2mixture(stat, scale, dof, mixture, alteqnull=None)[source]
Returns p-values(s) using chi2 with custom parameters. see
fit_chi2mixture()wraps chi2.sf(stat/scale, dof) * mixture :param stat: The LRT test statistic :type stat: Union[float, np.array] :param float scale: scaling factor applied to the test statistic :param float dof: degrees of freedom :param float mixture: mixture weight :return: p-value :rtype: float
- seak.lrt.rlrsim(p, k, n, nsim, g, q, mu, lambd, lambda0, xi=None, REML=True)[source]
Python port of the RLRsim.cpp function using scipy & numpy
This version uses broadcasting + cython to speed up operations. It is about 3-10 times faster than the numpy-version (rlrsim_np).
- Parameters:
p (int) – number of covariates in X
k (int) – number of variables to test
n (int) – number of individuals
nsim (int) – number of sumulations to perform
g (int) – lambda grid length
q (int) – number of free parameters in the null-model (?)
mu (numpy.ndarray) – array of eigenvalues
lambd (numpy.ndarray) – array of lambda values
lambda0 (float) – lambda0 value
REML (bool) – perform REML (default=True)
- Xi:
?
- seak.lrt.rlrsim_loop(p, k, n, nsim, g, q, mu, lambd, lambda0, xi=None, REML=True)[source]
Python port of the RLRsim.cpp function using scipy & numpy
This is a “naive” port with the same loops used in the C++ code It’s recommended to use rlrsim() instead, as it’s much faster. This function is mainly here for documentation purposes.
See
rlrsim()
- seak.lrt.rlrsim_np(p, k, n, nsim, g, q, mu, lambd, lambda0, xi=None, REML=True)[source]
Python port of the RLRsim.cpp function using scipy & numpy
This version uses broadcasting to speed up operations. It uses a lot of memory for large nsim, but is way faster than rlrsim_loop()
- Parameters:
p (int) – number of covariates in X
k (int) – number of variables to test
n (int) – number of individuals
nsim (int) – number of sumulations to perform
g (int) – lambda grid length (not used anymore…)
q (int) – number of free parameters in the null-model (?)
mu (numpy.ndarray) – array of eigenvalues
lambd (numpy.ndarray) – array of lambda values
lambda0 (float) – lambda0 value
REML (bool) – perform REML (default=True)
- Xi:
?
The scoretest module
Contains classes for variance component score tests.
A single kernel and two-kernel score-based test is available for a linear link function (continuous phenotypes) (ScoretestNoK and Scoretest2K).
With the second kernel \(K_0\) correcting for population structure.
A single kernel score-based test is available for a logistic link function (binary phenotypes) (ScoretestLogit)
Null model single kernel:
Alternative model single kernel:
Null model two-kernel:
Alternative model two-kernel:
With: \(X\): covariates, dimension \(nxc\) (\(n:=\) number of individuals, \(c:=\) number of covariates), \(G_0\): variants to construct the background kernel \(K_0\) from, correcting for population structure, dimensions \(nxm\) (\(n:=\) number of individuals, \(m:=\) number of variants/variants), \(G_1\): set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for).
Note
For all classes of the module scoretest the class attributes are instance attributes! This is a bug in the automatic documentation.
Note
The source code of the mathematical implementation is adopted from the FaST-LMM Python package from Microsoft Corporation.
Source code can be found in fastlmm.association.score.
- exception seak.scoretest.DaviesError[source]
Bases:
Exceptionraise this error if Davie’s method does not converge
- class seak.scoretest.Scoretest(phenotypes, covariates)[source]
Bases:
objectSuperclass for all score-based set-association tests.
Defines all common attributes and implements methods that all subclasses share, such as methods for p-value computation for alternative
pv_alt_model()model.Interface for Scoretest subclasses.
- Variables:
Y (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)
X (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)
N (int) – number of individuals
P (int) – number of phenotypes
D (int) – number of covariates
Neff (int) – effective sample size (number of individuals) as used in calculations of unbiased estimators
- D
- N
- Neff
- P
- X
- Y
- static has_bias(covariates)[source]
Checks whether at least one all constant column (bias) is present in the covariates.
- pv_alt_model(G1, G2=None, method='davies')[source]
Computes p-value of the alternative model.
- Parameters:
G1 (numpy.ndarray) – set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants to test association for)
G2 (numpy.ndarray) – set of variants to condition on (\(n:=\) number of individuals, \(k:=\) number of variants to condition on)
- Returns:
p-value of the alternative model
- Return type:
float
- class seak.scoretest.Scoretest2K(phenotypes, covariates=None, K0=None, G0=None, forcefullrank=False)[source]
Bases:
ScoretestTwo-kernel score-based set-association test for continuous phenotypes.
Sets up null model for given phenotypes, covariates and background kernel \(K_0\) or background genotypes \(G_0\). If no covariates are given or if the covariates do not contain a bias column, a bias column is added. Compute p-value for alternative model with
pv_alt_model().Null model two-kernel:
\[Y = {\alpha} X + {\beta} G_0 + {\epsilon}\]Alternative model two-kernel:
\[Y = {\alpha} X + {\beta} G_0 + {\gamma} G_1 + {\epsilon}\]With: \(X\): covariates, dimension \(nxc\) (\(n:=\) number of individuals, \(c:=\) number of covariates) \(G_0\): variants to construct the background kernel \(K_0\) from, correcting for population structure, dimensions \(nxm\) (\(n:=\) number of individuals, \(m:=\) number of variants/variants) \(G_1\): set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for)
- Parameters:
phenotypes (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)
covariates (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)
K0 (numpy.ndarray) – genetic similarity matrix/GRM \(K_0\) accounting for confounding
G0 (numpy.ndarray) – genotype matrix \(G_0\) used to contruct math:K_0 to account for confounding
forcefullrank (boolean) – for testing purposes only
- Variables:
S – eigenvalues of PKP
Xdagger – Moore-Penrose pseudo-inverse of the covariate matrix X
sigma2e – environmental variance
sigma2g – genetic variance
- G0
- K0
- S
- U
- UUY
- UY
- Xdagger
- YUUY
- lowrank
- optparams
- sigma2e
- sigma2g
- class seak.scoretest.ScoretestLogit(phenotypes, covariates)[source]
Bases:
ScoretestSingle kernel score-based set-association test for binary phenotypes.
Sets up null model for given phenotypes and covariates. If no covariates are given or if the covariates do not contain a bias column, a bias column is added. Compute p-value for alternative model with
pv_alt_model().- Parameters:
phenotypes (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)
covariates (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)
- VX
- pY
- pinvVX
- stdY
- class seak.scoretest.ScoretestNoK(phenotypes, covariates)[source]
Bases:
ScoretestSingle kernel score-based set-association test for continuous phenotypes.
Sets up null model for given phenotypes and covariates. If no covariates are given or if the covariates do not contain a bias column, a bias column is added. Compute p-value for alternative model with
pv_alt_model().Null model single kernel:
\[Y = {\alpha} X + {\epsilon}\]Alternative model single kernel:
\[Y = {\alpha} X + {\gamma} G_1 + {\epsilon}\]With: \(X\): covariates, dimension \(nxc\) (\(n:=\) number of individuals, \(c:=\) number of covariates) \(G_1\): set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for)
- Parameters:
phenotypes (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)
covariates (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)
- Variables:
RxY – OLS residuals of the phenotype after regressing out fixed effects (covariates X)
Xdagger – Moore-Penrose pseudo-inverse of the covariate matrix X
sigma2 – environmental variance
- RxY
- Xdagger
- coef(G1)[source]
Returns single-variant regression coefficients and their estimated variances
- Parameters:
G1 (numpy.ndarray) – N x 1 vector
- Return dict:
dictionary with two slots: “beta” and “var_beta”
- sigma2