seak¶
The construct_background_kernel module¶
Computes a background kernel matrix \(K_0\) or respective genotype matrix \(G_0\) from
binary PLINK 1 files (GRMLoader) or
loads a precomputed matrix \(K_0\) from a NumPy file (GRM_from_file).
Needed, if a linear mixed model is used to correct for confounding. In low rank case (\(n>k\)): loads a genotype matrix \(G_0\). In full rank case (\(n<k\)): computes background kernel \(K_0\). With \(n:=\) number of individuals and \(k:=\) number of SNVs.
GRM: genetic relatedness matrix.
- class seak.construct_background_kernel.GRMLoaderSnpReader(path_or_bed, blocksize, LOCO_chrom_id=None, forcelowrank=False)[source]¶
Bases:
objectConstructs a background kernel \(K_0\) from given binary PLINK 1 genotype files using the leave-one-out-chromosome (LOCO) strategy.
Initially no background kernel is constructed, only the instance attributes are initialized. The kernel gets constructed when calling the method
compute_background_kernel()which should only be called after calling theupdate_ind()method manually or theseak.data_loaders.intersect_and_update_datasets(). This way, individuals which are neither contained in the test nor in the background kernel data set are excluded.In full rank case, loads the SNPs in blocks to construct the kernel. In low rank case, loads all SNPs into memory at once.
- Parameters
path_to_plink_files_with_prefix (str) – path prefix to genotype PLINK files for background kernel construction
blocksize (int) – how many genotypes to load at once; should be chosen dependent on RAM available
LOCO_chrom_id (str/int) – identifier of the chromosome/region that is used in the respective test set and should be excluded from the background kernel or None if all variants should be included
forcelowrank (bool) – enforce low rank data loading behavior for testing purposes
Note
The leave-one-chromosome-out (LOCO) strategy can be disabled with
LOCO_chrom_id.- compute_background_kernel()[source]¶
Computes background kernel \(K_0\) for given set of genotypes (binary PLINK 1 files).
Overlap with data of set to be tested should have been carried out before, such that individuals in both data sets match. Does not return anything but sets instance attributes for either the background kernel \(K_0\) or the background kernel genotype matrix \(G_0\).
- class seak.construct_background_kernel.GRMLoader_from_file(path_to_GRM_npy, forcelowrank=False)[source]¶
Bases:
objectLoad a precomputed GRM from a NumPy file (.npy or .npz).
Must either be of low rank \(n>k\) or a square \(nxn\) matrix with \(n:=\) number of individuals and \(k:=\) number of SNVs.
The data_loaders module¶
Contains all data IO and processing functionalities to read and manipulate genotype, covariate and variant effect prediction data.
Implements interfaces for all data types such that you can write your custom data loading classes (VariantLoader, AnnotationLoader, CovariatesLoader).
Accepted input file format for genotype data is the binary PLINK 1 format.
Accepted input file format for covariate and phenotype data is the CSV format.
Accepted input file format for variant effect predictions is the HDF5 format, following the syntax of the output of
Janggu.predict_variant_effect() (Janggu documentation) which needs to be accompanied by a index file in TSV format.
Note
The program assumes that genotype files are 1-based and fully closed, and variant effect prediction files 0-based and half-open.
- class seak.data_loaders.AnnotationLoader[source]¶
Bases:
objectInterface for loading variant annotations that implements data preprocessing methods.
- anno_by_id(vids)[source]¶
Retrieves annotations for a set of variant ids (
vids).The order will match that of
vids.- Parameters
vids (pandas.Index) – names of variants to retrieve annotations for
- Returns
variant effect predictions for given variants
- Return type
numpy.ndarray
- Raises
NotImplementedError – Interface.
- get_vids()[source]¶
Returns all variant ids.
- Returns
Returns all variant ids.
- Return type
pandas.Index
- Raises
NotImplementedError – Interface.
- update_variants(vids=None, coordinates=None, exclude=False)[source]¶
Set variants to include/exclude based on variant ids (
vids) or genetic coordinates (coordinates).When
exclude== False (default), this also changes the order to match the order ofvids.- Parameters
vids (pandas.Index) – Variant ids to include/exclude
coordinates (dict) – a dictionary of genomic coordinates {“chr”: str, “start”: int, “end”: int, “name”: str}
- Raises
NotImplementedError – Interface.
- class seak.data_loaders.BEDRegionLoader(path_to_regions_UCSC_BED, chrom_to_load=None, drop_non_numeric_chromosomes=False)[source]¶
Bases:
seak.data_loaders.RegionLoaderReads all regions in a UCSC BED-formatted file into memory.
- Variables
regions (pandas.DataFrame) – DataFrame storing regions in memory {“chr”: str, “start”: int, “end”: int, “name”: str}
- regions¶
- class seak.data_loaders.CovariatesLoader[source]¶
Bases:
objectStores covariates (fixed effects) and phenotypes.
- class seak.data_loaders.CovariatesLoaderCSV(phenotype_of_interest, path_to_covariates, covariate_column_names, sep=',', path_to_phenotypes=None)[source]¶
Bases:
seak.data_loaders.CovariatesLoaderStores covariates (fixed effects) and phenotypes.
Initializes instance of a CovariatesLoader class from a CSV file.
- Parameters
phenotype_of_interest (str) – column name of phenotype that should be loaded
path_to_covariates (str) – path to CSV file with phenotype (if path_to_phenotypes is None) and covariates
covariate_column_names (list(str)) – list of the covariate column names
path_to_phenotypes (str) – path to CSV file with phenotype (if stored separately from covariates)
- cov¶
- covariate_column_names¶
- get_one_hot_covariates_and_phenotype(test_type, pheno=None)[source]¶
Returns numpy.ndarray of the phenotype and one hot encoded covariates with invariant covariates removed and a bias column.
Make sure
update_cov()was called beforehand, such thatself.covhas the same order as the genotypes. Internally calls functionpandas.get_dummies()which converts categorical variable into indicator variables.- Parameters
test_type (string) – Either ‘logit’, ‘2K’ or ‘noK’
- Returns
data for phenotype and one hot encoded covariates with invariant covariates removed and bias column
- Return type
numpy.ndarray
- phenotype_of_interest¶
- class seak.data_loaders.EnsemblVEPLoader(uploaded_variation, location, gene, allele=None, data=None)[source]¶
Bases:
seak.data_loaders.AnnotationLoaderCreates an AnnotationLoader from outputs columns of the Ensembl Variant Effect Predictor (VEP).
Specifically the columns ‘Uploaded_variation’, ‘Location’, ‘Allele’ and ‘Gene’. Additionally will take a numpy array or pandas dataframe with effects for those variants.
The format of the columns are described on: https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#defaultout
Uploaded variation - as chromosome_start_alleles Location - in standard coordinate format (chr:start or chr:start-end) Allele - the variant allele used to calculate the consequence Gene - Ensembl stable ID of affected gene
Positions in “Location” are one-based. “Uploaded_variation” is used as the variant ID.
- Parameters
uploaded_variation – Series containing “Uploaded variation”
location – Series containing “Location”
allele – Series containing “Allele”
gene – Series containing “Gene”
data – DataFrame or ndarray containing the variant effect predictions
- anno_by_id(vids=None, gene=None)[source]¶
Retrieves gene-specific variant effect predictions based on variant IDs (vids) and gene
- Parameters
vids – variant ids
gene – gene ids
default – if not None, default value to return when no variant effect predictions are found for a requested variant.
- anno_by_interval(coordinates, gene=None)[source]¶
Retrieves gene-specific variant effect predictions based on position and gene
- Parameters
coordinates (dict) – genomic coordinates {“chr”: str, “start”: int, “end”: int}
gene – gene id
- get_vids()[source]¶
Returns all variant ids.
- Returns
Returns all variant ids.
- Return type
pandas.Index
- Raises
NotImplementedError – Interface.
- pos_df¶
- update_variants(vids=None, coordinates=None, exclude=False)[source]¶
Set variants to include/exclude based on variant ids (
vids) or genetic coordinates (coordinates).- Parameters
vids (pandas.Index) – Variant ids to include/exclude
coordinates (dict) – genomic coordinates {“chr”: str, “start”: int, “end”: int} to include/exclude
- vep_df¶
- class seak.data_loaders.Hdf5Loader(path_to_vep_bed, path_to_vep_hdf5, hdf5_key, from_janggu=False)[source]¶
Bases:
seak.data_loaders.AnnotationLoaderLoads variant annotations from HDF5 files accompanied by a UCSC BED file or TSV with variant information.
The HDF5 files need to follow the syntax of the output of
Janggu.predict_variant_effect()(Janggu documentation). You need to specify a key to the data. Instances of the class still needs to be overlapped with respective genotypes and vice versa. UCSC BED files/or TSV require first three fields: chrom, chromStart, chromEnd; we also require field name, that needs to match the names of the genotype data. Above that: if TSV and not USCS BED file is given make sure the genomic coordinates are 0-based an half-open.- Parameters
path_to_vep_bed (str) – path to UCSC BED file containing meta info of variant effect predictions
path_to_vep_hdf5 (str) – path to HDF5 file containing the variant effect predictions
hdf5_key (str) – name of variant effect predict to load
from_janggu (bool) – whether janggu was used to generate the veps, if yes, naming convention is presupposed
- anno_by_id(vids, shuffle=None)[source]¶
Retrieves annotations for a set of variant ids (
vids).The order will match that of vids.
- Parameters
vids (pandas.Index) – names of variants to retrieve annotations for
shuffle (int) – For testing purpose only; default None; shuffles loaded veps along axis 0 or 1
- Returns
variant effect predictions for given variants
- Return type
numpy.ndarray
- from_janggu¶
- set_mask(mask)[source]¶
Sets a boolean mask along the annotation-axis to be applied when loading data. Can be used to sub-set annotations.
- Parameters
mask (np.ndarray) – boolean mask to be applied to the columns (second dimension) of the annotations loaded.
- update_variants(vids=None, coordinates=None, exclude=False)[source]¶
Set variants to include/exclude based on variant ids (
vids) or genetic coordinates (coordinates).When
exclude== False (default), this also changes the order to match the order invids. Make sure that the vids are in the order of the veps in the respective index file if working with HDF5 files. –> Reason: HDF5 files can only be accessed indexing elements in increasing order!- Parameters
vids (pandas.Index) – Variant ids to include/exclude
coordinates (dict) – genomic coordinates {“chr”: str, “start”: int, “end”: int} to include/exclude
- veps¶
- veps_index_df¶
- class seak.data_loaders.RegionLoader[source]¶
Bases:
objectInterface for loading genomic regions whose subclasses can be used to group variants into sets to test jointly.
Subclasses need only implement __iter__, which yields 0-based, right-open (->left-closed) dictionaries of genomic intervals {“chr”: str, “start”: int, “end”: int, “name”: str}
- class seak.data_loaders.VariantLoader[source]¶
Bases:
objectInterface for loading genotype data for variants that implements data preprocessing methods.
- static center(X, inplace=False)[source]¶
Mean centers genotype values, excluding missing values from computation.
- Parameters
X (numpy.ndarray) – 2D array with dimensions \(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs.
- Returns
mean-centered array
- Return type
numpy.ndarray
- genotypes_by_id(vids)[source]¶
Retrieves genotypes for a set of variant ids (
vids).The ordering of the individuals (rows, axis 0) matches that returned by
get_iids(). The ordering of the variants (columns, axis 1) matches that returned byget_vids(). To change the default ordering of variants useupdate_individuals().- Parameters
vids (pandas.Index) – one or more variant ids to retrieve genotypes for
- Returns
tuple of genotypes for region of interest (numpy.ndarray (\(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs)); and pandas.DataFrame with names/vids of respective genotypes as index
- Return type
(numpy.ndarray, pandas.Index)
- Raises
NotImplementedError: Interface.
- genotypes_by_region(coordinates)[source]¶
Given a region of interest returns the variants that lie in the respective region.
The order of the individuals (rows, axis 0) matches that returned by
get_iids(). The order of the variants (columns, axis 1) matches that returned byget_vids(). To change the default order of individuals or genotypes useupdate_individuals()orupdate_variants(), respectively.- Parameters
coordinates (dict) – a dictionary of genomic coordinates {“chr”: str, “start”: int, “end”: int}
- Returns
tuple of genotypes for region of interest (numpy.ndarray (\(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs)); and pandas.DataFrame with names/vids of respective genotypes as index
- Return type
(numpy.ndarray, dict, pandas.Index)
- Raises
NotImplementedError: Interface.
- get_iids()[source]¶
Returns all individual ids.
- Returns
Returns all individual ids.
- Return type
np.ndarray
- Raises
NotImplementedError – Interface.
- get_vids()[source]¶
Returns all variant ids.
- Returns
Returns all variant ids.
- Return type
np.ndarray
- Raises
NotImplementedError – Interface.
- static mean_imputation(X)[source]¶
Replaces missing genotype values (np.nan) by the respective mean value for that variant.
Excludes missing values from computation.
- Parameters
X (numpy.ndarray) – 2D array with dimensions \(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs.
- Returns
array with missing values mean imputed along specified axis
- Return type
numpy.ndarray
- static preprocess_genotypes(genotypes, vids, impute_mean=True, center=False, scale=False, normalize=False, invert_encoding=True, recode_maf=False, remove_singletons=False, remove_doubletons=False, min_maf=0, max_maf=1)[source]¶
Filters input variants and respective
vidsaccording to selected parameters.MAF: minor allele frequency.
- Parameters
genotypes (numpy.ndarray) – \(n*m\) genotype matrix (with \(n:=\) number of individuals and \(m:=\) number of SNVs)
vids (pandas.Series or pandas.DataFrame) – information on the variants to be passed along (typically the variant ids)
impute_mean (bool) – whether missing values should be replaced by the mean
center (bool) – whether to mean center the data
scale (bool) – whether to scale the data by the standard deviation
normalize (bool) – whether to mean center and standardize the data by standard deviation
invert_encoding (bool) – whether to invert the encoding of the genotypes (pandas_plink which is used for data loading counts major alleles by default, thus by default we invert the encodings, such that the minor allele is counted)
recode_maf (bool) – encode genotypes according to sample, minor allele is counted (additive 0, 1, 2)
remove_singletons (bool) – whether to remove singleton genotypes (minor allele count 1); by default also removes singletons in case of reverted genotypes for the sample
remove_doubletons (bool) – whether to remove doubleton genotypes (minor allele count 2); by default also removes doubletons in case of reverted genotypes for the sample
min_maf (float) – variants with MAF smaller than min_maf in the sample are excluded
max_maf (float) – variants with MAF larger than max_maf in the sample are excluded
- Returns
filtered genotypes, index of variants that remain after filtering
- Return type
numpy.ndarray, pandas.Index
- static scale(X, inplace=False)[source]¶
Scales the genotype data by standard deviation, excluding missing values from computation.
- Parameters
X (numpy.ndarray) – 2D array wit dimensions \(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs.
- Returns
normalized array
- Return type
numpy.ndarray
- static standardize(X, inplace=False)[source]¶
- Z-score normalizes the genotype data, excluding missing values from computation. If inplace == True, performs
operations inplace and returns a reference to X itself.
- src/seak/data_loaders.p :param numpy.ndarray X: 2D array with dimensions \(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs.
:: :return: z-score normalized array :rtype: numpy.ndarray
- update_individuals(iids, exclude=False)[source]¶
Set individuals to include/exclude.
When
exclude== False (default), this also changes the order to match the order iniids.- Parameters
iids (pandas.Index) – Individual ids to include/exclude
- Raises
NotImplementedError – Interface.
- update_variants(vids=None, coordinates=None, exclude=False)[source]¶
Set variants to include/exclude based on variant ids (
vids) or genetic coordinates (coordinates).When
exclude== False (default), this also changes the order to match the order invids.- Parameters
vids (pandas.Index) – Variant ids to include/exclude
coordinates (dict) – genomic coordinates {“chr”: str, “start”: int, “end”: int} to include/exclude
- Raises
NotImplementedError – Interface.
- class seak.data_loaders.VariantLoaderSnpReader(path_or_bed)[source]¶
Bases:
seak.data_loaders.VariantLoaderLoads variants (SNVs) by wrapping SnpReader objects.
Can be initialized from either an SnpReader, or a path to a Plink bed-file.
Has methods to subset the dataset based on variant ids and individual ids. Can retrieve genotypes of a specific region or by variant id.
- Parameters
path_or_bed (str) – SnpReader or path to binary PLINK 1 bed-file
- bed¶
- genotypes_by_id(vids, return_pos=False, missing_ok=False)[source]¶
Retrieves genotypes for a set of variant ids (
vids).The ordering of the individuals (rows, axis 0) matches that of
vids. The ordering of the variants (columns, axis 1) matches that returned byget_vids(). To change the default ordering of variants useupdate_variants().- Parameters
vids (pandas.Index) – one or more variant ids to retrieve genotypes for
- Returns
tuple of genotypes for region of interest (numpy.ndarray (number of individuals x number of SNVs)); and pandas.DataFrame with names/vids of respective genotypes as index
- Return type
(numpy.ndarray, pandas.Index)
- genotypes_by_region(coordinates, return_pos=False)[source]¶
Given a region of interest returns the variants that lie in the respective region.
The order of the individuals (rows, axis 0) matches that returned by
get_iids(). The order of the variants (columns, axis 1) matches that returned byget_vids(). To change the default order of individuals or genotypes useupdate_individuals()orupdate_variants(), respectively If no variants lie in the requested region the method returns (None, None).- Parameters
coordinates (dict) – a dictionary of genomic coordinates {“chr”: str, “start”: int, “end”: int}
return_pos (bool) – a boolean indicating whether variant positions should be returned
- Returns
tuple of genotypes for region of interest (numpy.ndarray (number of individuals x number of SNVs)); and pandas.Index with names/vids of respective variants; If return_pos is True, a pandas.Series with variant positions.
- Return type
(numpy.ndarray, pandas.Index, pandas.Series)
- iid_fid¶
- update_individuals(iids, exclude=False)[source]¶
Set individuals to include/exclude.
When
exclude== False (default), this also changes the order to match the order iniids:.- Parameters
iids – Individual ids to include/exclude
- update_variants(vids=None, coordinates=None, exclude=False)[source]¶
Set variants to include/exclude based on variant ids (
vids) or genetic coordinates (coordinates).When
exclude== False (default), this also changes the order to match the order invids. Make sure that the vids are in the order of the veps in the respective index file if working with HDF5 files. –> Reason: HDF5 files can only be accessed indexing elements in increasing order!repeated IDs are ignored.
- Parameters
vids – Variant ids to include/exclude
coordinates (dict) – genomic coordsrc/seak/data_loaders.pinates {“chr”: str, “start”: int, “end”: int} to include/exclude
- seak.data_loaders.intersect_and_update_datasets(test, variantloader, covariateloader, annotationloader=None, grmloader=None)[source]¶
Creates smallest common subset of class instances provided with respect to variants and individuals.
Modifies instances inplace.
For 2K case also computed background kernel \(K_0\).
- Parameters
test (string) – [‘2K’, ‘noK’, ‘logit’]
variantloader (VariantLoader) – instance of respective class to be modified
covariateloader (CovariatesLoader) – instance of respective class to be modified
annotationloader (AnnotationLoader) – instance of respective class to be modified
grmloader (GRMLoader) – instance of respective class to be modified
The kernels module¶
Functions that can incorporate prior knowledge into set-association tests.
Any function that takes a genotype matrix G, variant effects V, and optional additional arguments and returns a set of transformed genotypes/variables to be tested.
- class seak.kernels.LocalCollapsing(distance_threshold=100.0)[source]¶
Bases:
objectClass for local collapsing of GV, based on a set of positions
- Parameters
distance_threshold (float) – maximum distance allowed between SNPs within the same cluster
- collapse(G, pos, weights=1.0)[source]¶
Collapses G locally, based on pos
- Parameters
G (np.ndarray) – Matrix containing the values to be collapsed, typically shape = (n_individuals, n_SNP)
pos (np.ndarray) – array-like, passed to sklearn.cluster.AgglomerativeClustering typically shape (n_samples, 1) for base pair positional clustering, but supports (n_samples, n_features) or (n_samples, n_samples).
weights (np.array) – single variant weights, shape = (n_SNP, ), default: equal weights for all variants
- Returns
Collapsed matrix with shape (n_individuals, n_clusters), and the cluster assignments (n_individuals, )
- Return type
tuple (np.ndarray, np.ndarray)
- seak.kernels.diffscore_max(G, V, scale_sqrt=True)[source]¶
Uses the largest absolute variant effect predictions (veps) per SNV.
Linear weighted kernel.
- Parameters
G (numpy.ndarray) – SNVs to be tested (genotypes), \(nxm\) (\(n:=\) number of individuals, \(m:=\) number of SNVs)
V (numpy.ndarray) – veps \(mxk\) (\(m:=\) number of SNVs, \(k:=\) number of veps); dimension \(m\) needs to be equal in G and V
- Returns
linear weighted genotype matrix (\(GxV\))
- Return type
numpy.ndarray
- seak.kernels.linear_weighted(G, V)[source]¶
Linear weighted kernel.
Scales the SNVs by the variant effect predictions (veps) corresponding of a 1-dimensional array which gets transformed into a diagonal weight matrix as linear weights.
- Parameters
G (numpy.ndarray) – SNVs to be tested (genotypes), \(nxm\) (\(n:=\) number of individuals, \(m:=\) number of SNVs)
V (numpy.ndarray) – veps \(mx1\) (\(m:=\) number of SNVs) with dimension 1 and shape: \(m\)
- Returns
linear weighted genotype matrix (\(GxV\))
- Return type
numpy.ndarray
- seak.kernels.single_column_kernel(i, scale_sqrt=True)[source]¶
Returns a kernel function with weights that correspond to a single column of the veps numpy.ndarray.
- Parameters
i (int) – column of which weights should be selected
- Returns
kernel function that linearly weights the genotypes with a single column of veps (e.g. corresponding to a specific transcription factor), needs to be called with genotypes and veps array
- Return type
function
The lrt module¶
Contains classes and functions for Likelihood ratio tests.
The class LRTnoK implements the set-based likelihood ratio test for continuous phenotypes. A background kernel is not supported.
Null model:
Alternative model:
With: \(X\): covariates, dimension \(nxc\) (\(n:=\) number of individuals, \(c:=\) number of covariates), \(G_1\): set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for).
- class seak.lrt.LRTnoK(X, Y, REML=True)[source]¶
Bases:
objectSingle kernel Likelihood Ratio Test (LRT) for continuous phenotypes.
Sets up null model for given phenotypes and covariates. Use
altmodel()to fit an alternative model, and calculate the LRT test statistic. Usepv_sim()to simulate test statistics and generate empirical p-values.- Parameters
X (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)
Y (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)
REML (bool) – use restricted maximum likelihood? (only “True” has been tested)
- Variables
X – X
Y – Y
Xdagger – Moore-Penrose pseudo-inverse of X
model0 – dictionary with nLL of the null model
model1 – dictionary with nLL and other parameters of the alternative model
- altmodel(G1)[source]¶
Fit alternative model. Returns a dictionary with model parameters and test statistic.
- Parameters
G1 (numpy.ndarray) – \(G_1\): set of variants (i.e. variables) to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for)
- Returns
Dictionary of fitted model parameters
- Return type
dict
Output dictionary: ‘nLL’ : negative log-likelihood ‘sigma2’ : the model variance sigma^2 ‘stat’ : LRT test statistic ‘alteqnull’ : h2 is 0 or negative (i.e. p-value is 1.) ‘beta’ : [D*1] array of fixed effects weights beta ‘h2’ : mixture weight between Covariance and noise ‘REML’ : True: REML was computed, False: ML was computed ‘a2’ : mixture weight between K0 and K1 ‘scale’ : Scale parameter that multiplies the Covariance matrix (default 1.0) ————————————————————————–
- pv_5050(lik1=None)[source]¶
Return p-value(s) calculated assuming a 50/50 mixture of chi2 df0 and chi2 df1. Considered a conservative estimate. If lik1 is None, returns the p-value for the current alternative model.
- Parameters
lik1 (Union[float, numpy.ndarray]) – likelihood ratio test statistic
- Return type
Union[float, numpy.ndarray]
- pv_sim(nsim=100000, seed=420)[source]¶
Runs “nsim” simulations of the test statistic IF the likelihood of the alternative model is greater than that of the null model, i.e. self.model_lik[‘alteqnull’] is False. returns a dictionary with the simulations and empirical p-value. The empirical p-value is 1. if self.model_lik[‘alteqnull’] is True, and the array of simulated test statistics will be empty.
- Parameters
nsim (int) – Number of simulated test statistics to return
seed (int) – numpy random seed to use for simulations
- Returns
Dictionary with containing the simulated test statistics, and empirical p-value
- Return type
dict
- Output dictionary:
“res”: array of simulated test statistics “pv”: empirical p-value of current alternative model
- pv_sim_chi2(nsim=300, nsim0=100000, simzero=True, method='QR', seed=420, qmax=0.1)[source]¶
Calculate a gene-specific p-value by simulating nsim test statistics and estimating the mixture distribution as in Zhou 2016.
The chi2 distribution can either be
- Parameters
nsim (int) – Number of simulations for the (R)LRT test statistic
nsim0 (int) – Number of simulations to estimate point-mass at 0
method (str) – Method used to fit continuous part of the the chi2 mixure distribution. ‘MM’ -> moment matching, ‘ML’ -> maximum likelihood or ‘QR’ -> Quantile regression
qmax (float) – Use only the values in the top qmax quantile to fit the chi2 distribution. This only affects the QR-method
simzero (bool) – Whether to separately run simulations for the estimation of the mixture parameter (default: True). This makes sense if nsim0 >> nsim
- Returns
dictionary similar to that returned by pv_sim(), but with additional slots for estimated dof, scale and mixture parameters for the chi2 distribution
- seak.lrt.RLRTSim(X, Z, Xdagger, sqrtSigma=None, lambda0=nan, seed=2020, nsim=10000, use_approx=0, log_grid_hi=8, log_grid_lo=- 10, gridlength=200)[source]¶
Python port of the LRTSim function using scipy & numpy see https://cran.r-project.org/web/packages/RLRsim/RLRsim.pdf for details, also Crainiceanu, C. and Ruppert, D. (2004) Likelihood ratio tests in linear mixed models with one variance component
- Parameters
X (numpy.ndarray) – The fixed effects design matrix of the model under the alternative
Z (numpy.ndarray) – The random effects design matrix of the model under the alternative (i.e. G_1)
Xdagger (numpy.ndarray) – Moore-Penrose pseudo-inverse of X
sqrtSigma (numpy.ndarray) – The upper triangular Cholesky factor of the correlation matrix of the random effect
lambda0 (numpy.ndarray) –
seed – numpy random seed to use
nsim (int) – Number of values to simulate
use_approx (int) – Not implemented (yet)
log_grid_hi – Higher value of the grid on the log scale.
log_grid_lo – Lower value of the grid on the log scale.
gridlength (int) – Length of the grid for the grid search over lambda
- Returns
A dictionary containing the simulations and other parameters
- Return type
dict
- Output dictionary:
- “res”: np.array of simulated test statistics
…: parameters and intermediate variables used to generate simulated test statistics
- class seak.lrt.chi2mixture(lrt, mixture=None, tol=0.0, scalemin=0.1, scalemax=5.0, dofmin=0.1, dofmax=5.0, qmax=None, alteqnull=None, abserr=None, fitdof=None, dof=None)[source]¶
Bases:
objectmixture here denotes the weight on the non-zero dof compnent
- abserr¶
- alteqnull¶
- dof¶
- dofmax¶
- dofmin¶
- fit_params_Qreg()[source]¶
Fit the scale and dof parameters of the model by minimizing the squared error between the model log quantiles and the log P-values obtained on the lrt values. Only the top qmax quantile is being used for the fit (self.qmax is used in fit_scale_logP).
- fitdof¶
- imax¶
- isortlrt¶
- lrt¶
- lrtsort¶
- mixture¶
- qmax¶
- qnulllrtsort¶
- scale¶
- scalemax¶
- scalemin¶
- tol¶
- seak.lrt.estimate_pointmass0_null(mu, n, p, nsim=100000, xi=None, seed=None, REML=True)[source]¶
The point mass of the null distribution of the (R)LRT at 0 can be approximated by the probability of the (R)LRT test statistic having a local maximum at 0.
- See Crainiceanu, Ruppert: Likelihood ratio tests in linear mixed models
with one variance component (2003), Equations (8) & (9)
- and Zhou: Boosting Gene Mapping Power and Efficiency with Efficient Exact Variance
Component Tests of Single Nucleotide Polymorphism Sets (2016)
- Parameters
mu (numpy.ndarray) – Eigenvalues of the Genotype matrix with covariates projected out
n (float) – Number of individuals
p (float) – Number of fixed effect parameters (covariates)
nsim (float) – Number of simulations to perform (default: 100,000)
xi (numpy.ndarray) – Eigenvalues of the Genotype matrix, if REML == False
int – numpy seed
REML – use REML? (default: True)
- Returns
The probability of having point-mass at null
- Return type
float
- seak.lrt.fit_chi2_maximum_likelihood(sims)[source]¶
Fits a chi2 distribution using Maximum Likelihood
- Parameters
sims – array of LRT test statistics (continuous part of the distribution)
- Returns
Dictionary of distribution parameters estimated with Maximum Likelihood (ML)
- seak.lrt.fit_chi2_moment_matching(sims)[source]¶
Fits a chi2 distribution using moment matching (as in Zhou 2016)
- Parameters
sims – array of LRT test statistics (continuous part of the distribution)
- Returns
Dictionary of distribution parameters estimated with Moment Matching (MM)
- seak.lrt.fit_chi2mixture(sims, mixture=None, qmax=0.1)[source]¶
Takes simulated test statistics as input. Fits a chi2 mixture to them using “quantile regression” (Listgarten 2013). returns a dictionary. When mixture parameter is given it only fits scale and dof.
- Parameters
sims (numpy.ndarray) – array of tests statistics
mixture (float) – mixture weight
qmax (float) – fraction of largest test statistics to fit the distribution with
- Returns
Dictionary with chi2 mixture parameters
- Return type
dict
- Output dictionary:
“mse” : mean squared error “dof” : degrees of freedom “scale” : scale “imax” : number of values used to fit the ditribution
- seak.lrt.make_lambdagrid(lambda0, gridlength, log_grid_lo, log_grid_hi)[source]¶
Port of the R-function to create the lambda grid (used inside RLRTsim).
Creates a grid of lambda values to optimize over.
- seak.lrt.pv_chi2mixture(stat, scale, dof, mixture, alteqnull=None)[source]¶
Returns p-values(s) using chi2 with custom parameters.
wraps chi2.sf(stat/scale, dof) * mixture
- Parameters
stat (Union[float, np.array]) – The LRT test statistic
scale (float) – scaling factor applied to the test statistic
dof (float) – degrees of freedom
mixture (float) – mixture weight
- Returns
p-value
- Return type
float
- seak.lrt.rlrsim(p, k, n, nsim, g, q, mu, lambd, lambda0, xi=None, REML=True)[source]¶
Python port of the RLRsim.cpp function using scipy & numpy
This version uses broadcasting to speed up operations. It uses a lot of memory for large nsim, but is way faster than rlrsim_loop()
- Parameters
p (int) – number of covariates in X
k (int) – number of variables to test
n (int) – number of individuals
nsim (int) – number of sumulations to perform
g (int) – lambda grid length (not used anymore…)
q (int) – number of free parameters in the null-model (?)
mu (numpy.ndarray) – array of eigenvalues
lambd (numpy.ndarray) – array of lambda values
lambda0 (float) – lambda0 value
REML (bool) – perform REML (default=True)
- Xi
?
- seak.lrt.rlrsim_loop(p, k, n, nsim, g, q, mu, lambd, lambda0, xi=None, REML=True)[source]¶
Python port of the RLRsim.cpp function using scipy & numpy
This is a “naive” port with the same loops used in the C++ code It’s recommended to use rlrsim() instead, as it’s much faster. This function is mainly here for documentation purposes.
See
rlrsim()
The scoretest module¶
Contains classes for variance component score tests.
A single kernel and two-kernel score-based test is available for a linear link function (continuous phenotypes) (ScoretestNoK and Scoretest2K).
With the second kernel \(K_0\) correcting for population structure.
A single kernel score-based test is available for a logistic link function (binary phenotypes) (ScoretestLogit)
Null model single kernel:
Alternative model single kernel:
Null model two-kernel:
Alternative model two-kernel:
With: \(X\): covariates, dimension \(nxc\) (\(n:=\) number of individuals, \(c:=\) number of covariates), \(G_0\): variants to construct the background kernel \(K_0\) from, correcting for population structure, dimensions \(nxm\) (\(n:=\) number of individuals, \(m:=\) number of variants/variants), \(G_1\): set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for).
Note
For all classes of the module scoretest the class attributes are instance attributes! This is a bug in the automatic documentation.
Note
The source code of the mathematical implementation is adopted from the FaST-LMM Python package from Microsoft Corporation.
Source code can be found in fastlmm.association.score.
- exception seak.scoretest.DaviesError[source]¶
Bases:
Exceptionraise this error if Davie’s method does not converge
- class seak.scoretest.Scoretest(phenotypes, covariates)[source]¶
Bases:
objectSuperclass for all score-based set-association tests.
Defines all common attributes and implements methods that all subclasses share, such as methods for p-value computation for alternative
pv_alt_model()model.Interface for Scoretest subclasses.
- Variables
Y (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)
X (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)
N (int) – number of individuals
P (int) – number of phenotypes
D (int) – number of covariates
Neff (int) – effective sample size (number of individuals) as used in calculations of unbiased estimators
- D¶
- N¶
- Neff¶
- P¶
- X¶
- Y¶
- static has_bias(covariates)[source]¶
Checks whether at least one all constant column (bias) is present in the covariates.
- pv_alt_model(G1, G2=None, method='davies')[source]¶
Computes p-value of the alternative model.
- Parameters
G1 (numpy.ndarray) – set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants to test association for)
G2 (numpy.ndarray) – set of variants to condition on (\(n:=\) number of individuals, \(k:=\) number of variants to condition on)
- Returns
p-value of the alternative model
- Return type
float
- class seak.scoretest.Scoretest2K(phenotypes, covariates=None, K0=None, G0=None, forcefullrank=False)[source]¶
Bases:
seak.scoretest.ScoretestTwo-kernel score-based set-association test for continuous phenotypes.
Sets up null model for given phenotypes, covariates and background kernel \(K_0\) or background genotypes \(G_0\). If no covariates are given or if the covariates do not contain a bias column, a bias column is added. Compute p-value for alternative model with
pv_alt_model().Null model two-kernel:
\[Y = {\alpha} X + {\beta} G_0 + {\epsilon}\]Alternative model two-kernel:
\[Y = {\alpha} X + {\beta} G_0 + {\gamma} G_1 + {\epsilon}\]With: \(X\): covariates, dimension \(nxc\) (\(n:=\) number of individuals, \(c:=\) number of covariates) \(G_0\): variants to construct the background kernel \(K_0\) from, correcting for population structure, dimensions \(nxm\) (\(n:=\) number of individuals, \(m:=\) number of variants/variants) \(G_1\): set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for)
- Parameters
phenotypes (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)
covariates (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)
K0 (numpy.ndarray) – genetic similarity matrix/GRM \(K_0\) accounting for confounding
G0 (numpy.ndarray) – genotype matrix \(G_0\) used to contruct math:K_0 to account for confounding
forcefullrank (boolean) – for testing purposes only
- Variables
S – eigenvalues of PKP
Xdagger – Moore-Penrose pseudo-inverse of the covariate matrix X
sigma2e – environmental variance
sigma2g – genetic variance
- G0¶
- K0¶
- S¶
- U¶
- UUY¶
- UY¶
- Xdagger¶
- YUUY¶
- lowrank¶
- optparams¶
- sigma2e¶
- sigma2g¶
- class seak.scoretest.ScoretestLogit(phenotypes, covariates)[source]¶
Bases:
seak.scoretest.ScoretestSingle kernel score-based set-association test for binary phenotypes.
Sets up null model for given phenotypes and covariates. If no covariates are given or if the covariates do not contain a bias column, a bias column is added. Compute p-value for alternative model with
pv_alt_model().- Parameters
phenotypes (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)
covariates (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)
- VX¶
- pY¶
- pinvVX¶
- stdY¶
- class seak.scoretest.ScoretestNoK(phenotypes, covariates)[source]¶
Bases:
seak.scoretest.ScoretestSingle kernel score-based set-association test for continuous phenotypes.
Sets up null model for given phenotypes and covariates. If no covariates are given or if the covariates do not contain a bias column, a bias column is added. Compute p-value for alternative model with
pv_alt_model().Null model single kernel:
\[Y = {\alpha} X + {\epsilon}\]Alternative model single kernel:
\[Y = {\alpha} X + {\gamma} G_1 + {\epsilon}\]With: \(X\): covariates, dimension \(nxc\) (\(n:=\) number of individuals, \(c:=\) number of covariates) \(G_1\): set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for)
- Parameters
phenotypes (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)
covariates (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)
- Variables
RxY – OLS residuals of the phenotype after regressing out fixed effects (covariates X)
Xdagger – Moore-Penrose pseudo-inverse of the covariate matrix X
sigma2 – environmental variance
- RxY¶
- Xdagger¶
- coef(G1)[source]¶
Returns single-variant regression coefficients and their estimated variances
- Parameters
G1 (numpy.ndarray) – N x 1 vector
- Return dict
dictionary with two slots: “beta” and “var_beta”
- sigma2¶