seak

The construct_background_kernel module

Computes a background kernel matrix \(K_0\) or respective genotype matrix \(G_0\) from binary PLINK 1 files (GRMLoader) or loads a precomputed matrix \(K_0\) from a NumPy file (GRM_from_file).

Needed, if a linear mixed model is used to correct for confounding. In low rank case (\(n>k\)): loads a genotype matrix \(G_0\). In full rank case (\(n<k\)): computes background kernel \(K_0\). With \(n:=\) number of individuals and \(k:=\) number of SNVs.

GRM: genetic relatedness matrix.

class seak.construct_background_kernel.GRMLoaderSnpReader(path_or_bed, blocksize, LOCO_chrom_id=None, forcelowrank=False)[source]

Bases: object

Constructs a background kernel \(K_0\) from given binary PLINK 1 genotype files using the leave-one-out-chromosome (LOCO) strategy.

Initially no background kernel is constructed, only the instance attributes are initialized. The kernel gets constructed when calling the method compute_background_kernel() which should only be called after calling the update_ind() method manually or the seak.data_loaders.intersect_and_update_datasets(). This way, individuals which are neither contained in the test nor in the background kernel data set are excluded.

In full rank case, loads the SNPs in blocks to construct the kernel. In low rank case, loads all SNPs into memory at once.

Parameters:
  • path_to_plink_files_with_prefix (str) – path prefix to genotype PLINK files for background kernel construction

  • blocksize (int) – how many genotypes to load at once; should be chosen dependent on RAM available

  • LOCO_chrom_id (str/int) – identifier of the chromosome/region that is used in the respective test set and should be excluded from the background kernel or None if all variants should be included

  • forcelowrank (bool) – enforce low rank data loading behavior for testing purposes

Note

The leave-one-chromosome-out (LOCO) strategy can be disabled with LOCO_chrom_id.

compute_background_kernel()[source]

Computes background kernel \(K_0\) for given set of genotypes (binary PLINK 1 files).

Overlap with data of set to be tested should have been carried out before, such that individuals in both data sets match. Does not return anything but sets instance attributes for either the background kernel \(K_0\) or the background kernel genotype matrix \(G_0\).

get_iids()[source]

Returns all individual ids. :return: :rtype: numpy.ndarray

update_individuals(iids)[source]

Sets individuals to include into the background kernel data set based on individual ids (iids).

Parameters:

iids – numpy.Series of individual ids that should be retained for background kernel computation

write_kernel(path, filetype='hdf5')[source]

Write constructed background kernel \(K_0\) to file, using eihter pysnptools.kernelreader.KernelHdf5 or pysnptools.kernelreader.KernelNpz.

Parameters:
  • path (str) – Path to the output file to be created.

  • filetype (str) – Either ‘hdf5’ or ‘npz’

class seak.construct_background_kernel.GRMLoader_from_file(path_to_GRM_npy, forcelowrank=False)[source]

Bases: object

Load a precomputed GRM from a NumPy file (.npy or .npz).

Must either be of low rank \(n>k\) or a square \(nxn\) matrix with \(n:=\) number of individuals and \(k:=\) number of SNVs.

The data_loaders module

Contains all data IO and processing functionalities to read and manipulate genotype, covariate and variant effect prediction data.

Implements interfaces for all data types such that you can write your custom data loading classes (VariantLoader, AnnotationLoader, CovariatesLoader).

Accepted input file format for genotype data is the binary PLINK 1 format. Accepted input file format for covariate and phenotype data is the CSV format. Accepted input file format for variant effect predictions is the HDF5 format, following the syntax of the output of Janggu.predict_variant_effect() (Janggu documentation) which needs to be accompanied by a index file in TSV format.

Note

The program assumes that genotype files are 1-based and fully closed, and variant effect prediction files 0-based and half-open.

class seak.data_loaders.AnnotationLoader[source]

Bases: object

Interface for loading variant annotations that implements data preprocessing methods.

anno_by_id(vids)[source]

Retrieves annotations for a set of variant ids (vids).

The order will match that of vids.

Parameters:

vids (pandas.Index) – names of variants to retrieve annotations for

Returns:

variant effect predictions for given variants

Return type:

numpy.ndarray

Raises:

NotImplementedError – Interface.

get_vids()[source]

Returns all variant ids.

Returns:

Returns all variant ids.

Return type:

pandas.Index

Raises:

NotImplementedError – Interface.

update_variants(vids=None, coordinates=None, exclude=False)[source]

Set variants to include/exclude based on variant ids (vids) or genetic coordinates (coordinates).

When exclude == False (default), this also changes the order to match the order of vids.

Parameters:
  • vids (pandas.Index) – Variant ids to include/exclude

  • coordinates (dict) – a dictionary of genomic coordinates {“chr”: str, “start”: int, “end”: int, “name”: str}

Raises:

NotImplementedError – Interface.

class seak.data_loaders.BEDRegionLoader(path_to_regions_UCSC_BED, chrom_to_load=None, drop_non_numeric_chromosomes=False)[source]

Bases: RegionLoader

Reads all regions in a UCSC BED-formatted file into memory.

Variables:

regions (pandas.DataFrame) – DataFrame storing regions in memory {“chr”: str, “start”: int, “end”: int, “name”: str}

regions
class seak.data_loaders.CovariatesLoader[source]

Bases: object

Stores covariates (fixed effects) and phenotypes.

get_iids()[source]

Returns all individual ids.

Returns:

Returns all individual ids.

Return type:

pandas.Index

update_individuals(iids, exclude=False)[source]

Set individuals to include/exclude.

When exclude == False (default), this also changes the order to match the order in iids.

Parameters:

iids (pandas.Index) – Individual ids to include/exclude

class seak.data_loaders.CovariatesLoaderCSV(phenotype_of_interest, path_to_covariates, covariate_column_names, sep=',', path_to_phenotypes=None)[source]

Bases: CovariatesLoader

Stores covariates (fixed effects) and phenotypes.

Initializes instance of a CovariatesLoader class from a CSV file.

Parameters:
  • phenotype_of_interest (str) – column name of phenotype that should be loaded

  • path_to_covariates (str) – path to CSV file with phenotype (if path_to_phenotypes is None) and covariates

  • covariate_column_names (list(str)) – list of the covariate column names

  • path_to_phenotypes (str) – path to CSV file with phenotype (if stored separately from covariates)

cov
covariate_column_names
get_iids()[source]

Returns all individual ids.

Returns:

Return type:

pandas.Index

get_one_hot_covariates_and_phenotype(test_type, pheno=None)[source]

Returns numpy.ndarray of the phenotype and one hot encoded covariates with invariant covariates removed and a bias column.

Make sure update_cov() was called beforehand, such that self.cov has the same order as the genotypes. Internally calls function pandas.get_dummies() which converts categorical variable into indicator variables.

Parameters:

test_type (string) – Either ‘logit’, ‘2K’ or ‘noK’

Returns:

data for phenotype and one hot encoded covariates with invariant covariates removed and bias column

Return type:

numpy.ndarray

phenotype_of_interest
update_individuals(iids, exclude=False)[source]

Set individuals to include/exclude.

When exclude == False (default), this also changes the order to match the order in iids.

Parameters:

iids (pandas.Index) – Individual ids to include/exclude

class seak.data_loaders.EnsemblVEPLoader(uploaded_variation, location, gene, allele=None, data=None)[source]

Bases: AnnotationLoader

Creates an AnnotationLoader from outputs columns of the Ensembl Variant Effect Predictor (VEP).

Specifically the columns ‘Uploaded_variation’, ‘Location’, ‘Allele’ and ‘Gene’. Additionally will take a numpy array or pandas dataframe with effects for those variants.

The format of the columns are described on: https://www.ensembl.org/info/docs/tools/vep/vep_formats.html#defaultout

Uploaded variation - as chromosome_start_alleles Location - in standard coordinate format (chr:start or chr:start-end) Allele - the variant allele used to calculate the consequence Gene - Ensembl stable ID of affected gene

Positions in “Location” are one-based. “Uploaded_variation” is used as the variant ID.

Parameters:
  • uploaded_variation – Series containing “Uploaded variation”

  • location – Series containing “Location”

  • allele – Series containing “Allele”

  • gene – Series containing “Gene”

  • data – DataFrame or ndarray containing the variant effect predictions

anno_by_id(vids=None, gene=None)[source]

Retrieves gene-specific variant effect predictions based on variant IDs (vids) and gene

Parameters:
  • vids – variant ids

  • gene – gene ids

  • default – if not None, default value to return when no variant effect predictions are found for a requested variant.

anno_by_interval(coordinates, gene=None)[source]

Retrieves gene-specific variant effect predictions based on position and gene

Parameters:
  • coordinates (dict) – genomic coordinates {“chr”: str, “start”: int, “end”: int}

  • gene – gene id

get_vids()[source]

Returns all variant ids.

Returns:

Returns all variant ids.

Return type:

pandas.Index

Raises:

NotImplementedError – Interface.

pos_df
update_variants(vids=None, coordinates=None, exclude=False)[source]

Set variants to include/exclude based on variant ids (vids) or genetic coordinates (coordinates).

Parameters:
  • vids (pandas.Index) – Variant ids to include/exclude

  • coordinates (dict) – genomic coordinates {“chr”: str, “start”: int, “end”: int} to include/exclude

vep_df
class seak.data_loaders.Hdf5Loader(path_to_vep_bed, path_to_vep_hdf5, hdf5_key, from_janggu=False)[source]

Bases: AnnotationLoader

Loads variant annotations from HDF5 files accompanied by a UCSC BED file or TSV with variant information.

The HDF5 files need to follow the syntax of the output of Janggu.predict_variant_effect() (Janggu documentation). You need to specify a key to the data. Instances of the class still needs to be overlapped with respective genotypes and vice versa. UCSC BED files/or TSV require first three fields: chrom, chromStart, chromEnd; we also require field name, that needs to match the names of the genotype data. Above that: if TSV and not USCS BED file is given make sure the genomic coordinates are 0-based an half-open.

Parameters:
  • path_to_vep_bed (str) – path to UCSC BED file containing meta info of variant effect predictions

  • path_to_vep_hdf5 (str) – path to HDF5 file containing the variant effect predictions

  • hdf5_key (str) – name of variant effect predict to load

  • from_janggu (bool) – whether janggu was used to generate the veps, if yes, naming convention is presupposed

anno_by_id(vids, shuffle=None)[source]

Retrieves annotations for a set of variant ids (vids).

The order will match that of vids.

Parameters:
  • vids (pandas.Index) – names of variants to retrieve annotations for

  • shuffle (int) – For testing purpose only; default None; shuffles loaded veps along axis 0 or 1

Returns:

variant effect predictions for given variants

Return type:

numpy.ndarray

from_janggu
get_vids()[source]

Returns all variant ids.

Returns:

Return type:

pandas.Index

set_mask(mask)[source]

Sets a boolean mask along the annotation-axis to be applied when loading data. Can be used to sub-set annotations.

Parameters:

mask (np.ndarray) – boolean mask to be applied to the columns (second dimension) of the annotations loaded.

update_variants(vids=None, coordinates=None, exclude=False)[source]

Set variants to include/exclude based on variant ids (vids) or genetic coordinates (coordinates).

When exclude == False (default), this also changes the order to match the order in vids. Make sure that the vids are in the order of the veps in the respective index file if working with HDF5 files. –> Reason: HDF5 files can only be accessed indexing elements in increasing order!

Parameters:
  • vids (pandas.Index) – Variant ids to include/exclude

  • coordinates (dict) – genomic coordinates {“chr”: str, “start”: int, “end”: int} to include/exclude

veps
veps_index_df
class seak.data_loaders.RegionLoader[source]

Bases: object

Interface for loading genomic regions whose subclasses can be used to group variants into sets to test jointly.

Subclasses need only implement __iter__, which yields 0-based, right-open (->left-closed) dictionaries of genomic intervals {“chr”: str, “start”: int, “end”: int, “name”: str}

class seak.data_loaders.VariantLoader[source]

Bases: object

Interface for loading genotype data for variants that implements data preprocessing methods.

static center(X, inplace=False)[source]

Mean centers genotype values, excluding missing values from computation.

Parameters:

X (numpy.ndarray) – 2D array with dimensions \(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs.

Returns:

mean-centered array

Return type:

numpy.ndarray

genotypes_by_id(vids)[source]

Retrieves genotypes for a set of variant ids (vids).

The ordering of the individuals (rows, axis 0) matches that returned by get_iids(). The ordering of the variants (columns, axis 1) matches that returned by get_vids(). To change the default ordering of variants use update_individuals().

Parameters:

vids (pandas.Index) – one or more variant ids to retrieve genotypes for

Returns:

tuple of genotypes for region of interest (numpy.ndarray (\(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs)); and pandas.DataFrame with names/vids of respective genotypes as index

Return type:

(numpy.ndarray, pandas.Index)

Raises:

NotImplementedError: Interface.

genotypes_by_region(coordinates)[source]

Given a region of interest returns the variants that lie in the respective region.

The order of the individuals (rows, axis 0) matches that returned by get_iids(). The order of the variants (columns, axis 1) matches that returned by get_vids(). To change the default order of individuals or genotypes use update_individuals() or update_variants(), respectively.

Parameters:

coordinates (dict) – a dictionary of genomic coordinates {“chr”: str, “start”: int, “end”: int}

Returns:

tuple of genotypes for region of interest (numpy.ndarray (\(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs)); and pandas.DataFrame with names/vids of respective genotypes as index

Return type:

(numpy.ndarray, dict, pandas.Index)

Raises:

NotImplementedError: Interface.

get_iids()[source]

Returns all individual ids.

Returns:

Returns all individual ids.

Return type:

np.ndarray

Raises:

NotImplementedError – Interface.

get_vids()[source]

Returns all variant ids.

Returns:

Returns all variant ids.

Return type:

np.ndarray

Raises:

NotImplementedError – Interface.

static mean_imputation(X)[source]

Replaces missing genotype values (np.nan) by the respective mean value for that variant.

Excludes missing values from computation.

Parameters:

X (numpy.ndarray) – 2D array with dimensions \(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs.

Returns:

array with missing values mean imputed along specified axis

Return type:

numpy.ndarray

static preprocess_genotypes(genotypes, vids, impute_mean=True, center=False, scale=False, normalize=False, invert_encoding=True, recode_maf=False, remove_singletons=False, remove_doubletons=False, min_maf=0, max_maf=1)[source]

Filters input variants and respective vids according to selected parameters.

MAF: minor allele frequency.

Parameters:
  • genotypes (numpy.ndarray) – \(n*m\) genotype matrix (with \(n:=\) number of individuals and \(m:=\) number of SNVs)

  • vids (pandas.Series or pandas.DataFrame) – information on the variants to be passed along (typically the variant ids)

  • impute_mean (bool) – whether missing values should be replaced by the mean

  • center (bool) – whether to mean center the data

  • scale (bool) – whether to scale the data by the standard deviation

  • normalize (bool) – whether to mean center and standardize the data by standard deviation

  • invert_encoding (bool) – whether to invert the encoding of the genotypes (pandas_plink which is used for data loading counts major alleles by default, thus by default we invert the encodings, such that the minor allele is counted)

  • recode_maf (bool) – encode genotypes according to sample, minor allele is counted (additive 0, 1, 2)

  • remove_singletons (bool) – whether to remove singleton genotypes (minor allele count 1); by default also removes singletons in case of reverted genotypes for the sample

  • remove_doubletons (bool) – whether to remove doubleton genotypes (minor allele count 2); by default also removes doubletons in case of reverted genotypes for the sample

  • min_maf (float) – variants with MAF smaller than min_maf in the sample are excluded

  • max_maf (float) – variants with MAF larger than max_maf in the sample are excluded

Returns:

filtered genotypes, index of variants that remain after filtering

Return type:

numpy.ndarray, pandas.Index

static scale(X, inplace=False)[source]

Scales the genotype data by standard deviation, excluding missing values from computation.

Parameters:

X (numpy.ndarray) – 2D array wit dimensions \(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs.

Returns:

normalized array

Return type:

numpy.ndarray

static standardize(X, inplace=False)[source]
Z-score normalizes the genotype data, excluding missing values from computation. If inplace == True, performs

operations inplace and returns a reference to X itself.

src/seak/data_loaders.p :param numpy.ndarray X: 2D array with dimensions \(n*m\) with \(n:=\) number of individuals and \(m:=\) number of SNVs.

:: :return: z-score normalized array :rtype: numpy.ndarray

update_individuals(iids, exclude=False)[source]

Set individuals to include/exclude.

When exclude == False (default), this also changes the order to match the order in iids.

Parameters:

iids (pandas.Index) – Individual ids to include/exclude

Raises:

NotImplementedError – Interface.

update_variants(vids=None, coordinates=None, exclude=False)[source]

Set variants to include/exclude based on variant ids (vids) or genetic coordinates (coordinates).

When exclude == False (default), this also changes the order to match the order in vids.

Parameters:
  • vids (pandas.Index) – Variant ids to include/exclude

  • coordinates (dict) – genomic coordinates {“chr”: str, “start”: int, “end”: int} to include/exclude

Raises:

NotImplementedError – Interface.

class seak.data_loaders.VariantLoaderSnpReader(path_or_bed)[source]

Bases: VariantLoader

Loads variants (SNVs) by wrapping SnpReader objects.

Can be initialized from either an SnpReader, or a path to a Plink bed-file.

Has methods to subset the dataset based on variant ids and individual ids. Can retrieve genotypes of a specific region or by variant id.

Parameters:

path_or_bed (str) – SnpReader or path to binary PLINK 1 bed-file

bed
genotypes_by_id(vids, return_pos=False, missing_ok=False)[source]

Retrieves genotypes for a set of variant ids (vids).

The ordering of the individuals (rows, axis 0) matches that of vids. The ordering of the variants (columns, axis 1) matches that returned by get_vids(). To change the default ordering of variants use update_variants().

Parameters:

vids (pandas.Index) – one or more variant ids to retrieve genotypes for

Returns:

tuple of genotypes for region of interest (numpy.ndarray (number of individuals x number of SNVs)); and pandas.DataFrame with names/vids of respective genotypes as index

Return type:

(numpy.ndarray, pandas.Index)

genotypes_by_region(coordinates, return_pos=False)[source]

Given a region of interest returns the variants that lie in the respective region.

The order of the individuals (rows, axis 0) matches that returned by get_iids(). The order of the variants (columns, axis 1) matches that returned by get_vids(). To change the default order of individuals or genotypes use update_individuals() or update_variants(), respectively If no variants lie in the requested region the method returns (None, None).

Parameters:
  • coordinates (dict) – a dictionary of genomic coordinates {“chr”: str, “start”: int, “end”: int}

  • return_pos (bool) – a boolean indicating whether variant positions should be returned

Returns:

tuple of genotypes for region of interest (numpy.ndarray (number of individuals x number of SNVs)); and pandas.Index with names/vids of respective variants; If return_pos is True, a pandas.Series with variant positions.

Return type:

(numpy.ndarray, pandas.Index, pandas.Series)

get_iids()[source]

Returns all individual ids.

Returns:

Return type:

pandas.Index

get_vids()[source]

Returns all variant ids.

Returns:

Return type:

pandas.Index

iid_fid
update_individuals(iids, exclude=False)[source]

Set individuals to include/exclude.

When exclude == False (default), this also changes the order to match the order in iids:.

Parameters:

iids – Individual ids to include/exclude

update_variants(vids=None, coordinates=None, exclude=False)[source]

Set variants to include/exclude based on variant ids (vids) or genetic coordinates (coordinates).

When exclude == False (default), this also changes the order to match the order in vids. Make sure that the vids are in the order of the veps in the respective index file if working with HDF5 files. –> Reason: HDF5 files can only be accessed indexing elements in increasing order!

repeated IDs are ignored.

Parameters:
  • vids – Variant ids to include/exclude

  • coordinates (dict) – genomic coordsrc/seak/data_loaders.pinates {“chr”: str, “start”: int, “end”: int} to include/exclude

seak.data_loaders.intersect_and_update_datasets(test, variantloader, covariateloader, annotationloader=None, grmloader=None)[source]

Creates smallest common subset of class instances provided with respect to variants and individuals.

Modifies instances inplace.

For 2K case also computed background kernel \(K_0\).

Parameters:
  • test (string) – [‘2K’, ‘noK’, ‘logit’]

  • variantloader (VariantLoader) – instance of respective class to be modified

  • covariateloader (CovariatesLoader) – instance of respective class to be modified

  • annotationloader (AnnotationLoader) – instance of respective class to be modified

  • grmloader (GRMLoader) – instance of respective class to be modified

seak.data_loaders.intersect_ids(ar1, ar2)[source]

returns the intersection (common elements) of ar1 and ar2, in the order that they first appear in ar1.

Parameters:
  • ar1 (np.ndarray) – Array of ids

  • ar2 (np.ndarray) – Array of ids

The kernels module

Functions that can incorporate prior knowledge into set-association tests.

Any function that takes a genotype matrix G, variant effects V, and optional additional arguments and returns a set of transformed genotypes/variables to be tested.

class seak.kernels.LocalCollapsing(distance_threshold=100.0)[source]

Bases: object

Class for local collapsing of GV, based on a set of positions

Parameters:

distance_threshold (float) – maximum distance allowed between SNPs within the same cluster

collapse(G, pos, weights=1.0)[source]

Collapses G locally, based on pos

Parameters:
  • G (np.ndarray) – Matrix containing the values to be collapsed, typically shape = (n_individuals, n_SNP)

  • pos (np.ndarray) – array-like, passed to sklearn.cluster.AgglomerativeClustering typically shape (n_samples, 1) for base pair positional clustering, but supports (n_samples, n_features) or (n_samples, n_samples).

  • weights (np.array) – single variant weights, shape = (n_SNP, ), default: equal weights for all variants

Returns:

Collapsed matrix with shape (n_individuals, n_clusters), and the cluster assignments (n_individuals, )

Return type:

tuple (np.ndarray, np.ndarray)

seak.kernels.diffscore_max(G, V, scale_sqrt=True)[source]

Uses the largest absolute variant effect predictions (veps) per SNV.

Linear weighted kernel.

Parameters:
  • G (numpy.ndarray) – SNVs to be tested (genotypes), \(nxm\) (\(n:=\) number of individuals, \(m:=\) number of SNVs)

  • V (numpy.ndarray) – veps \(mxk\) (\(m:=\) number of SNVs, \(k:=\) number of veps); dimension \(m\) needs to be equal in G and V

Returns:

linear weighted genotype matrix (\(GxV\))

Return type:

numpy.ndarray

seak.kernels.linear(G, V)[source]

Linear kernel, does not scale the genotypes.

seak.kernels.linear_weighted(G, V)[source]

Linear weighted kernel.

Scales the SNVs by the variant effect predictions (veps) corresponding of a 1-dimensional array which gets transformed into a diagonal weight matrix as linear weights.

Parameters:
  • G (numpy.ndarray) – SNVs to be tested (genotypes), \(nxm\) (\(n:=\) number of individuals, \(m:=\) number of SNVs)

  • V (numpy.ndarray) – veps \(mx1\) (\(m:=\) number of SNVs) with dimension 1 and shape: \(m\)

Returns:

linear weighted genotype matrix (\(GxV\))

Return type:

numpy.ndarray

seak.kernels.single_column_kernel(i, scale_sqrt=True)[source]

Returns a kernel function with weights that correspond to a single column of the veps numpy.ndarray.

Parameters:

i (int) – column of which weights should be selected

Returns:

kernel function that linearly weights the genotypes with a single column of veps (e.g. corresponding to a specific transcription factor), needs to be called with genotypes and veps array

Return type:

function

The lrt module

Contains classes and functions for Likelihood ratio tests.

The class LRTnoK implements the set-based likelihood ratio test for continuous phenotypes. A background kernel is not supported.

Null model:

\[y = {\alpha} X + {\epsilon}\]

Alternative model:

\[y = {\alpha} X + {\gamma} G_1 + {\epsilon}\]

With: \(X\): covariates, dimension \(nxc\) (\(n:=\) number of individuals, \(c:=\) number of covariates), \(G_1\): set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for).

class seak.lrt.LRTnoK(X, Y, REML=True)[source]

Bases: object

Single kernel Likelihood Ratio Test (LRT) for continuous phenotypes.

Sets up null model for given phenotypes and covariates. Use altmodel() to fit an alternative model, and calculate the LRT test statistic. Use pv_sim() to simulate test statistics and generate empirical p-values.

Parameters:
  • X (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)

  • Y (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)

  • REML (bool) – use restricted maximum likelihood? (only “True” has been tested)

Variables:
  • X – X

  • Y – Y

  • Xdagger – Moore-Penrose pseudo-inverse of X

  • model0 – dictionary with nLL of the null model

  • model1 – dictionary with nLL and other parameters of the alternative model

altmodel(G1)[source]

Fit alternative model. Returns a dictionary with model parameters and test statistic.

Parameters:

G1 (numpy.ndarray) – \(G_1\): set of variants (i.e. variables) to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for)

Returns:

Dictionary of fitted model parameters

Return type:

dict

Output dictionary: ‘nLL’ : negative log-likelihood ‘sigma2’ : the model variance sigma^2 ‘stat’ : LRT test statistic ‘alteqnull’ : h2 is 0 or negative (i.e. p-value is 1.) ‘beta’ : [D*1] array of fixed effects weights beta ‘h2’ : mixture weight between Covariance and noise ‘REML’ : True: REML was computed, False: ML was computed ‘a2’ : mixture weight between K0 and K1 ‘scale’ : Scale parameter that multiplies the Covariance matrix (default 1.0) ————————————————————————–

pv_5050(lik1=None)[source]

Return p-value(s) calculated assuming a 50/50 mixture of chi2 df0 and chi2 df1. Considered a conservative estimate. If lik1 is None, returns the p-value for the current alternative model.

Parameters:

lik1 (Union[float, numpy.ndarray]) – likelihood ratio test statistic

Return type:

Union[float, numpy.ndarray]

pv_sim(nsim=100000, seed=420)[source]

Runs “nsim” simulations of the test statistic IF the likelihood of the alternative model is greater than that of the null model, i.e. self.model_lik[‘alteqnull’] is False. returns a dictionary with the simulations and empirical p-value. The empirical p-value is 1. if self.model_lik[‘alteqnull’] is True, and the array of simulated test statistics will be empty.

Parameters:
  • nsim (int) – Number of simulated test statistics to return

  • seed (int) – numpy random seed to use for simulations

Returns:

Dictionary with containing the simulated test statistics, and empirical p-value

Return type:

dict

Output dictionary:

“res”: array of simulated test statistics “pv”: empirical p-value of current alternative model

pv_sim_chi2(nsim=300, nsim0=100000, simzero=True, method='QR', seed=420, qmax=0.1)[source]

Calculate a gene-specific p-value by simulating nsim test statistics and estimating the mixture distribution as in Zhou 2016.

Parameters:
  • nsim (int) – Number of simulations for the (R)LRT test statistic

  • nsim0 (int) – Number of simulations to estimate point-mass at 0

  • method (str) – Method used to fit continuous part of the the chi2 mixure distribution. ‘MM’ -> moment matching, ‘ML’ -> maximum likelihood or ‘QR’ -> Quantile regression

  • qmax (float) – Use only the values in the top qmax quantile to fit the chi2 distribution. This only affects the QR-method

  • simzero (bool) – Whether to separately run simulations for the estimation of the mixture parameter (default: True). This makes sense if nsim0 >> nsim

Returns:

dictionary similar to that returned by pv_sim(), but with additional slots for estimated dof, scale and mixture parameters for the chi2 distribution

seak.lrt.RLRTSim(X, Z, Xdagger, sqrtSigma=None, lambda0=nan, seed=2020, nsim=10000, use_approx=0, log_grid_hi=8, log_grid_lo=-10, gridlength=200)[source]

Python port of the LRTSim function using scipy & numpy see https://cran.r-project.org/web/packages/RLRsim/RLRsim.pdf for details, also Crainiceanu, C. and Ruppert, D. (2004) Likelihood ratio tests in linear mixed models with one variance component

Parameters:
  • X (numpy.ndarray) – The fixed effects design matrix of the model under the alternative

  • Z (numpy.ndarray) – The random effects design matrix of the model under the alternative (i.e. G_1)

  • Xdagger (numpy.ndarray) – Moore-Penrose pseudo-inverse of X

  • sqrtSigma (numpy.ndarray) – The upper triangular Cholesky factor of the correlation matrix of the random effect

  • lambda0 (numpy.ndarray) –

  • seed – numpy random seed to use

  • nsim (int) – Number of values to simulate

  • use_approx (int) – Not implemented (yet)

  • log_grid_hi – Higher value of the grid on the log scale.

  • log_grid_lo – Lower value of the grid on the log scale.

  • gridlength (int) – Length of the grid for the grid search over lambda

Returns:

A dictionary containing the simulations and other parameters

Return type:

dict

Output dictionary:
“res”: np.array of simulated test statistics

…: parameters and intermediate variables used to generate simulated test statistics

class seak.lrt.chi2mixture(lrt, mixture=None, tol=0.0, scalemin=0.1, scalemax=5.0, dofmin=0.1, dofmax=5.0, qmax=None, alteqnull=None, abserr=None, fitdof=None, dof=None)[source]

Bases: object

mixture here denotes the weight on the non-zero dof compnent

abserr
alteqnull
dof
dofmax
dofmin
fit_params_Qreg()[source]

Fit the scale and dof parameters of the model by minimizing the squared error between the model log quantiles and the log P-values obtained on the lrt values. Only the top qmax quantile is being used for the fit (self.qmax is used in fit_scale_logP).

fit_scale_logP(dof=None)[source]

Extracts the top qmax lrt values to do the fit.

fitdof
imax
isortlrt
lrt
lrtsort
mixture
qmax
qnulllrtsort
scale
scale_dof_obj(scale, dof)[source]
scalemax
scalemin
tol
seak.lrt.estimate_pointmass0_null(mu, n, p, nsim=100000, xi=None, seed=None, REML=True)[source]

The point mass of the null distribution of the (R)LRT at 0 can be approximated by the probability of the (R)LRT test statistic having a local maximum at 0. See Crainiceanu, Ruppert: Likelihood ratio tests in linear mixed models

with one variance component (2003), Equations (8) & (9)

and Zhou: Boosting Gene Mapping Power and Efficiency with Efficient Exact Variance

Component Tests of Single Nucleotide Polymorphism Sets (2016)

Parameters:
  • mu (numpy.ndarray) – Eigenvalues of the Genotype matrix with covariates projected out

  • n (float) – Number of individuals

  • p (float) – Number of fixed effect parameters (covariates)

  • nsim (float) – Number of simulations to perform (default: 100,000)

  • xi (numpy.ndarray) – Eigenvalues of the Genotype matrix, if REML == False

  • int – numpy seed

  • REML – use REML? (default: True)

Returns:

The probability of having point-mass at null

Return type:

float

seak.lrt.fit_chi2_maximum_likelihood(sims)[source]

Fits a chi2 distribution using Maximum Likelihood. (wraps the sci)

This has not been tested and is not recommended.

Parameters:

sims – array of LRT test statistics (continuous part of the distribution)

Returns:

Dictionary of distribution parameters estimated with Maximum Likelihood (ML)

seak.lrt.fit_chi2_moment_matching(sims)[source]

Fits a chi2 distribution using moment matching (as in Zhou 2016)

This has not been tested and is not recommended.

Parameters:

sims – array of LRT test statistics (continuous part of the distribution)

Returns:

Dictionary of distribution parameters estimated with Moment Matching (MM)

seak.lrt.fit_chi2mixture(sims, mixture=None, qmax=0.1)[source]

Takes simulated test statistics as input. Fits a chi2 mixture to them using “quantile regression” (Listgarten 2013). returns a dictionary.When mixture parameter is given it only fits scale and dof.

This is the preferred method.

Parameters:
  • sims (numpy.ndarray) – array of tests statistics

  • mixture (float) – mixture weight

  • qmax (float) – fraction of largest test statistics to fit the distribution with

Returns:

Dictionary with chi2 mixture parameters

Return type:

dict

Output dictionary:

“mse” : mean squared error “dof” : degrees of freedom “scale” : scale “imax” : number of values used to fit the ditribution

seak.lrt.make_lambdagrid(lambda0, gridlength, log_grid_lo, log_grid_hi)[source]

Port of the R-function to create the lambda grid (used inside RLRTsim).

Creates a grid of lambda values to optimize over.

seak.lrt.pv_chi2mixture(stat, scale, dof, mixture, alteqnull=None)[source]

Returns p-values(s) using chi2 with custom parameters. see fit_chi2mixture() wraps chi2.sf(stat/scale, dof) * mixture :param stat: The LRT test statistic :type stat: Union[float, np.array] :param float scale: scaling factor applied to the test statistic :param float dof: degrees of freedom :param float mixture: mixture weight :return: p-value :rtype: float

seak.lrt.rlrsim(p, k, n, nsim, g, q, mu, lambd, lambda0, xi=None, REML=True)[source]

Python port of the RLRsim.cpp function using scipy & numpy

This version uses broadcasting + cython to speed up operations. It is about 3-10 times faster than the numpy-version (rlrsim_np).

Parameters:
  • p (int) – number of covariates in X

  • k (int) – number of variables to test

  • n (int) – number of individuals

  • nsim (int) – number of sumulations to perform

  • g (int) – lambda grid length

  • q (int) – number of free parameters in the null-model (?)

  • mu (numpy.ndarray) – array of eigenvalues

  • lambd (numpy.ndarray) – array of lambda values

  • lambda0 (float) – lambda0 value

  • REML (bool) – perform REML (default=True)

Xi:

?

seak.lrt.rlrsim_loop(p, k, n, nsim, g, q, mu, lambd, lambda0, xi=None, REML=True)[source]

Python port of the RLRsim.cpp function using scipy & numpy

This is a “naive” port with the same loops used in the C++ code It’s recommended to use rlrsim() instead, as it’s much faster. This function is mainly here for documentation purposes.

See rlrsim()

seak.lrt.rlrsim_np(p, k, n, nsim, g, q, mu, lambd, lambda0, xi=None, REML=True)[source]

Python port of the RLRsim.cpp function using scipy & numpy

This version uses broadcasting to speed up operations. It uses a lot of memory for large nsim, but is way faster than rlrsim_loop()

Parameters:
  • p (int) – number of covariates in X

  • k (int) – number of variables to test

  • n (int) – number of individuals

  • nsim (int) – number of sumulations to perform

  • g (int) – lambda grid length (not used anymore…)

  • q (int) – number of free parameters in the null-model (?)

  • mu (numpy.ndarray) – array of eigenvalues

  • lambd (numpy.ndarray) – array of lambda values

  • lambda0 (float) – lambda0 value

  • REML (bool) – perform REML (default=True)

Xi:

?

The scoretest module

Contains classes for variance component score tests.

A single kernel and two-kernel score-based test is available for a linear link function (continuous phenotypes) (ScoretestNoK and Scoretest2K). With the second kernel \(K_0\) correcting for population structure. A single kernel score-based test is available for a logistic link function (binary phenotypes) (ScoretestLogit)

Null model single kernel:

\[y = {\alpha} X + {\epsilon}\]

Alternative model single kernel:

\[y = {\alpha} X + {\gamma} G_1 + {\epsilon}\]

Null model two-kernel:

\[y = {\alpha} X + {\beta} G_0 + {\epsilon}\]

Alternative model two-kernel:

\[y = {\alpha} X + {\beta} G_0 + {\gamma} G_1 + {\epsilon}\]

With: \(X\): covariates, dimension \(nxc\) (\(n:=\) number of individuals, \(c:=\) number of covariates), \(G_0\): variants to construct the background kernel \(K_0\) from, correcting for population structure, dimensions \(nxm\) (\(n:=\) number of individuals, \(m:=\) number of variants/variants), \(G_1\): set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for).

Note

For all classes of the module scoretest the class attributes are instance attributes! This is a bug in the automatic documentation.

Note

The source code of the mathematical implementation is adopted from the FaST-LMM Python package from Microsoft Corporation. Source code can be found in fastlmm.association.score.

exception seak.scoretest.DaviesError[source]

Bases: Exception

raise this error if Davie’s method does not converge

class seak.scoretest.Scoretest(phenotypes, covariates)[source]

Bases: object

Superclass for all score-based set-association tests.

Defines all common attributes and implements methods that all subclasses share, such as methods for p-value computation for alternative pv_alt_model() model.

Interface for Scoretest subclasses.

Variables:
  • Y (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)

  • X (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)

  • N (int) – number of individuals

  • P (int) – number of phenotypes

  • D (int) – number of covariates

  • Neff (int) – effective sample size (number of individuals) as used in calculations of unbiased estimators

D
N
Neff
P
X
Y
static has_bias(covariates)[source]

Checks whether at least one all constant column (bias) is present in the covariates.

pv_alt_model(G1, G2=None, method='davies')[source]

Computes p-value of the alternative model.

Parameters:
  • G1 (numpy.ndarray) – set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants to test association for)

  • G2 (numpy.ndarray) – set of variants to condition on (\(n:=\) number of individuals, \(k:=\) number of variants to condition on)

Returns:

p-value of the alternative model

Return type:

float

class seak.scoretest.Scoretest2K(phenotypes, covariates=None, K0=None, G0=None, forcefullrank=False)[source]

Bases: Scoretest

Two-kernel score-based set-association test for continuous phenotypes.

Sets up null model for given phenotypes, covariates and background kernel \(K_0\) or background genotypes \(G_0\). If no covariates are given or if the covariates do not contain a bias column, a bias column is added. Compute p-value for alternative model with pv_alt_model().

Null model two-kernel:

\[Y = {\alpha} X + {\beta} G_0 + {\epsilon}\]

Alternative model two-kernel:

\[Y = {\alpha} X + {\beta} G_0 + {\gamma} G_1 + {\epsilon}\]

With: \(X\): covariates, dimension \(nxc\) (\(n:=\) number of individuals, \(c:=\) number of covariates) \(G_0\): variants to construct the background kernel \(K_0\) from, correcting for population structure, dimensions \(nxm\) (\(n:=\) number of individuals, \(m:=\) number of variants/variants) \(G_1\): set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for)

Parameters:
  • phenotypes (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)

  • covariates (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)

  • K0 (numpy.ndarray) – genetic similarity matrix/GRM \(K_0\) accounting for confounding

  • G0 (numpy.ndarray) – genotype matrix \(G_0\) used to contruct math:K_0 to account for confounding

  • forcefullrank (boolean) – for testing purposes only

Variables:
  • S – eigenvalues of PKP

  • Xdagger – Moore-Penrose pseudo-inverse of the covariate matrix X

  • sigma2e – environmental variance

  • sigma2g – genetic variance

G0
K0
S
U
UUY
UY
Xdagger
YUUY
lowrank
optparams
sigma2e
sigma2g
class seak.scoretest.ScoretestLogit(phenotypes, covariates)[source]

Bases: Scoretest

Single kernel score-based set-association test for binary phenotypes.

Sets up null model for given phenotypes and covariates. If no covariates are given or if the covariates do not contain a bias column, a bias column is added. Compute p-value for alternative model with pv_alt_model().

Parameters:
  • phenotypes (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)

  • covariates (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)

VX
pY
pinvVX
stdY
class seak.scoretest.ScoretestNoK(phenotypes, covariates)[source]

Bases: Scoretest

Single kernel score-based set-association test for continuous phenotypes.

Sets up null model for given phenotypes and covariates. If no covariates are given or if the covariates do not contain a bias column, a bias column is added. Compute p-value for alternative model with pv_alt_model().

Null model single kernel:

\[Y = {\alpha} X + {\epsilon}\]

Alternative model single kernel:

\[Y = {\alpha} X + {\gamma} G_1 + {\epsilon}\]

With: \(X\): covariates, dimension \(nxc\) (\(n:=\) number of individuals, \(c:=\) number of covariates) \(G_1\): set of variants to test, dimensions \(nxk\) (\(n:=\) number of individuals, \(k:=\) number of variants in set to test association for)

Parameters:
  • phenotypes (numpy.ndarray) – \(nx1\) matrix containing the phenotype (\(n:=\) number of individuals)

  • covariates (numpy.ndarray) – \(nxc\) matrix containing the covariates (\(n:=\) number of individuals, \(c:=\) number of covariates)

Variables:
  • RxY – OLS residuals of the phenotype after regressing out fixed effects (covariates X)

  • Xdagger – Moore-Penrose pseudo-inverse of the covariate matrix X

  • sigma2 – environmental variance

RxY
Xdagger
coef(G1)[source]

Returns single-variant regression coefficients and their estimated variances

Parameters:

G1 (numpy.ndarray) – N x 1 vector

Return dict:

dictionary with two slots: “beta” and “var_beta”

sigma2