Overview

seak, which stands for sequence annotations in kernel-based tests, is an open-source Python software package for performing set-based genotype-phenotype association tests. It allows for the flexible incorporation of prior knowledge, such as variant effect predictions or other annotations, into these association tests.

The mathematical implementation of these tests is based on FaST-LMM-Set [1] [2].

Two types of association tests are available: the score test (seak.scoretest) and the likelihood ratio test (LRT, seak.lrt). While the score test is computationally more efficient, the LRT has potentially higher power [1].

The score test is available for continuous (seak.scoretest.ScoretestNoK) and binary phenotypes (seak.scoretest.ScoretestLogit), and can correct for (cryptic) relatedness and population stratification using a two random effects model (seak.scoretest.Scoretest2K, continuous phenotypes only). P-values are calculated using either Davies' exact method [3] or a saddlepoint approximation [4].
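To make the idea behind the score test concrete, here is a minimal NumPy sketch of a variance-component score test for a continuous phenotype with a linear kernel and no relatedness correction. It is an illustration under stated assumptions, not seak's API: the function name score_test_linear_kernel and its arguments are hypothetical, and the p-value is obtained by Monte Carlo simulation of the null distribution rather than by Davies' method or the saddlepoint approximation that seak uses.

```python
import numpy as np


def score_test_linear_kernel(y, X, G, n_sim=20000, seed=0):
    """Illustrative score test with a linear kernel K = G G^T (hypothetical helper).

    y: (n,) continuous phenotype
    X: (n, d) covariates, including an intercept column
    G: (n, m) variants in the tested set (optionally pre-weighted by annotations)
    Returns (Q, p_value); the p-value is a Monte Carlo approximation.
    """
    n, d = X.shape

    # Null model: ordinary least squares of y on the covariates only.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - d)

    # Score statistic Q = r^T K r / sigma^2, computed without forming K.
    Q = np.sum((G.T @ resid) ** 2) / sigma2

    # Under the null, Q is approximately a weighted sum of chi^2_1 variables.
    # The weights are the non-zero eigenvalues of G^T P G, where P projects
    # out the covariates; G_res = P G is obtained by regressing G on X.
    G_res = G - X @ np.linalg.lstsq(X, G, rcond=None)[0]
    weights = np.linalg.eigvalsh(G_res.T @ G_res)
    weights = weights[weights > 1e-10]

    # Assumption: Monte Carlo stand-in for Davies' / saddlepoint p-values.
    rng = np.random.default_rng(seed)
    null_draws = rng.chisquare(1, size=(n_sim, weights.size)) @ weights
    p_value = (1 + np.sum(null_draws >= Q)) / (1 + n_sim)
    return Q, p_value


# Usage on simulated data (all values below are made up for illustration).
rng = np.random.default_rng(42)
n_samples = 300
X_demo = np.column_stack([np.ones(n_samples), rng.normal(size=n_samples)])
G_demo = rng.binomial(2, 0.2, size=(n_samples, 8)).astype(float)
y_demo = X_demo @ np.array([0.5, 1.0]) + rng.normal(size=n_samples)
Q_demo, p_demo = score_test_linear_kernel(y_demo, X_demo, G_demo)
print(f"Q = {Q_demo:.3f}, Monte Carlo p-value = {p_demo:.4f}")
```

Because the smallest p-value a Monte Carlo estimate can report is 1 / (n_sim + 1), this approach is only useful for building intuition; exact or saddlepoint methods are needed at genome-wide significance thresholds.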

The LRT is implemented for continuous phenotypes (seak.lrt.LRTnoK). LRT test statistics can be sampled using the fast implementations described in [5].

To maximize flexibility, seak provides interfaces for data loading (seak.data_loaders), so users can easily adapt the package to the input data types of their choice.

Free software: Apache Software License 2.0

Installation

The installation of seak requires Python 3.7+ and the packages numpy and cython. All other dependencies are installed automatically when installing the package.

Clone the repository. Then, on the command line:

pip install -e ./seak

Documentation

For a reference to all public modules in seak intended for general use, see: API reference.

Tutorial

A small example illustrating how to perform score- and likelihood ratio tests is shown in: Tutorial.

A pipeline using seak to perform functionally informed association tests on UK Biobank data is available here.

References

For more information on FaST-LMM visit FaST-LMM.

1: Jennifer Listgarten, Christoph Lippert, Eun Yong Kang, Jing Xiang, Carl M. Kadie, and David Heckerman. A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics, 29(12):1526–1533, 2013. URL: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btt177, doi:10.1093/bioinformatics/btt177.
2: Christoph Lippert, Jing Xiang, Danilo Horta, Christian Widmer, Carl Kadie, David Heckerman, and Jennifer Listgarten. Greater power and computational efficiency for kernel-based association testing of sets of genetic variants. Bioinformatics, 30(22):3206–3214, 2014. URL: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btu504, doi:10.1093/bioinformatics/btu504.
3: Robert B. Davies. Algorithm AS 155: the distribution of a linear combination of χ² random variables. Applied Statistics, pages 323–333, 1980.
4: Diego Kuonen. Miscellanea. Saddlepoint approximations for distributions of quadratic forms in normal variables. Biometrika, 86(4):929–935, 1999.
5: Fabian Scheipl, Sonja Greven, and Helmut Kuechenhoff. Size and power of tests for a zero random effect variance or polynomial regression in additive and linear mixed models. Computational Statistics & Data Analysis, 52(7):3283–3299, 2008.