Model implementation


Item summary

This portfolio item highlights elements from the implementation of a large piece of statistical software: a multivariate regression program for high-dimensional count data. It demonstrates my ability to perform statistical model selection, work with high-dimensional data, implement estimation software (non-linear optimization), run hypothesis tests, evaluate those tests with precision-recall curves, and apply the complete multivariate regression software to a scientific problem. What follows is a brief description of the content; more detail can be found in my thesis (which will be linked upon completion of my M.Sc.). Code for this item is available at this link.

This work was done for my M.Sc. in Bioinformatics. Correlation analyses are a popular tool for understanding how microbial communities chemically process their environments. In each of our 112 samples, genetic sequencing is used to approximate the counts of thousands of different kinds of microbes. A recent study (Weiss, S. et al., 2016) demonstrated that popular correlation methods applied to this kind of data suffer from poor precision-recall trade-offs, meaning they often report correlations that do not exist. Poor precision largely precludes the useful application of correlation analysis to understanding microbial processes.

My multivariate count regression software employs a regularization scheme which I show significantly improves the precision-recall trade-off.

The data

Prior to statistical processing, our microbial count data are simply a p-by-112 matrix, where p is the number of taxa. Depending on how we collapse our phylogenetic tree, p can be made arbitrarily large or small. Because we hope to estimate pair-wise correlations (which requires many parameters), we collapse our tree at p = 50.

The problem

Pair-wise correlation estimation for p taxa implicitly imposes a model with O(p²) parameters. For example, when p = 50 we have p(p-1)/2 = 1225 correlations to estimate. This is far too many parameters for 112 samples, and results in over-fitting. In turn, over-fitting reduces our predictive capacity and thus ruins our precision.

The solution

To reduce over-fitting, we approximate our covariance matrix with a factor model. The primary advantage of this modelling choice is that it has only O(mp) parameters, where m is the number of latent factors, yet it still describes O(p²) correlations. I later apply this model with m = 3, leaving p(m+1) = 50 × (3+1) = 200 parameters describing the covariance, significantly fewer than the previous 1225. Because the number of parameters still exceeds the number of samples, it was not initially obvious that the model would perform as well as it did.

[Figure: the factor model covariance structure]
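
As a sketch of the structure shown above (standard factor-model notation; the symbols here are mine, not necessarily the thesis'):

    \Sigma = \Lambda \Lambda^{\top} + \Psi,
    \qquad \Lambda \in \mathbb{R}^{p \times m},
    \qquad \Psi = \operatorname{diag}(\psi_1, \dots, \psi_p)

The loading matrix Λ contributes pm parameters and the diagonal Ψ contributes p more, giving the p(m+1) total used above.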

Gaussian copula

Factor models are designed for Gaussian-distributed data, meaning data that are continuous and bell-curve shaped. Count data, however, are never Gaussian-distributed. A copula is a mathematical tool that allows the multivariate dependence structure and the marginal distributions to be selected separately. Essentially, it provides a modelling convenience at the cost of increased sophistication.
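
For intuition, here is a minimal sketch of sampling from such a model in Python (NumPy/SciPy); the factor values and the negative binomial marginal are illustrative placeholders, and the choice of marginal is the subject of the next section:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    p, m, n_samp = 50, 3, 112             # taxa, latent factors, samples

    # Illustrative factor-structured covariance: Sigma = Lambda Lambda^T + Psi
    Lam = rng.normal(scale=0.5, size=(p, m))
    Psi = np.diag(rng.uniform(0.5, 1.0, size=p))
    Sigma = Lam @ Lam.T + Psi

    # Rescale to a correlation matrix so the copula is identifiable
    d = np.sqrt(np.diag(Sigma))
    R = Sigma / np.outer(d, d)

    # Gaussian copula: latent normals -> uniforms -> count marginals
    Z = rng.multivariate_normal(np.zeros(p), R, size=n_samp)  # latent Gaussians
    U = stats.norm.cdf(Z)                                     # uniforms on (0, 1)
    X = stats.nbinom.ppf(U, n=5, p=0.3).astype(int)           # placeholder marginal
    print(X.shape)                                            # (112, 50) count matrix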

Model selection

While the multivariate structure (the factor model) is largely dictated by estimation efficiency, the use of a copula leaves our univariate count marginals free to be selected from a wide range of candidates. To choose our marginal models, AIC statistics were compared across the following classes of models.

  1. Negative Binomial

  2. Hagmark (Hagmark, P. 2008)

  3. Floor (Continuous variable rounded down)

Of all the models tested, the floor Log T (a log-t variable rounded down) was chosen because its AIC scores showed it to be robust and reliable.
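
To illustrate the comparison, the sketch below fits two of the candidate classes to toy counts by maximum likelihood and compares AIC. This is not the thesis code: the floor Log T pmf follows my reading of the model (a Student-t variable exponentiated, then rounded down), and the data are invented.

    import numpy as np
    from scipy import stats, optimize, special

    def floor_logt_logpmf(k, mu, sigma, df):
        """log P(X = k) for X = floor(exp(mu + sigma*T)), T ~ Student-t(df):
        P(X = k) = F_t((log(k+1) - mu)/sigma) - F_t((log k - mu)/sigma)."""
        k = np.asarray(k, dtype=float)
        with np.errstate(divide="ignore"):        # log(0) -> -inf is intended
            lo = np.log(k)
        hi = np.log(k + 1.0)
        p = stats.t.cdf((hi - mu) / sigma, df) - stats.t.cdf((lo - mu) / sigma, df)
        return np.log(np.maximum(p, 1e-300))

    def aic(neg_loglik, n_params):
        return 2.0 * n_params + 2.0 * neg_loglik

    counts = np.array([0, 1, 0, 3, 12, 2, 0, 7, 1, 25, 4, 0, 2, 9, 1, 0])  # toy taxon

    def nb_nll(theta):                            # negative binomial candidate
        r, q = np.exp(theta[0]), special.expit(theta[1])
        return -stats.nbinom.logpmf(counts, r, q).sum()

    def flt_nll(theta):                           # floor log-t candidate
        mu, sigma, df = theta[0], np.exp(theta[1]), np.exp(theta[2])
        return -floor_logt_logpmf(counts, mu, sigma, df).sum()

    nb  = optimize.minimize(nb_nll,  x0=np.zeros(2),             method="Nelder-Mead")
    flt = optimize.minimize(flt_nll, x0=np.array([1., 0., 1.]),  method="Nelder-Mead")

    print("negative binomial AIC:", aic(nb.fun, 2))
    print("floor Log T AIC:      ", aic(flt.fun, 3))

Since AIC = 2k - 2 log L, the extra parameter of the floor Log T is penalized, so it is preferred only where its fit improves enough to justify it.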

[Figure: the floor Log T univariate count model]

The model

The following is a terse description of the entire model. A thorough description is provided in the thesis.
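
In outline, the pieces described above combine in the standard copula way (a sketch with my own notation; the thesis gives the precise form):

    Z \sim \mathcal{N}_p(0, R), \qquad R \propto \Lambda\Lambda^{\top} + \Psi \;\;\text{(rescaled to unit diagonal)}
    U_j = \Phi(Z_j), \qquad j = 1, \dots, p
    X_j = F_j^{-1}(U_j), \qquad F_j \;\text{the floor Log T CDF with parameters}\; (\mu_j, \sigma_j, \nu_j)

where Φ is the standard normal CDF and F_j^{-1} is the generalized inverse used for count distributions.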

GPU computing

The high-dimensional nature of the problem calls common large-sample-size assumptions into question, effectively invalidating some of the popular statistical theory used in hypothesis testing. To understand the actual sampling distribution of my estimates, I resorted to bootstrapping. Each maximum likelihood estimate is the solution of a non-linear optimization problem, which requires many evaluations of the likelihood function. Worse yet, the size of the statistical model makes each individual estimation slow, so bootstrap replicates must be computed in parallel. The computational burden grows further because the eventual precision-recall analyses are conducted over a full-factorial experimental design. These large computational tasks motivate GPU computing, so the likelihood function was implemented in CUDA.
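
The shape of that computation can be sketched on the CPU side. Below, a Poisson model (whose MLE is simply the sample mean) stands in for the real model, so that the parallel parametric-bootstrap structure is visible without the CUDA machinery:

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def one_replicate(args):
        """One parametric-bootstrap draw: simulate from the fitted model, re-fit."""
        seed, lam_hat, n = args
        rng = np.random.default_rng(seed)
        return rng.poisson(lam_hat, size=n).mean()   # Poisson MLE = sample mean

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        data = rng.poisson(4.0, size=112)      # stand-in for one taxon's counts
        lam_hat = data.mean()                  # the "fit" (closed-form for Poisson)
        jobs = [(s, lam_hat, data.size) for s in range(1000)]
        with ProcessPoolExecutor() as pool:    # replicates are independent
            boots = np.fromiter(pool.map(one_replicate, jobs), dtype=float)
        print(f"MLE = {lam_hat:.3f}, bootstrap s.e. = {boots.std(ddof=1):.3f}")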

GPU programming is constrained by warp divergence: GPU threads execute in batches which should all follow the same execution path. This makes popular numerical code libraries effectively ineligible for GPU use. To meet this challenge, I implemented older, GPU-friendly numerical algorithms; since my software works in single-precision floating point, these older methods were sufficiently accurate. The best example is the calculation of Student-t distribution functions, which are computed through incomplete beta function ratios following the TOMS708 library (Didonato, A. and Morris, A., 1992; Brown, B. and Levy, L., 1994). TOMS708 diverges into many sub-cases in order to achieve numerical accuracy; working in single precision allowed me to limit my code to only three branches, which can be managed as separate code blocks.
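
The identity underlying that computation can be checked in a few lines of Python, with SciPy's regularized incomplete beta standing in for my CUDA implementation:

    import numpy as np
    from scipy import special, stats

    def t_cdf_via_beta(t, df):
        """Student-t CDF through the regularized incomplete beta ratio:
        P(|T| > |t|) = I_x(df/2, 1/2) with x = df / (df + t^2)."""
        t = np.asarray(t, dtype=float)
        x = df / (df + t * t)
        half_tail = 0.5 * special.betainc(0.5 * df, 0.5, x)  # P(T > |t|)
        return np.where(t >= 0, 1.0 - half_tail, half_tail)

    ts = np.linspace(-5.0, 5.0, 11)
    err = np.abs(t_cdf_via_beta(ts, 7.0) - stats.t.cdf(ts, 7.0)).max()
    print(err)   # small: the two computations agree to double precision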

Conclusions

Precision-recall testing showed that my regularized model achieved the initial objectives of this project. The full-factorial experiment showed, however, that the method is more sensitive to model misspecification than competing methods. Further application to real microbial data demonstrated that it is able to extract individual correlations that were expected to exist. Prior to this work, investigating individual correlations in our form of microbial data would have been highly error-prone.

References:

Weiss, S. et al. (2016). Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. ISME Journal, 10(7):1669-1681.

Hagmark, P. (2008). On construction and simulation of count data models. Mathematics and Computers in Simulation, 77: 72-80. 

Brown, B. and Levy, L. (1994). Certification of Algorithm 708: Significant Digit Computation of the Incomplete Beta Function Ratios. ACM Transactions on Mathematical Software, 20(3): 393-397.

Didonato, A. and Morris, A. (1992). Algorithm 708: Significant Digit Computation of the Incomplete Beta Function Ratios. ACM Transactions on Mathematical Software, 18(3): 360-373.