12. Scientific background
The scientific background should be a short introduction to the available literature and the scientific gap that the current proposal is going to fill. Please note that the maximum number of words for this section is 1500.
Systematic collection of patient and population cohorts provides an opportunity for data-driven discovery of environmental and genetic predisposing conditions, and for disease subtyping[1,2]. Unsupervised machine learning (ML) methods can identify complex, multifactorial relationships among variables, termed latent factors. The loadings onto such factors for any individual can be interpreted as a latent phenotype, which can be used to improve the power of genetic association studies, both by reducing the noise in the observed phenotype and by disentangling pleiotropic phenotypes[3].
So far, however, such models have had a number of shortcomings. First, the rich longitudinal aspect of many data sets is generally ignored, even though the age at which a disease occurs, or fails to occur, carries information. Neither healthy children nor elderly people with common diseases are very informative about relevant risk factors; conversely, the lifestyle or genetic profile of children with, say, type 2 diabetes, or of healthy centenarians, can be very informative about determinants of disease. Second, a "kitchen sink" approach is often used, in which a range of standard ML approaches are tested and the best is chosen based on cross-validation and hold-out data[4]. While this approach is sound in data-rich environments, patient and population data are expensive to collect, so efficient use of the data is desirable. Third, missing data is often a significant issue, and existing methods are not geared towards high rates of missingness.