Sheffield Statistical Genomics Workshop 2014

Statistical Genomics Workshop

11th April 2014

Sponsored by the EU FP7 RADIANT Project.


Over the past few years, advances in biotechnology have allowed scientists to gather unprecedented amounts of genome-scale data. At the same time, advances in high-throughput phenotyping and the digitization of clinical charts have allowed a more precise quantification of disease severity. Together, these developments have increased the volume of data being produced, which allows the detection of rare and weak signals but also makes complex models much harder to apply, limiting the insight that can be gained. A second consequence of these advances in technology is the increase in the variety of data sources that are available to study an organism. Analyzing each data type in isolation is very often underpowered compared to an approach that integrates and jointly analyzes different types of data (gene expression, genotype, epigenotype, clinical charts, imaging data).

The focus of this workshop will be on the mathematical and computational aspects of dealing with big data in the biological sciences, both to enable the fitting of more complex models and to extract more insight from available data. In particular, the goal is to present approaches that are able to combine weak evidence from multiple studies and integrate multivariate, high-dimensional data types. Furthermore, while the complexities of the data call for richer and more complex models, these have often found limited applicability in practice because they are considered too complicated, their implementations are inadequate, or their relative performance is hard to assess for lack of benchmarks.


Friday 11th April

8:30-9:00 Arrivals
9:00-9:50 Causal Inference and Statistical Learning [Lecture Slides]
 Bernhard Schölkopf, Max Planck Institute for Intelligent Systems Tübingen
Causal inference is an intriguing field examining causal structures by testing their statistical footprints. The talk introduces the main ideas of causal inference from the point of view of machine learning, and discusses implications of underlying causal structures for popular machine learning scenarios such as covariate shift and semi-supervised learning. It argues that causal knowledge may facilitate some approaches for a given problem, and rule out others.
10:00-10:50 Incorporating covariates in statistical analysis of biological systems [Lecture Slides]
 Tom Thorne, University of Edinburgh
11:00-11:30 Coffee Break
11:30-12:20 Latent variable models to account for heterogeneity between individuals and single cells [Lecture Slides]
 Oliver Stegle, EBI, Cambridge
12:30-13:30 Lunch Break
13:30-14:20 Multi-view learning for genomic data integration and drug sensitivity prediction [Lecture Slides]
 Sami Kaski, Aalto University and University of Helsinki
I will discuss two problems where joint analysis of multiple genomic data sources is needed. The first is drug sensitivity prediction in personalized medicine. This is a supervised learning task which can be addressed by a combination of multi-view and multi-task learning. Alternatively, it can be viewed as a structured prediction task or a recommender system. I will discuss an approach which has recently turned out to be successful, Bayesian kernelized multi-view multi-task methods for predicting sensitivities across drug profiles, and its generalizations to matrix factorization with side information. The second task is unsupervised search of unknown connections between data sources, for which we have introduced Bayesian canonical correlation analysis-based methods, and recently Group Factor Analysis (GFA) which generalizes factor analysis from analysing relationships of univariate variables to analysis of multiple data sources each consisting of multivariate observations.
14:30-15:00 Tea Break
15:00-15:50 Geographic Population Structure (GPS) of worldwide human populations infers biogeographical origin down to home village [Lecture Slides]
 Eran Elhaik, University of Sheffield, UK
The search for a method that utilizes biological information to predict a human's place of origin has occupied scientists for millennia. Modern biogeography methods are accurate to 700 km in Europe but are highly inaccurate elsewhere, particularly in Southeast Asia and Oceania. The accuracy of these methods is bound by the choice of genotyping arrays, the size and quality of the reference dataset, and principal component (PC)-based algorithms. To overcome the first two obstacles, we designed GenoChip, a dedicated genotyping array for genetic anthropology with an unprecedented number of ~12,000 Y-chromosomal and ~3,300 mtDNA SNPs and over 130,000 autosomal and X-chromosomal SNPs carefully chosen to study ancestry without any known health, medical, or phenotypic relevance. We also genotyped 615 individuals from 54 worldwide populations collected as part of the Genographic Project and the 1000 Genomes Project. To overcome the last impediment, we developed an admixture-based Geographic Population Structure (GPS) method that infers the biogeography of worldwide individuals down to their village of origin. GPS's accuracy was demonstrated on three data sets: worldwide populations, Southeast Asians and Oceanians, and Sardinians (Italy) using 40,000-130,000 GenoChip markers. GPS correctly placed 80% of worldwide individuals within their country of origin with an accuracy of 87% for Asians and Oceanians. Applied to over 200 Sardinian villagers of both sexes, GPS placed a quarter of them within their villages and most of the remainder within 50 km of their villages, allowing us to identify the demographic processes that shaped Sardinian society. These findings are significantly more accurate than PCA-based approaches. We further demonstrate two GPS applications in tracing the poorly understood biogeographical origin of the Druze and North American (CEU) populations. Our findings demonstrate the potential of the GenoChip array for genetic anthropology.
Moreover, the accuracy and power of GPS underscore the promise of admixture-based methods for biogeography and have important ramifications for genetic ancestry testing, forensic and medical sciences, and genetic privacy.


Registration will be via the main MASAMB registration page
This document last modified Sunday, 13-Apr-2014 07:04:47 BST