Fri, Jan. 5, 2018, 3:15pm
Edward C. Taylor Auditorium, Frick B02
Host: Annabella Selloni
Learning models from data – applying covariance analysis to sequences and molecules to infer structure and functional interactions
My research program aims to find novel ways of using large data sets to infer protein 3D structure and functional interactions between proteins and the ligands they bind. Recent work has focused on covariance analysis: inferring interactions between sites by analyzing datasets of molecules with similar functional properties. The variables, or sites, are typically sequence positions in proteins or cheminformatic descriptor elements in small molecules. Previously, we demonstrated that protein sequence alignments can be used to accurately predict protein tertiary and quaternary structure. More recently, we have shown that these methods can be extended to accurately predict binding between protein interaction partners, even in the absence of experimentally acquired training data. In general, the signal from covariance alone is too noisy to predict functional interactions (despite recent claims to the contrary). For protein sequence data, a major contribution to this noise is phylogeny: shared ancestry among the sequences causes spurious covariance between amino acid sites. We have recently discovered that methods from random matrix theory can be applied to remove these phylogenetic artifacts from the inferred couplings, promising significant methodological improvements. For ligands, we have shown that random matrix theory can be used in conjunction with classical chemical fingerprints to predict the molecular fragments that enable ligands to bind to proteins. We have recently developed a software tool, envision, for extracting such predictions from ChEMBL. Finally, I will discuss current efforts to further enhance these methods using recent advances in deep learning.