Protein sequence landscapes: Using massive genomic data for statistical inference, structure prediction and de novo design
Martin Weigt (Pierre et Marie Curie / Sorbonne Université) will talk about generative models built from protein sequence data and their concrete applications.
Abstract: Over the last few years, biological research has been revolutionized by experimental high-throughput techniques. Unprecedented amounts of data are accumulating, creating an urgent need for data-driven modeling approaches that unveil the information hidden in raw data and thereby increase our understanding of complex biological systems. To give a specific example, proteins show a remarkable degree of structural and functional conservation over billions of years of evolution, despite the large variability of their amino-acid sequences.
Thanks to modern sequencing techniques, this amino-acid variability is easily observable, in contrast to the time- and labour-intensive experiments needed to determine, e.g., the three-dimensional fold of a protein or its biological function. He will present recent developments around the so-called Direct-Coupling Analysis (DCA), a statistical-inference approach linking sequence variability to protein structure and function. He will show that DCA can be used (i) to infer contacts between residues, and thus to guide 3D-structure prediction of proteins and their complexes, and (ii) to reconstruct mutational landscapes, and thus to predict the effect of mutations.
Beyond these direct inference tasks, he will present evidence that these models can be used to develop novel approaches to data-driven de novo protein design.
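
For readers unfamiliar with DCA, the two applications named in the abstract can be illustrated with a minimal sketch. It assumes the Potts-model form in which DCA is usually formulated, with site fields h and pairwise couplings J. The function names, the toy parameters and the four-site "protein" below are hypothetical and chosen only for illustration; they are not taken from the talk, and no parameter inference from real sequence data is performed here.

```python
import numpy as np

def potts_energy(seq, h, J):
    """Statistical energy E(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j)."""
    L = len(seq)
    energy = -sum(h[i, seq[i]] for i in range(L))
    for i in range(L):
        for j in range(i + 1, L):
            energy -= J[i, j, seq[i], seq[j]]
    return float(energy)

def mutation_effect(seq, pos, new_state, h, J):
    """Energy change of one substitution; positive means predicted less favourable."""
    mutant = list(seq)
    mutant[pos] = new_state
    return potts_energy(mutant, h, J) - potts_energy(seq, h, J)

def contact_scores(J):
    """Frobenius norm of each q x q coupling block, a commonly used contact score."""
    return np.sqrt((J ** 2).sum(axis=(2, 3)))

# Toy parameters (assumption): 4 sites, 3 states, a single coupled pair (0, 3).
L, q = 4, 3
h = np.zeros((L, q))
J = np.zeros((L, L, q, q))
J[0, 3] = np.eye(q)      # sites 0 and 3 favour carrying the same state
J[3, 0] = J[0, 3].T      # keep the coupling tensor symmetric

wild_type = [1, 0, 2, 1]  # sites 0 and 3 agree, so the coupling is satisfied
scores = contact_scores(J)
delta = mutation_effect(wild_type, pos=3, new_state=2, h=h, J=J)
```

In this toy setting the only non-zero contact score is the one for the pair (0, 3), and mutating site 3 away from the state shared with site 0 raises the energy by 1.0, i.e. the substitution is predicted to be unfavourable. In an actual DCA pipeline, h and J would first be inferred from a multiple-sequence alignment of homologous proteins.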