2011-05-16

Whole-organism integrative expressome for C. elegans enables in silico study of developmental regulation

Thesis Defense: Whole-organism integrative expressome for C. elegans enables in silico study of developmental regulation


Author: Luke A. D. Hutchison
Co-Advisors: Prof. Isaac S. Kohane, Prof. Bonnie A. Berger

Date: Tuesday May 17, 2011
Time: 11am - 12:15pm
Location: MIT CSAIL Stata Center, Patil/Kiva seminar room, 32-G449

Short abstract:  [tl;dr]

The C. elegans nematode has been extensively studied as a model organism since the 1970s, and is the only organism for which the complete cell division tree and the genome are both available. These two datasets were integrated with a number of other datasets available at WormBase.org, such as the anatomy ontology, gene expression profiles extracted from 8000 peer-reviewed papers, and metadata about each gene, to produce the first ever whole-organism, cell-resolution map of gene expression across the entire developmental timeline of the organism, with the goal of finding genomic features that regulate cell division and tissue differentiation. Contingency testing was performed to find correlations between thousands of gene attributes (e.g. the presence or absence of a specific 8-mer in the 3' UTR, the GC-content of the sequence upstream of the transcriptional start site, etc.) and thousands of cell attributes (e.g. whether cells that express specific genes die through apoptosis, whether cells become neurons or not, whether cells move in the anterior or posterior direction after division). The resulting database of contingency test scores allows us to quickly ask a large number of biologically interesting questions, such as “Does the length of introns of expressed genes increase across the developmental timeline?”; “Across what period of development and in which cell types is this specific gene most active?”; “Do regulatory motifs exist that switch on or off genes in whole subtrees of the cell pedigree?”; “Which genes are most strongly implicated in apoptosis?”, etc. This whole-organism expressome enables direct and powerful in silico analysis of development.

Long Abstract:

The C. elegans nematode has been extensively studied as a model organism since the 1970s. C. elegans was also the first multicellular organism to have its genome fully sequenced, and it is the only organism for which the complete tree of cell divisions is known, from the zygote to the fully-developed adult worm. By integrating these two datasets with a number of other datasets available at WormBase.org, it is possible to start looking for a mapping from the C. elegans genome to its cell division tree, i.e. to identify genomic regulators of cell fate and cell phenotype.

Two different versions of the cell fate tree for C. elegans were linked and merged to maximize the metadata available for each cell. The merged cell fate tree was then cross-linked with the anatomy ontology, a hierarchical map of containment and relatedness of the worm's anatomical features. Reachability analysis was performed on the anatomy ontology to obtain a list of the organs and tissue types that each cell is part of. A dataset of reported expression levels of thousands of genes in different tissue types and organs, as extracted from the gene expression results in 8000 peer-reviewed papers, was cross-linked with the anatomy ontology, and gene expression reported at tissue or organ level was propagated through the anatomy ontology to the individual cells that comprise those anatomical features. A gene metadata database was also integrated to provide metadata about the genes active in each cell. This combination of the two linked cell fate trees, the anatomy ontology, the gene expression database and the gene metadata database yields the first whole-organism, cell-resolution map of gene expression across the entire developmental timeline of the organism.
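To make the reachability and propagation steps concrete, here is a minimal sketch in Python. It is not the thesis pipeline: the ontology, the term names (`cell_a`, `pharyngeal_neuron`, etc.), the gene names and the functions `ancestors` and `propagate_expression` are all hypothetical, and the real WormBase data is far richer. It only illustrates the idea of walking "part of" edges upward from each cell and collecting the genes reported for every anatomy term that contains the cell.

```python
# Hypothetical toy anatomy ontology: term -> terms it is directly part of.
PART_OF = {
    "cell_a": ["pharyngeal_neuron"],
    "pharyngeal_neuron": ["neuron", "pharynx"],
    "neuron": ["nervous_system"],
    "cell_b": ["pharyngeal_muscle"],
    "pharyngeal_muscle": ["pharynx"],
}

# Hypothetical expression reports: anatomy term -> genes reported there.
EXPRESSED_IN = {
    "nervous_system": {"gene_1"},
    "pharynx": {"gene_2"},
    "cell_b": {"gene_3"},
}


def ancestors(term, part_of):
    """Return every term reachable from `term` by following part-of edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in part_of.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate_expression(cells, part_of, expressed_in):
    """Map each cell to the genes reported for it or for any term containing it."""
    genes_by_cell = {}
    for cell in cells:
        genes = set()
        for term in ancestors(cell, part_of) | {cell}:
            genes |= expressed_in.get(term, set())
        genes_by_cell[cell] = genes
    return genes_by_cell


if __name__ == "__main__":
    result = propagate_expression(["cell_a", "cell_b"], PART_OF, EXPRESSED_IN)
    for cell, genes in sorted(result.items()):
        print(cell, sorted(genes))
```

In this toy setup a gene reported only at the level of the pharynx ends up attributed to every cell that is part of the pharynx, which is the kind of tissue-to-cell propagation described above.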

Given this integrated database of gene expression, contingency testing was performed to find correlations between thousands of different potential gene attributes (e.g. the presence or absence of a specific 8-mer in the 3' UTR, the GC-content of the sequence upstream of the transcriptional start site, etc.) and thousands of different potential cell attributes (e.g. whether cells that express specific genes die through apoptosis, whether they become neurons or not, whether they merge into syncytia, whether they move in the anterior or posterior direction after division). The resulting database of contingency test scores allows us to quickly ask a large number of biologically interesting questions, such as "Does the length of the introns of expressed genes increase across the developmental timeline?"; "Across what period of development and in which cell types is this specific gene most active?"; "Do regulatory motifs exist that switch on or off genes in whole subtrees of the cell pedigree?"; "Which genes are most strongly implicated in apoptosis?"; "Which genes cause cells to stop dividing and become leaf nodes in the cell pedigree?", etc. In querying for genes correlated with apoptosis in cells or daughter cells, for example, the database lists a large number of genes that have not previously been implicated in apoptosis. This whole-organism expressome enables direct and powerful in silico analysis of development on an unprecedented scale.
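As a rough illustration of a single contingency test, the sketch below builds a 2x2 table for one gene attribute against one cell attribute and scores it with Pearson's chi-square statistic. Several things here are assumptions rather than details from the thesis: that the unit being counted is a (gene, cell) expression pair, that chi-square is the scoring statistic, and all names and data (`contingency_table`, `chi_square`, `g1`, `c1`, and so on).

```python
def contingency_table(expression_pairs, genes_with_attr, cells_with_attr):
    """Count (gene, cell) expression pairs into a 2x2 table.

    Rows: gene has the attribute / does not.
    Columns: cell has the attribute / does not.
    """
    table = [[0, 0], [0, 0]]
    for gene, cell in expression_pairs:
        row = 0 if gene in genes_with_attr else 1
        col = 0 if cell in cells_with_attr else 1
        table[row][col] += 1
    return table


def chi_square(table):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    if 0 in row_totals or 0 in col_totals:
        return 0.0  # Degenerate table: the attribute never varies.
    denominator = row_totals[0] * row_totals[1] * col_totals[0] * col_totals[1]
    return n * (a * d - b * c) ** 2 / denominator


if __name__ == "__main__":
    # Hypothetical data: which genes are expressed in which cells.
    pairs = [
        ("g1", "c1"), ("g1", "c2"), ("g2", "c1"), ("g2", "c3"),
        ("g3", "c3"), ("g3", "c4"), ("g4", "c3"), ("g4", "c4"),
    ]
    genes_with_motif = {"g1", "g2"}   # e.g. genes with some 8-mer in the 3' UTR
    apoptotic_cells = {"c1", "c2"}    # e.g. cells that die through apoptosis
    table = contingency_table(pairs, genes_with_motif, apoptotic_cells)
    print(table, chi_square(table))
```

Repeating a test like this for every (gene attribute, cell attribute) combination would give a table of scores that can be sorted and queried. With thousands of attributes on each side, some form of multiple-testing correction would presumably also be needed before reading much into any single score.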

Finally, the increase in the amount of biological data being produced per year is far outstripping Moore's Law, but more importantly, language support for easily building large parallel data manipulation pipelines, like the one described above, is sorely lacking. As a result, either cores sit unused or programmers spend inordinate amounts of time manually parallelizing their code to make use of the available cores, an error-prone process. This is often termed "the multicore dilemma". The data transformation pipeline that integrates these various C. elegans data sources exhibited a number of repeating design patterns that directly gave rise to a new paradigm for building implicitly-parallelizing programming languages, known as Flow. The Flow paradigm is not central to the thesis research itself, but will be briefly described if there is time at the end of the defense.

4 comments:

  1. Sounds fascinating, Luke. Can't wait to read it.

  2. Thanks Carl -- honestly I wouldn't wish reading the whole long thesis on anybody ;) Once it's done I'll be turning it into paper form and it'll be more compact and digestible at that point.

  3. As you complete your thesis and move to bigger things, may the focus of your intent remain on the greater will, and may all that you do bring beauty and joy to the world. You will go as far as you choose. May the force be with you!!
    Congratulations Luke.
    Holley

  4. Thanks Holley for the inspiration!
