2011-05-16

Whole-organism integrative expressome for C. elegans enables in silico study of developmental regulation

Thesis Defense: Whole-organism integrative expressome for C. elegans enables in silico study of developmental regulation


Author: Luke A. D. Hutchison
Co-Advisors: Prof. Isaac S. Kohane, Prof. Bonnie A. Berger

Date: Tuesday May 17, 2011
Time: 11am - 12:15pm
Location: MIT CSAIL Stata Center, Patil/Kiva seminar room, 32-G449

Short abstract:  [tl;dr]

The C. elegans nematode has been extensively studied as a model organism since the 1970s, and is the only organism for which the complete cell division tree and the genome are both available. These two datasets were integrated with a number of other datasets available at WormBase.org, such as the anatomy ontology, gene expression profiles extracted from 8000 peer-reviewed papers, and metadata about each gene, to produce the first ever whole-organism, cell-resolution map of gene expression across the entire developmental timeline of the organism, with the goal to find genomic features that regulate cell division and tissue differentiation. Contingency testing was performed to find correlations between thousands of gene attributes (e.g. the presence or absence of a specific 8-mer in the 3' UTR, the GC-content of the sequence upstream of the transcriptional start site, etc.) and thousands of cell attributes (e.g. whether cells that express specific genes die through apoptosis, whether cells become neurons or not, whether cells move in the anterior or posterior direction after division). The resulting database of contingency test scores allow us to quickly ask a large number of biologically-interesting questions, like, “Does the length of introns of expressed genes increase across the developmental timeline?”; “Across what period of development and in which cell types is this specific gene most active?”; “Do regulatory motifs exist that switch on or off genes in whole subtrees of the cell pedigree?”; “Which genes are most strongly implicated in apoptosis?”, etc. This whole-organism expressome enables direct and powerful in silico analysis of development.

Long Abstract:

The C. elegans nematode has been extensively studied as a model organism since the 1970s. C. elegans was also the first organism to have its genome fully sequenced, and it is the only organism for which the complete tree of cell divisions is known, from the zygote to the fully-developed adult worm. By integrating these two datasets with a number of other datasets available at WormBase.org, it is possible to start looking for a mapping from the C. elegans genome to its cell division tree, i.e. to identify genomic regulators of cell fate and cell phenotype.

Two different versions of the cell fate tree for C. elegans were linked and merged to maximize the metadata available for each cell, then the cell fate tree was cross-linked with the anatomy ontology, or hierarchical map of containment and relatedness of the worm's anatomical features. Reachability analysis was performed on the anatomy ontology to obtain a list of organs and tissue types that each cell is part of. A dataset of reported expression levels of thousands of genes in different tissue types and organs, as extracted from the gene expression results in 8000 peer-reviewed papers, was cross-linked with the anatomy ontology, and gene expression reported at tissue or organ level was propagated through the anatomy ontology to the individual cells that comprise those anatomical features. A gene metadata database was also integrated to provide metadata about the genes active in each cell. This combination of the two linked cell fate trees, the anatomy ontology, the gene expression database and the gene metadata database yields the first whole-organism, cell-resolution map of gene expression across the entire developmental timeline of the organism.

Given this integrated database of gene expression, contingency testing was performed to find correlations between thousands of different potential gene attributes (e.g. the presence or absence of a specific 8-mer in the 3' UTR, the GC-content of the sequence upstream of the transcriptional start site, etc.) and thousands of different potential cell attributes (e.g. whether cells that express specific genes die through apoptosis, whether they become neurons or not, whether they merge into syncitia, whether they move in the anterior or posterior direction after division). The resulting database of contingency test scores allow us to quickly ask a large number of biologically-interesting questions, like "Does the length of the introns of expressed genes increase across the developmental timeline?"; "Across what period of development and in which cell types is this specific gene most active?"; "Do regulatory motifs exist that switch on or off genes in whole subtrees of the cell pedigree?"; "Which genes are most strongly implicated in apoptosis?"; "Which genes cause cells to stop dividing and become leaf nodes in the cell pedigree?", etc. In querying for genes correlated with apoptosis in cells or daughter cells, for example, the database lists a large number of genes that have not previously been implicated in apoptosis. This whole-organism expressome enables direct and powerful in silico analysis of development on an unprecedented scale.

Finally, the increase in the amount of biological data being produced per year is far outstripping Moore's Law, but more importantly, language support for easily building large parallel data manipulation pipelines, like the one described above, is sorely lacking. As a result cores sit unused or programmers spend inordinate amounts of time manually parallelizing their code to make use of the available cores, which is error-prone. This is often termed "the multicore dilemma". The data transformation pipeline that integrates these various C. elegans data sources exhibited a number of repeating design patterns that directly gave rise to a new paradigm for building implicitly-parallelizing programming languages, known as Flow. The Flow paradigm is not central to the thesis research itself, but will be briefly described if there is time at the end of the defense.

2011-05-04

On Intel 3D chips and human 4D brains. (And flying cars.)

Intel today announced a breakthrough in making "3D chips". This is being misconstrued by many news outlets as meaning chips that are built out of 3D bricks of logic gates. Actually it's nothing of the sort. These chips are fundamentally 2D (or close to 2D) in layout, they just have conductive rails with U-shaped 3D cross sections. This makes for better transistors because the feature sizes they're working with today are already approaching the size of X-ray wavelengths, meaning it's getting harder and harder to manufacture conductive elements that are unbroken and not "blobby" using lithography. Giving the rails a U-shaped cross section triples the available conductive area, and gives you rails that are more uniform and less blobby in profile view, which means you get less random variance in the resistance of each wire, etc. So we have a factor of 3 improvement in area, and maybe a factor of 10 improvement in uniformity of current flow. That should give us, maybe, 10 more years' jump on Moore's Law ;)

Yes, this is a big manufacturing achievement, but as stated above, these are not 3D chips. However anything that can be done to get out of a single 2D plane is a huge step forward in terms of graph layout: you can't embed any graph that has a subgraph of K_3,3 (bipartite graph with 3 nodes in each part) or K_5 (completely connected graph with 5 nodes) into the 2D plane -- which is exactly why with roads, we have intersections and traffic jams -- except in the case of freeways, which to achieve unimpeded flow are often a spaghetti mess of raised bridges.  And this is also why we need flying cars.

So current chips already are not truly 2D or you simply could not produce them, it would be mathematically impossible. There are at least 3 layers of silicon (because that's how you create a transistor anyway). On a macro scale, most motherboards these days have something like 10-15 layers, because that's the only feasible way to lay all the wires. The more layers you can add, the less constrained you are in the layout. So we really just need to be figuring out ways to stack maybe 10-15 layers of silicon and we'd have a huge win.

I suspect that with truly 3D chips -- were you try to pack all the gates into a volume that's closer to a cube than a chip -- the biggest problems will be power density issues, and those issues will be major. We already can't wick away heat fast enough, and with today's (nearly-) 2D chips, the surface area to volume ratio (which is the critical aspect for heat transfer) is already maximized. So chips would have to run far more slowly if they were 3D in structure. (This is why the brain operates at something like 200 Hz and is massively parallel -- but it also only consumes something like 0.5W of power, so it is billions of times more energy efficient than today's computers). On the up-side, all the interconnected parts of a truly 3D chip would be physically closer together, so data path lengths would be much smaller than with current chips.

This is one huge advantage of the brain -- that wires don't have to deal with 2D crossings issues, and can dramatically reduce interconnect distances -- and this is also why if our brains were 4D they could operate much, much faster: in fact if we had 3D neurons embedded in a 4D space, all pairs of neurons could be interconnected with an axonal distance of close to zero :-)

From an energy density and energy efficiency standpoint though, the brain is looking more and more amazing all the time...

2011-05-03

Spoiler alert: Code to solve GoogleNexus Challenge #4

SPOILER ALERT: Don't read this if you haven't yet given up on GoogleNexus challenge #4!
It's a great puzzle, try to figure it out first.

--

Yesterday's GoogleNexus Challenge took you to this page which linked to this map of urban rail systems.  There were actually two puzzles embedded in the challenge: The first puzzle involved the colored circles at the top, which corresponded to line crossings, and the numbers corresponded to the index of the letter to take in the crossing name.  Reading off these letters gave LONDON.  Then there was a list of seemingly random words which are actually anagrams of London Tube stop names, only missing a letter.  The missing letters (one per line) spell out the answer, "Please send me a Nexus S so Google goodies I can access".

You could totally do this by hand if you're good with anagrams, but since comparing 44 anagrams (that are each missing a letter) to 306 tube stops is not my idea of fun, I wrote code to do it.

Here's the code I wrote to solve the puzzle (after copying the list of tube stops from Wikipedia).  I basically cross out one letter at a time  from the anagram in the station name, then output the single letter that results.

Don't judge my code, I wrote this at light speed...




import java.util.ArrayList;


public class Main {


    static String[] stations = {


    "Acton Town", "Aldgate", "Aldgate East", "All Saints", "Alperton", "Amersham", "Angel", "Archway", "Arnos Grove", "Arsenal",
            "Baker Street", "Balham", "Bank", "Barbican", "Barking", "Barkingside", "Barons Court", "Bayswater", "Beckton",
            "Beckton Park", "Becontree", "Belsize Park", "Bermondsey", "Bethnal Green", "Blackfriars [ 9 ]", "Blackhorse Road",
            "Blackwall", "Bond Street", "Borough", "Boston Manor", "Bounds Green", "Bow Church", "Bow Road", "Brent Cross",
            "Brixton", "Bromley-by-Bow", "Buckhurst Hill", "Burnt Oak", "Caledonian Road", "Camden Town", "Canada Water",
            "Canary Wharf", "Canary Wharf", "Canning Town", "Cannon Street", "Canons Park", "Chalfont & Latimer", "Chalk Farm",
            "Chancery Lane", "Charing Cross", "Chesham", "Chigwell", "Chiswick Park", "Chorleywood", "Clapham Common",
            "Clapham North", "Clapham South", "Cockfosters", "Colindale", "Colliers Wood", "Covent Garden", "Crossharbour",
            "Croxley", "Custom House", "Cutty Sark for Maritime Greenwich", "Cyprus", "Dagenham East", "Dagenham Heathway",
            "Debden", "Deptford Bridge", "Devons Road", "Dollis Hill", "Ealing Broadway", "Ealing Common", "Earl's Court",
            "East Acton", "East Finchley", "East Ham", "East India", "East Putney", "Eastcote", "Edgware", "Edgware Road",
            "Edgware Road", "Elephant & Castle", "Elm Park", "Elverson Road", "Embankment", "Epping", "Euston", "Euston Square",
            "Fairlop", "Farringdon", "Finchley Central", "Finchley Road", "Finsbury Park", "Fulham Broadway", "Gallions Reach",
            "Gants Hill", "Gloucester Road", "Golders Green", "Goldhawk Road", "Goodge Street", "Grange Hill",
            "Great Portland Street", "Greenford", "Green Park", "Greenwich", "Gunnersbury", "Hainault", "Hammersmith",
            "Hammersmith", "Hampstead", "Hanger Lane", "Harlesden", "Harrow & Wealdstone", "Harrow-on-the-Hill", "Hatton Cross",
            "Heathrow Terminals 1, 2, 3", "Heathrow Terminal 4", "Heathrow Terminal 5", "Hendon Central", "Heron Quays",
            "High Barnet", "Highbury & Islington", "Highgate", "High Street Kensington", "Hillingdon", "Holborn", "Holland Park",
            "Holloway Road", "Hornchurch", "Hounslow Central", "Hounslow East", "Hounslow West", "Hyde Park Corner", "Ickenham",
            "Island Gardens", "Kennington", "Kensal Green", "Kensington (Olympia)", "Kentish Town", "Kenton", "Kew Gardens",
            "Kilburn", "Kilburn Park", "King George V", "Kingsbury", "King's Cross St. Pancras", "Knightsbridge", "Ladbroke Grove",
            "Lambeth North", "Lancaster Gate", "Langdon Park", "Latimer Road", "Leicester Square", "Lewisham", "Leyton",
            "Leytonstone", "Limehouse", "Liverpool Street", "London Bridge", "London City Airport", "Loughton", "Maida Vale",
            "Manor House", "Mansion House", "Marble Arch", "Marylebone", "Mile End", "Mill Hill East", "Monument", "Moorgate",
            "Moor Park", "Morden", "Mornington Crescent", "Mudchute", "Neasden", "Newbury Park", "North Acton", "North Ealing",
            "North Greenwich", "North Harrow", "North Wembley", "Northfields", "Northolt", "Northwick Park", "Northwood",
            "Northwood Hills", "Notting Hill Gate", "Oakwood", "Old Street", "Osterley", "Oval", "Oxford Circus", "Paddington",
            "Park Royal", "Parsons Green", "Perivale", "Piccadilly Circus", "Pimlico", "Pinner", "Plaistow", "Pontoon Dock",
            "Poplar", "Preston Road", "Prince Regent", "Pudding Mill Lane", "Putney Bridge", "Queen's Park", "Queensbury",
            "Queensway", "Ravenscourt Park", "Rayners Lane", "Redbridge", "Regent's Park", "Richmond", "Rickmansworth",
            "Roding Valley", "Royal Albert", "Royal Oak", "Royal Victoria", "Ruislip", "Ruislip Gardens", "Ruislip Manor",
            "Russell Square", "St. James's Park", "St. John's Wood", "St. Paul's", "Seven Sisters", "Shadwell", "Shepherd's Bush",
            "Shepherd's Bush Market", "Sloane Square", "Snaresbrook", "South Ealing", "South Harrow", "South Kensington",
            "South Kenton", "South Quay", "South Ruislip", "South Wimbledon", "South Woodford", "Southfields", "Southgate",
            "Southwark", "Stamford Brook", "Stanmore", "Stepney Green", "Stockwell", "Stonebridge Park", "Stratford",
            "Sudbury Hill", "Sudbury Town", "Swiss Cottage", "Temple", "Theydon Bois", "Tooting Bec", "Tooting Broadway",
            "Tottenham Court Road", "Tottenham Hale", "Totteridge & Whetstone", "Tower Gateway", "Tower Hill", "Tufnell Park",
            "Turnham Green", "Turnpike Lane", "Upminster", "Upminster Bridge", "Upney", "Upton Park", "Uxbridge", "Vauxhall",
            "Victoria", "Walthamstow Central", "Wanstead", "Warren Street", "Warwick Avenue", "Waterloo", "Watford",
            "Wembley Central", "Wembley Park", "West Acton", "West Brompton", "West Finchley", "West Ham", "West Hampstead",
            "West Harrow", "West India Quay", "West Kensington", "West Ruislip", "West Silvertown", "Westbourne Park", "Westferry",
            "Westminster", "White City", "Whitechapel", "Willesden Green", "Willesden Junction", "Wimbledon", "Wimbledon Park",
            "Wood Green", "Wood Lane", "Woodford", "Woodside Park", "Woolwich Arsenal" };


    static String[] matches = { "renumber digits", "rub ink", "long hiatus", "big redskin", "damp heat", "calming moon",
            "third felon", "neutral pink", "technical flyer", "robs enemy", "sea honour", "saintly chef", "two cents",
            "old enchanter", "sparkle biz", "crocus fiord", "dusty brown", "ruby queen", "rich garcons", "sardine gland",
            "sans broker", "hated seaman", "not tensely", "gloved barker", "noted cavern", "evilest trooper", "scarlet esquire",
            "educators role", "manly beer", "carrot nubs", "thinks bigger", "hawthorn holler", "nocturnal howls", "diurnal gripes",
            "mad realtor", "gas alternate", "civil oratory", "orange press", "english carol", "tallest peahen", "rare lotus",
            "clan idol", "concerning torment", "darn heel" };


    public static void main(String[] args) {
        ArrayList<String> stops = new ArrayList<String>();
        for (String s : stations) {
            String sout = "";
            for (int i = 0; i < s.length(); i++) {
                char c = Character.toLowerCase(s.charAt(i));
                if (c >= 'a' && c <= 'z')
                    sout += c;
            }
            stops.add(sout);
        }


        for (String m : matches) {
            for (String s : stops) {
                if (s.length() == m.length()) {
                    StringBuffer buf = new StringBuffer(s);
                    for (int i = 0; i < m.length(); i++) {
                        char c = m.charAt(i);
                        if (c != ' ') {
                            int p = buf.indexOf(c + "");
                            if (p >= 0) {
                                buf.setCharAt(p, ' ');
                            }
                        }
                    }
                    String bs = buf.toString().trim();
                    if (bs.length() == 1)
                        System.out.print(bs);
                }
            }
        }
    }
}




2011-05-02

Is the MLK quote fake? "I mourn the loss of thousands of precious lives..."

(or, "On not rejoicing in Osama bin Laden's death"...)

The following quote spread all over the Internet in the last 24 hours:

"I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy. Returning hate for hate multiplies hate, adding deeper darkness to a night already devoid of stars. Darkness cannot drive out darkness: only light can do that. Hate cannot drive out hate: only love can do that"
— Martin Luther King Jr.


Megan McCardle, editor for The Atlantic, posted a blog entry saying that MLK "probably" didn't say this.  However, MLK said everything but the first sentence.



The original source of the mis-quote was finally tracked down.


This was just a copy/paste error: somebody made the statement in the first sentence -- which was their own assessment of the situation -- and then pasted MLK's quote below it.  Then somebody else copied the whole thing, and posted it to facebook, joining the two quotes together so it looked like it was all attributed to MLK. Then it spread all over the Internet, and then the followup spread all over the Internet to say that it was fake.  Which it is not -- it should just be split into two pieces like this:

"I mourn the loss of thousands of precious lives, but I will not rejoice in the death of one, not even an enemy."
— Jessica Dovey
"Returning hate for hate multiplies hate, adding deeper darkness to a night already devoid of stars. Darkness cannot drive out darkness: only light can do that. Hate cannot drive out hate: only love can do that."
— Martin Luther King Jr. [source, Google Books]

Even though MLK didn't say the first part, Jessica's comments are eloquent and I completely agree with them.

While I'm on the subject, here are a couple of other related quotes:

"No man is an island, entire of itself; every man is a piece of the continent, a part of the main. . . any man's death diminishes me, because I am involved in mankind" —John Donne, Meditation XVII


"Always forgive your enemies. Nothing annoys them more." --Oscar Wilde