Achievement unlocked: got my green card -- and fixing the immigration system

Major life update: I received the notification email this morning that my green card was approved. This ends a 12-year saga in which the US taxpayer spent literally millions on my education while the US government constantly threatened to boot me out once it was all over. (Please, Congress, pass some serious immigration reform soon, instead of educating foreign nationals at the best schools and then sending them home to compete with the US.) A green card at least gives me freedom to work wherever I want, whenever I want, even if I'm not a full citizen yet -- and I don't have to worry about a visa expiring.

As a foreign student, I found it pretty stressful to spend many years "living in exile", without the full rights and freedoms of citizens of the country. Being even mildly freedom-restricted for that long is a real psychological burden. (The closest description I know of what this burden feels like is Longfellow's poem about unfinished TODO items, Something Left Undone.)

Also, as a result of having gone through this, I'm incensed at the treatment of people whose immigration situation is much more difficult than mine: the very people who keep this country running by working all the mundane, hard jobs that nobody else wants to do, and for many of whom this is the only home they have ever known -- and yet these are people the US government does not recognize as "real Americans".

I hope for the future of this nation that these problems are fixed. Let's treat all valuable long-time contributors to this country as real people by making them citizens, and let's staple a green card to every advanced degree, so that the US isn't simply training up the next generation of global competition. By the time the US realizes, 15 years from now, that it has definitively lost its edge as world leader, it will be much too late.


How to protect kids from objectionable content online

The nephew of a friend of mine is 9 years old and has recently been demonstrating his ability to search for almost anything online. My friend asked how to protect him from inappropriate content. Here's my response, in case it is helpful to anybody else:

It's almost impossible to block this stuff, and kids will get around almost any filter you can install, but there is a lot you can do to protect them. The two most important points are to reduce accidental exposure, and to create a healthy relationship between kids and parents, so that there is an open dialog and a non-guilt-ridden climate around the topic.

In more detail, you need a multi-pronged approach to make any headway with this issue of protecting kids:
  1. Good parenting: this is the most important aspect of the issue, obviously. This includes talking to them about this stuff and not "criminalizing" things to the point where they will hide: they need to feel free to talk to their parents about everything.
  2. Putting the computer in a public thoroughfare (like the living room) can help.
  3. Install Google Chrome, and remove the icons for Internet Explorer on the Desktop, in the Quick Launch toolbar, and in the recent apps in the Start menu (just leave one icon in the Start menu if you think you absolutely need it). Kids should use Chrome because they are less likely to get hit by "drive-by" virus installations that pop up porn when they visit random sites on the Internet -- Internet Explorer is extremely vulnerable to virus infection.
  4. In Chrome, disable pop-ups and Flash, since this is how a lot of unintentional exposure to porn happens: go to Settings -> Show Advanced Settings -> Privacy -> Content Settings; from there, set Plug-ins to "Click to Play", and set Pop-ups to "Do not allow any sites to show pop-ups". Now any site with Flash content (e.g. YouTube) will display a big gray box that you have to manually click before the content plays. Switching on HTML5 video mode in YouTube makes videos a little less annoying to watch (no clicking to view Flash), unless you're deliberately trying to make it harder in general for kids to see Flash content.
  5. Install the AdBlock extension for Chrome, since ads are a big source of accidental exposure to inappropriate content.  
  6. [This one is usually missed by people]: Switch your DNS settings to use OpenDNS. This will make it impossible to browse to most adult-themed websites without circumvention measures, even if the kid can still find bad content in Web search results.
  7. Set up a Gmail account for the kid, make sure the account is logged in, then lock the SafeSearch settings in Google search and in YouTube. If you really care about locking SafeSearch down, you can pay for SafeSearchLock, which locks down safe search across numerous search engines. It only costs £2, or about $3.
  8. Finally, you could install NetNanny.  I recommend this last, because a lot of the above things can help a huge amount without resorting to filtering, and if parents trust in a filter, it can give a false sense of security. Filtering is authoritarian, which can create frustration and a feeling of inequality and injustice in kids. Filters are also porous and pretty easy to subvert if the kid is really determined. But in spite of these caveats, these days filters like NetNanny do work pretty well at both making it almost impossible to be accidentally exposed to inappropriate material in the browser, and making it very hard to intentionally browse to inappropriate material too without a parent manually typing in a password. Nevertheless, be aware of the Streisand Effect, wherein attempting to suppress information can result in the opposite of the desired effect. Also consider that if you tell kids, "don't look in the box" or "don't think about pink elephants", what are they going to do? 
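For step 6, here's roughly what the change looks like if you set it directly on a Linux machine. This is just a sketch: the resolver addresses below are OpenDNS's as I remember them, so double-check them against OpenDNS's current setup instructions before relying on them.

```
# /etc/resolv.conf -- route all DNS lookups through OpenDNS
nameserver 208.67.222.222
nameserver 208.67.220.220
```

Setting the DNS servers on the home router instead covers every device in the house at once. Note that the content-category filtering itself is configured in your OpenDNS account settings; OpenDNS also offers a preconfigured adult-content-blocking variant (FamilyShield) with its own resolver addresses.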
That covers most of the bases. I hope this is helpful!


How to synchronize collaborative music performances with Google+ Hangouts

Someone just asked me via email how she can synchronize her singing over a Google+ Hangout with a musician on the other end, when it seemed like there was a time delay that was tripping them up. She wanted to know how to eliminate the time delay, or if Google was planning to eliminate it at some point.  Here is my reply, posted here in case it helps somebody else.


Great question! It was my mother's birthday last week, and the family got together in four different venues over Google Hangout to sing "When I'm 64" to her for her birthday :-)  I called it before we even started the hangout: the latency (delay) would cause us to all keep slowing down to let each other catch up, then realize that everybody was getting further behind, so we all needed to speed up and skip ahead, and then we would slow down again, etc.  Sure enough, every few seconds we seemed to have singing synchronization issues. It made the whole thing a lot funnier, but it wouldn't work for your situation at all!

In the general case, this is not solvable, for the same reason that Einstein showed simultaneity to be relative: when it takes a non-zero amount of time to send information from point A to point B and back again, it's impossible for both points to agree on a global concept of "now". You simply cannot reduce the latency of a network connection to zero, much less that of a complex streamed application like a Hangout, and the further apart you are in the world, the greater the expected latency.

The way that this has been solved in the past (e.g. by that massive virtual orchestra / virtual choir project that has been run over YouTube a couple of times before) was to pre-record the music, and have each singer play the sound in their headphones while singing / playing. Then they each separately recorded their videos and sent them to someone who mixed them down into a single track, offline, after they had all finished recording their separate tracks.  i.e. they simply avoided the problem entirely by not performing simultaneously :-)

If I were you, I would experiment with performing simultaneously anyway: have one of you (the one on the recording end) sing exactly in time with the other person, and have the other person play/sing exactly 2x the one-way delay time ahead. The trick is for the performer who is playing ahead (not on the recording end) to set the tempo and pay no attention to the person on the recording end (i.e. not slow down to let them catch up). As long as the person on the recording end stays on time with the one leading the piece, nobody will know about the synchronization issues.

If you're not recording locally, but rather broadcasting the Hangout live, you both need to split the delay equally, so that each of you sings/plays at exactly 1x the one-way time delay ahead of the other person (or ahead of what you hear coming out of your speakers). Actually, you probably need to play 0.5x the latency ahead of what you hear coming out of your speakers, because each connection is routed through Google's servers and then back out to the other person, and it's from Google's servers that the two different video signals are mixed and then broadcast out to the rest of the world.
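To make the arithmetic above concrete, here's a small sketch in Python (the 150 ms one-way delay is a made-up example value -- you'd have to measure your own):

```python
# How far ahead each performer should play, for the two scenarios above.
ONE_WAY_DELAY_MS = 150  # hypothetical measured one-way delay, in milliseconds

# Scenario 1: recording locally on one end. The remote performer hears
# the recording end one trip late, and their own audio takes one trip
# back, so they must lead by two one-way delays.
remote_lead_ms = 2 * ONE_WAY_DELAY_MS

# Scenario 2: broadcasting live, mixed on a server between the two ends.
# Split the delay equally: each performer leads the other by one one-way
# delay, i.e. by half that much relative to what their own speakers play.
lead_vs_other_ms = ONE_WAY_DELAY_MS
lead_vs_speakers_ms = ONE_WAY_DELAY_MS // 2

print(remote_lead_ms, lead_vs_other_ms, lead_vs_speakers_ms)  # 300 150 75
```

In practice you'd estimate the one-way delay as half the round-trip time (e.g. from a ping, plus the Hangout's own processing delay), then rehearse until the lead feels natural.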

I hope this makes sense. There's really no way around this for live hangouts though! (But you might be able to make it work for recordings.)


More on leaving academia

The problem with being a grad student is that the burden of an unfinished thesis or dissertation, or paper deadlines, or assignment deadlines, never goes away. You never feel quite free to walk away from it on nights and weekends. It reminds me of the poem "Something Left Undone":

Labor with what zeal we will, 
Something still remains undone, 
Something uncompleted still 
Waits the rising of the sun. 

By the bedside, on the stair, 
At the threshold, near the gates, 
With its menace or its prayer, 
Like a mendicant it waits; 

Waits, and will not go away; 
Waits, and will not be gainsaid; 
By the cares of yesterday 
Each to-day is heavier made; 

Till at length the burden seems 
Greater than our strength can bear, 
Heavy as the weight of dreams 
Pressing on us everywhere. 

And we stand from day to day, 
Like the dwarfs of times gone by, 
Who, as Northern legends say, 
On their shoulders held the sky. 

--Henry Wadsworth Longfellow

Your career trajectory in academia depends upon building your reputation (through your publication record, service rendered to the Ivory Tower, and awards received). It's a game some people can play well, but it's a hard game to keep playing year after year until your previous successes carry you on to your next successes with very little personal effort beyond showing up to as many meetings and speaking engagements as you can on your completely overbooked calendar while trying to stay sane.

Ultimately I got out of academia because I realized that I had spent seven years on my PhD and postdoc alone -- roughly 8% of my expected lifespan -- and the work didn't feel like it justified 8% of my life. Life is precious.

However, academia is in my blood, in my DNA. I'll be back, I'm sure.


How to change timezone settings on Gmail, Google Calendar etc.

Every time I go back and forth between the East and West coast of the US, I have to google how to change timezones on Google products, because (1) the settings are hard to find, and (2) you have to do no fewer than SIX different things to avoid all sorts of weird timezone-related glitches:
  1. Update the timezone setting for your computer's system clock -- this is important because SOME but not all Google products use the system timezone (e.g. Gmail reads the system time zone, as do some but not all features in Google Calendar). The option to change the timezone can be found by right-clicking on the clock in the system tray and choosing a Preferences or Time/Date Settings option or similar. (In Fedora Linux, you can also type "system-config-date" in a console.) Make sure "System clock uses UTC" is checked so that Daylight Saving Time is handled correctly.
  2. Restart Chrome (or whichever browser you use) -- it doesn't pick up the timezone change of your system clock until it has restarted, even if you log out of your Google accounts and log back in. Gmail will continue to display message timestamps in the old timezone until the browser is restarted, regardless of the timezone setting on your Google account (see the last point below).
  3. In Google Calendar, you have to manually change the display timezone. Go to the gear menu near top right, choose Settings, then under the default tab (General), change the setting "Your current time zone".
  4. If you use Google Voice, it has its own timezone setting too: Go to the gear menu at top right, choose Settings, click the Account tab, and change the Time Zone setting there.
  5. Google Drive now has its own timezone setting too, although it is unset by default (which, I assume, means it uses the system timezone? Or maybe the Google account timezone, described below?): Go to the gear menu at top right, choose Settings, and in the General tab you'll see Time Zone. (I guess if it's unset, leave it unset, hopefully it will use one of the other correctly-set time zones.)
  6. (The really hard one to find): Go to https://security.google.com/settings/security/contactinfo , and under "Other emails", click Edit, which will probably take you to the more obscure-looking URL https://accounts.google.com/b/0/EditUserInfo?hl=en . (You can go directly to that second link, but it looks like the sort of URL that won't be around forever, so I provided the navigation directions.) Here you'll find a timezone setting that, if wrong, can mess with timezone display in strange places in Google products (e.g. the red line showing the current time on Google Calendar may appear at the wrong time, or events are shown at the wrong time, or popup reminders pop up at the wrong time -- but not all of these; I forget which weird effects this particular setting caused). This setting says it's optional, so you can probably select the blank line from the dropdown list, effectively unsetting the account-specific timezone. I'm hoping that one of the other timezone settings above will then be used instead, but I need to test this. 
(Probably Google apps should all read the browser's timezone, and the browser should watch for updates of the system clock and timezone. Hopefully this gets fixed someday...)
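Underlying all of this is the pattern that step 1 hints at: store timestamps in UTC, and convert to a display timezone only at the last moment. A quick illustration using only the Python standard library (`zoneinfo`, Python 3.9+; the date is arbitrary):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# One UTC instant, rendered in the two timezones involved in an
# East <-> West coast move. Only the display changes; the instant doesn't.
instant = datetime(2012, 7, 1, 16, 0, tzinfo=timezone.utc)

eastern = instant.astimezone(ZoneInfo("America/New_York"))
pacific = instant.astimezone(ZoneInfo("America/Los_Angeles"))

print(eastern.strftime("%H:%M"))  # 12:00 (EDT is UTC-4 in July)
print(pacific.strftime("%H:%M"))  # 09:00 (PDT is UTC-7 in July)
```

This is exactly why checking "System clock uses UTC" matters: the conversion step (including Daylight Saving Time) is applied at display time from the timezone database, so the stored time never needs to change.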

I hope this saves somebody some frustration!


On the nouveaux érudits and anti-theism

Since the rise of a few prominent and vocal atheists about five years ago, public anti-theist sentiment has increased dramatically. The nouveaux érudits are successfully introducing into the public subconscious the idea that theism is a liability. These days, everybody I know who believes in God in some way goes about it relatively quietly. The public scoffing at religion is unfortunate, because removing religion from public discourse robs humanity of an important aspect of human cultural identity and diversity. Just as in the natural world, diversity is at the root of the beauty of humanity.

Atheism is as much a belief system as any religion I can think of. Both theism and atheism are beliefs, and choices. Unfortunately, neither theism nor atheism constitutes a falsifiable hypothesis based on tangible, observable evidence, so we should all simply agree to coexist happily while respecting others' beliefs, since arguing, belittling, attacking etc. doesn't actually do anybody any good. I don't understand why some people put so much effort into tearing down others' beliefs.

That applies to both theists and atheists, incidentally -- there is no justification for, or usefulness in, fighting with people or trying to coerce them to believe the way you do in the name of either "saving their soul" or "protecting them from religion". Share what you're excited about by all means, if somebody wants to listen, but respect people's agency, and once they have heard what you claim as fact, let them figure out for themselves whether or not what you are saying is true. You can teach people what you believe without tearing down what they believe. Or as the Qur'an rather ironically says, "Let there be no compulsion in religion -- truth stands out clear from error."

Back to my original point though -- it is sad that the religious people I know generally don't feel that they can present this as part of their public identity anymore.

(In response to Nils Hitze's post on Google+ wondering why most people he follows are atheist. The above is also posted to my G+ feed, so there may be comments there too.)


Results before Methods??

Why do so many journals require you to put results before methods? That's backwards, and it's maddening to have to conform to when writing a paper. You can't explain what you found before you explain how you found it. The requirement also betrays the mindset of many biologists I know, at least (since this pattern is ubiquitous in biology and biomedical journals): that you can safely ignore the contextual assumptions embedded in the methods when you interpret the results.


On leaving academia, and wanting to create "Google X" without the Google part

[ tl;dr version: I'm leaving academia after many years; have big life decisions to make; need a dose of perspective. What's a good next step? What's the likelihood of success of creating some sort of standalone Google X type lab with a few brilliant people and getting it funded in the current climate? ]

So I'm finally biting the bullet and leaving academia, after a nasty realization that the powers that be (the hand that was supposed to feed me, NSF, cough) -- in conversations right to the top of the hierarchy -- have insufficient technical understanding to tell sound ideas from the rest, and insufficient foresight to take a risk on funding potentially revolutionary ideas when there's an evolutionary idea from a good friend of theirs also submitted in the same round of proposals that lets them check off their keyword boxes.

As a result, I have started gingerly digging through some job listings on both coasts. When you've been in academia as long as I have (and had it drummed into you that "you will never amount to anything in this world, or accomplish anything much, if you leave academia -- and you'll never come back if you leave"), it's hard to look forward to being a code monkey. I have, however, worked as a software engineer several times, for several different companies, so it's not like I have never had a "real job", but I figured I should ask for a dose of perspective. Tell me that academia has been lying to me all these years :-)

Really, I know that the perception that academia is the be-all and end-all of innovation and world-saving is for the most part false, or at least myopic, and nobody outside academia really sees it that way, even though some cool research comes out of all the major institutions each year. Academia is certainly not the be-all and end-all of wealth creation. Peter Diamandis (whom I know from Singularity University) once expressed complete disbelief that I would even consider a career in academia: "You'll amount to 1/10th of your potential if you stay in academia." Deep down I know he's right, even if I still have a strong urge to at least keep my foot in the door.

Anyway, I also have a strong entrepreneurial streak (I have been compiling a doc over the last few years that now contains hundreds of different ideas, some of which might even succeed to some degree if the execution is good), but also a strong dislike for business operations (although I have been involved in a small startup before). There seems to be a lot of funding out there right now, and a resurgence of new ideas, as well as more risk-taking than the industry has seen for a decade or more. I think this might be the right time to jump into entrepreneurship rather than working for the man.

So, I guess the questions I have for anybody that stumbles across this post are:

(1) Does anybody have experience with leaving academia and getting back into it after having accomplished something useful in industry? Is it even worth trying to keep that option open if my greatest interests are all heavily research-oriented (and when I would go crazy with mundane coding), or are there all-round better alternatives? Does anybody have experience working somewhere like PARC or one of the Intel research labs?

(2) How hard would it be to create a new research lab with a few top-notch guys, as a standalone "skunkworks" type lab but based outside of any organization or company with deep pockets, and get it funded today? (i.e. something like the Google X lab, but run outside of Google -- this would obviously be a big investment risk if the focus is even partly on blue-skies projects.)  Is working with the type of employee that is frequently attracted to that sort of working environment (i.e. top 1% of engineers, theorists, inventive types etc. with all the crazy ideas and the audacity to think that they might be able to make them work) like herding cats? Has anybody worked at Google X or in a similar moonshot research lab?

I guess I'm just trying to figure out where I can make the biggest difference. Or maybe even thinking that way is a delusional after-effect of standing on the Ivory Tower :-)

PS in case anybody is hiring...  http://resume.lukehutch.com/


The multicore dilemma (in the big data era) is worse than you think

Cross-posted from the Flow blog:

I'm currently applying for a grant to work on the Flow programming language. I thought I would post the following excerpt from the abstract and the background section, particularly because it presents a number of aspects of the multicore dilemma that are not discussed frequently enough.


The arrival of the big data era almost exactly coincided with a plateau in per-core CPU speeds and the beginning of the multicore era, and yet, as recently stated by the National Research Council, no compelling software model for sustaining growth in parallel computing has yet emerged. The multicore dilemma, therefore, presents the biggest near-term challenge to progress in big data. We argue that the multicore dilemma is actually a substantially worse problem than generally understood: we are headed not just for an era of proportionately slower software, but significantly buggier software, as the human inability to write good parallel code is combined with the widespread need to use available CPU resources and the substantial increase in the number of scientists with no CS background having to write code to get their job done. The only acceptable long-term solution to this problem is implicit parallelization -- the automatic parallelization of programmer code with close to zero programmer effort. However, it is uncomputable in the general case to automatically parallelize imperative programming languages, because the data dependency graph of a program can’t be precisely determined without actually running the code, yet most “mere mortal” programmers are not able to be productive in functional programming languages (which are, in fact, implicitly parallel). We need a new programming model that allows programmers to work in a subset of the imperative style while allowing the compiler to safely automatically parallelize our code.


The arrival of the big data era has almost exactly coincided with a plateau in per-core CPU speeds [see figure below]. This may be the biggest near-term challenge in the big data era.

The intent of the background provided here is not to simply rehash what is already widely known about the apparent demise of Moore’s Law or the arrival of the big data era, but to describe an uncommonly-discussed set of issues that arise due to the co-arrival of these two phenomena, and to demonstrate the vital importance in this context of lattice-based computing, our specific proposed method for solving the multicore dilemma, which safely adds automatic parallelization to a maximal subset of the imperative programming paradigm in order to be accessible to the average programmer.

Processor performance from 1986 to 2008 as measured by the benchmark suite SPECint2000 and consensus targets from the International Technology Roadmap for Semiconductors for 2009 to 2020. The vertical scale is logarithmic. A break in the growth rate at around 2004 can be seen. Before 2004, processor performance was growing by a factor of about 100 per decade; since 2004, processor performance has been growing and is forecasted to grow by a factor of only about 2 per decade. Source: [Fuller & Millett 2010]

We can’t “solve” big data without solving the multicore dilemma

The only acceptable path forward for big data is multicore (and cluster) computing, therefore problems in big data cannot be separated from the multicore dilemma [1]: the only way to handle an exponential increase in data generated each year is to exponentially increase our efficient usage of parallel CPU core resources.

We can’t solve the multicore dilemma without solving automatic parallelization

A great many techniques have been developed for parallel computing; however, all of them are too error-prone, too unnatural for the programmer to use without significant design effort, require too much boilerplate code, or require too much computer science expertise to use well. The recent National Research Council report, “The Future of Computing Performance: Game Over or Next Level?” by the Committee on Sustaining Growth in Computing Performance stated: "Finding: There is no known alternative for sustaining growth in computing performance; however, no compelling programming paradigms for general parallel systems have yet emerged." [Fuller & Millett 2010, p.81]

Ultimately, no matter what kinds of parallel computing paradigms may be created [2], they will still be paradigms, which will require programmers to adapt their own algorithms to fit within them. Long-term, the only sustainable approach to parallelization is one that requires close to zero programmer effort: “Write once, parallelize anywhere”.
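As an illustrative aside (not part of the proposal text), the data-dependency point can be made concrete in a few lines of Python. Iterations whose cross-iteration dependency graph is empty can be run in any order, which is what licenses automatic parallelization; a loop-carried dependency forbids it:

```python
data = [3, 1, 4, 1, 5, 9, 2, 6]

# Independent iterations: result[i] depends only on data[i]. A compiler
# that can prove this is free to run the iterations in any order (or in
# parallel) -- simulated here by computing in reverse and comparing.
squares = [x * x for x in data]
squares_any_order = list(reversed([x * x for x in reversed(data)]))
assert squares == squares_any_order  # order doesn't matter: parallelizable

# Loop-carried dependency: totals[i] depends on totals[i-1], so these
# iterations cannot simply be reordered or run concurrently as written.
totals = []
running = 0
for x in data:
    running += x
    totals.append(running)

print(squares)  # [9, 1, 16, 1, 25, 81, 4, 36]
print(totals)   # [3, 4, 8, 9, 14, 23, 25, 31]
```

In an imperative language, distinguishing these two cases automatically requires precisely determining the dependency graph, which is what makes general automatic parallelization uncomputable; the proposal's aim is a language subset in which the distinction is decidable.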

The lead authors of the NRC report observe [Fuller & Millett 2011] that “The intellectual keystone of this endeavor is rethinking programming models so that programmers can express application parallelism naturally. This will let parallel software be developed for diverse systems rather than specific configurations, and let system software deal with balancing computation and minimizing communication among multiple computational units.”

Without finding the appropriate computing model for generalized automatic parallelization, some great challenges lie ahead in the big data era. In fact, if we plan to simply rely on explicit programmer parallelization, the problems that arise from simply trying to make use of the available cores may prove significantly worse than the lack of efficiency of single-threaded code, as described below.

The multicore dilemma meets big data: incidental casualties

It is widely accepted that we are currently facing not just the apparent imminent “demise of Moore’s Law” [3], but also the arrival of the big data era. However, there are several important ramifications of the confluence of these two phenomena that have been widely overlooked:

  1. We are facing an impending software train-wreck: Very few people have considered the fact that as per-core speeds plateau, we are not just headed towards a plateau in software speed: in our desire to continue the progress of Moore's Law, the human inability to write good multithreaded code is actually leading us towards an era of significantly buggier software. If we do not solve the multicore dilemma soon, many of the 99.99% of programmers who should never be writing multithreaded code will need to do so, due to the arrival of massive quantities of data in all fields, potentially leading to a massive software train-wreck.
  2. We are facing an impending software morass: The world may also be headed for an unprecedented software morass, as all programmers and software development teams that care about performance begin to get mired in the details of having to multithread or otherwise parallelize their code. All current platforms, libraries and techniques for making use of multiple cores introduce the following issues:
      1. significant overhead in terms of programmer effort to write the code in the first place -- the data dependencies inherent in the problem must be determined manually by the programmer, and then the programmer's design must be shoehorned into the parallelization model afforded by the framework or tools being employed;
      2. significantly increased complexity to debug, maintain and extend a multithreaded application once developed, due to increased boilerplate code, increased logical complexity, increased opportunity for introducing very hard-to-trace bugs (race conditions and deadlocks), and due to the fact that the presence of a significant amount of parallelization code, structured in an unnatural way, obscures the original programmer's intent and the design of the program;
      3. a significantly longer iterative development cycle time, which can cause large-scale projects to break down and ultimately fail; and
      4. significant extra economic cost for employing teams to write and maintain parallel code.
  3. We urgently need automatic parallelization for the hordes of "mere mortal" programmers that are coming online: We urgently need a programming language for the masses that does not require a background in parallel and distributed systems. The world has not only a lot more data today but, since everybody now has a big data problem, a lot more programmers -- of all skill sets and skill levels, and from all fields -- are being required to parallelize their code. Furthermore, because almost all major breakthroughs in science today happen at the intersection between different disciplines, many programmers with little or no formal CS background (and therefore no training on the complex issues inherent in concurrency) are writing code, and need the benefit of automatic parallelization to make use of available CPU resources to decrease runtime. Most programmers in the life sciences end up cobbling together shell scripts as makeshift work queues to launch enough jobs in parallel to make use of CPU resources. It will soon no longer be possible to do biology without doing computational biology, or physics without doing computational physics, etc. -- and it will soon no longer be possible to do computational science of any form without making use of all available cores. The NRC report gave the following advice:

     "Recommendation: Invest in research and development of programming methods that will enable efficient use of parallel systems not only by parallel systems experts but also by typical programmers" [Fuller & Millett 2010, p.99]
  4. After the big data era will come "the messy data era": specifically, the era when the focus of research effort is not on how to deal with the quantity of data being generated, but on its quality or qualities. Data types are becoming increasingly diverse, increasingly varied in numbers and types of attribute, increasingly sparse, increasingly cross-linked with other disparate datasets, and increasingly social in nature. We will not be free to attack problems of data quality if the answer to our data quantity problems is "use this library". The only acceptable answer to the multicore dilemma is, "Build a smarter compiler that will automatically parallelize your code and then get out of the way, allowing programmers to get on with more important issues".
  5. Slowdowns in the progress of Moore's Law will cause bottlenecks in progress in other exponentially-progressing, data-intensive fields: Many fields of science and technology are experiencing exponential growth or progress of some form. Nowhere is this more obvious than in the cost of genome sequencing, which was already decreasing exponentially at a rate comparable to Moore's Law, but began to drop at an even more precipitous rate with the advent of next-gen sequencing techniques [see figure below]. However, within the last couple of years, next-gen sequencing has begun to produce so much data that our ability to store and analyze it has become a huge bottleneck. This is likely to be a major factor in the fast return of the sequencing cost curve back to the usual Moore's Law gradient, as pictured. All data-intensive fields will be rate-limited by our ability to process the data, and until the multicore dilemma is solved, the degree of rate limiting will be significantly worse than it should or could be.
DNA Sequencing costs: Data from the NHGRI Large-Scale Genome Sequencing Program. [NHGRI 2012]

The solution to the multicore dilemma: “the sufficiently smart compiler”

It is clear that the multicore dilemma is one of the most pressing issues directly or indirectly impacting all of scientific research today, and that a robust solution must be found as quickly as possible. Paul Graham recently wrote [Graham 2012]:

"Bring Back Moore's Law: The last 10 years have reminded us what Moore's Law actually says. Till about 2002 you could safely misinterpret it as promising that clock speeds would double every 18 months. Actually what it says is that circuit densities will double every 18 months. It used to seem pedantic to point that out. Not any more. Intel can no longer give us faster CPUs, just more of them.

"This Moore's Law is not as good as the old one. Moore's Law used to mean that if your software was slow, all you had to do was wait, and the inexorable progress of hardware would solve your problems. Now if your software is slow you have to rewrite it to do more things in parallel, which is a lot more work than waiting.

"It would be great if a startup could give us something of the old Moore's Law back, by writing software that could make a large number of CPUs look to the developer like one very fast CPU. There are several ways to approach this problem. The most ambitious is to try to do it automatically: to write a compiler that will parallelize our code for us. There's a name for this compiler, the sufficiently smart compiler, and it is a byword for impossibility. But is it really impossible? Is there no configuration of the bits in memory of a present day computer that is this compiler? If you really think so, you should try to prove it, because that would be an interesting result. And if it's not impossible but simply very hard, it might be worth trying to write it. The expected value would be high even if the chance of succeeding was low." 
[emphasis added]

Mary Hall et al. [Hall 2009] furthermore observe that “exploiting large-scale parallel hardware will be essential for improving an application’s performance or its capabilities in terms of execution speed and power consumption. The challenge for compiler research is how to enable the exploitation of the [processing] power of the target machine, including its parallelism, without undue programmer effort.”

The “sufficiently smart compiler” cannot actually exist (for modern imperative programming languages)

Functional programming languages are implicitly parallelizable, because calls to pure functions have no side effects. There have been many efforts to partially or fully parallelize imperative programming languages, such as OpenMP’s annotations for C/C++, with varying degrees of success and varying levels of guarantee as to the safety of the generated code. Generally, the compiler must limit actual automatic parallelization to the small subset of operations that it can determine, with high confidence, are safe to execute in parallel; a much larger set of operations could be parallelized, but only at much greater risk of introducing race conditions or deadlocks into the code.

It turns out to be not just hard, but uncomputable, to automatically parallelize an imperative programming language in the general case, because the actual data dependency graph of an imperative program (the graph of which values are used to compute which other values) cannot be known until runtime, when values are actually read from variables or memory locations. Without actually running the code, static analysis can only guess at the origin or identity of the specific values that might be read from a variable at runtime; hence the uncomputability of static data dependency analysis in the general case. (For example, reducing a program to Static Single Assignment (SSA) form gives no guarantee that values won’t change underneath your feet at runtime, due to pointer aliasing etc.) Therefore, the sufficiently smart compiler cannot actually exist for modern imperative programming languages.

We propose a completely new programming language model for which the “sufficiently smart compiler” can be created

Although imperative languages cannot be automatically parallelized, the solution is not to redirect our efforts toward automatically parallelizing pure functional languages, because most “mere mortal” programmers are not productive in them. We need to create a new class of programming languages representing the maximal subset of the imperative programming model that still permits automatic parallelization, so that the language “feels” normal to average programmers, yet yields safely parallelized code with close to zero effort.

Restated, we must find the axiomatic reason why imperative programming languages are not automatically parallelizable, and produce a language that feels as imperative as possible, but that is properly and minimally constrained so that all valid programs may be automatically and safely parallelized by the compiler.

In this proposal we present a necessary and sufficient definition of a language model that satisfies these constraints, and we propose the development of a compiler to implement it.

  1. [Fuller & Millett 2010] -- Fuller & Millett (Eds.), The Future of Computing Performance: Game Over or Next Level? Committee on Sustaining Growth in Computing Performance, National Research Council, The National Academies Press, 2010.
  2. [Fuller & Millett 2011] -- Fuller & Millett, Computing Performance: Game Over or Next Level? IEEE Computer, January 2011 cover feature.
  3. [Graham 2012] -- Paul Graham, Frighteningly Ambitious Startup Ideas, March 2012.
  4. [Hall 2009] -- Mary Hall, David Padua, and Keshav Pingali, Compiler Research: The Next 50 Years, Communications of the ACM 52(2): 60-67, 2009.
  5. [NHGRI 2012] -- DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. NHGRI, last updated January 2012.

[1] The multicore dilemma can be defined as the combination of three factors: (1) the so-called “end of Moore’s Law” (as we hit up against several limits of the laws of physics in trying to increase per-core speeds), (2) the wide availability of multi-core CPUs as we begin “the multicore era”, as Moore’s Law continues as an exponential increase in the number of cores, rather than per-core speed, and (3) the lack of good tools, languages and techniques for reliably and easily making use of multiple CPU core resources without significant effort or difficulty.

[2] Examples of current paradigms include MapReduce/Hadoop; the abstractions in a concurrency framework like java.lang.concurrent; various forms of message passing (such as in Erlang) / actors; channels; co-routines; futures; software transactional memory; array programming; etc.

[3] Meaning the demise of what is called Moore’s Law in the common vernacular (that CPU speeds double approximately every two years) rather than the original form of Moore’s Law, which holds that the number of transistors that can be placed inexpensively on a chip doubles approximately every two years. Moore’s Law in its original form should continue to advance like clockwork for the next few years at least, but the same increase in the number of transistors on a die is now primarily due to an increase in the number of cores.


Darwin was (half) wrong

Slashdot ran a story, "The science of handedness". I'm pretty sick of reading this sort of thoughtless writing about evolutionary biases. If you're going to say that an adaptation gives a reproductive or predatory advantage, then fine, you're talking Darwinian evolution -- survival of the fittest -- and that's pretty trivially easy to show, even in a lab. But if you're going to say, "Everybody in chummy societies had the same handedness so they could share tools", then please tell me how that weak-sauce, tiny (or effectively zero-magnitude) biological fitness bias is supposed to have produced a genotypic change in an entire species within the known anthropological lifetime of the species. Remember that Darwinism requires that, for your random trait variation to survive and thrive, at a minimum you have to pass your genes on while somebody else does not. So I have to wonder whether the authors "believe in" Darwinism or not.

This gets at what I think is a much bigger issue: fundamental to Darwinism is not just survival of the fittest, but also randomness. I think that true biological evolution -- what's actually happening in the real world -- is not Darwinism, because it is very non-random. It is inconceivable that the complexities of the human organism, or any other for that matter, could have occurred by chance via a random walk through the state-space of possible genetic mutations (many of which could easily give rise to non-viability) in the number of generations since the major forks in the tree of life. There just isn't enough time, enough generations. There isn't sufficient evidence of non-viability, through miscarriage etc., for the worst mutations to die out -- and there isn't sufficient evidence that most traits that are said to evolve through "survival of the fittest" actually gave the possessor of that attribute an actual survival advantage, a reproductive advantage or an advantage as a predator, at the expense of those that did not possess that attribute.

What's the alternative? Even setting religious issues completely aside, personally I think that built into every feed-forward mechanism in biology, crossing back across vast numbers of levels of emergent complexity, are corresponding feed-BACK mechanisms (actually, back-propagation mechanisms, to use the machine learning terminology) such that a system's biology -- and even its genome -- can "learn" in response to environmental stimuli. Everything we have come to understand about learning and optimization from the field of machine learning supports the hypothesis that to learn anything, at any appreciable rate, you must introduce feedback loops that back-propagate the error between expected and observed in some way such that the model can be updated to reduce the error for future predictions. In other words, mutations (and epigenetics) are very NON-random, driven by the environment and by life-experiences and even by the conscious choices of the host organism. This is much less about Lamarckism (although epigenetics is proving Lamarck was pretty much right) and much more about *directed* evolution (i.e. evolution being a biological learning and optimization problem).

In summary, I claim that Darwin was (half) wrong: evolution is about fitness, but optimizing for a given fitness function is not necessarily a random walk.


How to legally work in the US as a student

Somebody asked me how to gain work authorization in the US as a student. I have lived here for 11 years on student (F-1) and working (H-1B) visas. Here is the quick summary I sent back about what I have learned:


During your time as a student on an F-1 visa, you can work 20 hours a week but it has to be on-campus. The best sort of job is a research assistantship or teaching assistantship, since the college will often also pay your tuition.

As far as working off-campus: the easiest thing is to apply for one-year CPT (Curricular Practical Training) during your degree, and/or one-year OPT (Optional Practical Training) after graduation. If you are doing a STEM degree (Science, Technology, Engineering, Mathematics), then you can apply for a 17-month extension once your 12-month OPT is finished. CPT and OPT give you an actual EAD (Employment Authorization Document - a card) which gives you employment authorization, so now you can work more than 20 hours a week and you can work off-campus. You can even work multiple jobs, one or more of which can be self-employment.

After OPT, the easiest thing to do is get an H-1B to work at a specific company. (Or marry a US citizen.) Generally a company will sponsor you for two 3-year H-1B visas before sponsoring you for a green card. You can't apply for a green card directly from an F-1 visa, because it is a "non-immigrant intent" visa; you can apply directly from an H-1B, because it is a "dual-intent" visa.

Another option is to apply for an O-1 visa ("alien of extraordinary ability") if you have some major award or accomplishment equivalent to national recognition in the US. However, an O-1 has non-immigrant intent. If you ever want a green card, the better option is an EB-1 visa, which is dual-intent. You can then apply for a green card under the National Interest Waiver program, which fast-tracks extraordinary aliens to a green card if you can prove that granting one is in the United States' interest. NIW qualification can be proved by a string of high-profile publications in top journals, among other things.

There's also the green card "diversity lottery", which you should apply for every single year: it doesn't count against you as an attempt to gain citizenship when they ask you at the border whether you have ever sought citizenship while on your F-1 "non-immigrant intent" visa (you don't have to declare DV lottery entries): https://www.dvlottery.state.gov/ (Beware of all sites other than this one; some will charge you $10 to apply, while this one, the official site, is free.) Applications open later in the year. Your chances of getting a diversity lottery green card range from quite high to vanishingly small, depending on what country you are from.

There is also a startup founder visa that has been proposed, and the bill is going through the system right now. You have to employ a certain number of Americans, and bring in a certain amount of funding, within the first 1-2 year period for the visa to be renewed.

Here is info on the startup visa act. Everybody should consider supporting this.


The last link also talks about the possibility of working on an EB visa. Notably, EB-5 is a category for people investing a lot of money in the US. This can get you in the door if you can afford it.


Update: JBQ posted the following comments on my G+ post that links to this blog post:

Chances are, if you'd qualify for an O-1, you'd also be EB1 in the green card process and that'd probably be a very easy path.

Also, there's no benefit in waiting to apply for a green card when starting a new H-1B job. It just delays the priority date and therefore the waiting time.

Finally, the total experience doesn't matter when converting an H-1B to a green card, what matters is the experience when getting hired. Switching companies can be beneficial, as there can be enough experience to move from EB3 to EB2, while keeping the priority date. It's best to do that with more than a year left before the 6-year line, and it's best to do that with a priority date far enough in the past.

IIRC there are no specific requirements to apply for a green card.

As it was for me, the process starts with a certification by DOLETA that there are no citizens or residents to fill the job (similar to an H-1B, but a bit more thorough). Applying for this also sets the priority date.

Once that's done (a few weeks), the next step is to apply for the visa petition (proving that the employee is qualified and that the employer can pay them). That's also similar to an H-1B. That can take a few months IIRC. This is the I-140. Don't wait long as the labor certification is only valid for a limited period.

Once that's done, the next step is to wait for an available visa. There are 3 waiting lists per country of citizenship, based on the skill level. The waiting lists are shorter for the categories with the highest skills. Each list is represented by a cutoff date, and if your priority date is earlier than the cutoff date for the category you're in you're eligible for the next step.

Finally, once a spot is available, applying for an I-485 adjustment of status turns the H-1B into a green card. That also takes a few months IIRC.


Intolerance: criticizing what somebody *is*, not the ideas they believe in

Somebody just asked the following question to the csail-related mailing list about the appropriateness of political slurs during public talks:

On Wed, Mar 14, 2012 at 9:19 PM, IJ wrote:
I was wondering what people in CSAIL think about speakers including gratuitous political insults in their talks.

I was at a talk last week at HMS about systems-based analysis of disease. The speaker, Joseph Loscalzo (Chairman of Medicine at the Brigham), said that before Hippocrates people thought illness was caused by evil spirits. He then added that this view is shared by Republicans.

Coming from a background in industry where one often encounters very nice, very intelligent people of all political leanings, I found it shocking that the speaker would be so unprofessional as to insult people who, for whatever reason, have a political affiliation different from his own. Still worse was his subsequent joke that there might "even" be some Republicans in the audience, with its presumption that all or almost all of his audience must share his political views. I thought he would next suggest that if we spot one of these Republicans we might examine him or her as an interesting specimen!

As discussed in this New York Times article, and borne out by my own acquaintances, academia can be a hostile environment to people who are not liberal Democrats (www.nytimes.com/2011/02/08/science/08tier.html). The researcher mentions how these non-liberals remind him of closeted gay students in the 1980's, how they "hid their feelings when colleagues made political small talk and jokes predicated on the assumption that everyone was a liberal." I know of medical students who were justifiably afraid that they would be discriminated against if their political affiliations were "outed".

I hope that in CSAIL we would not tolerate remarks like these that create a hostile environment for any of our members, students, or guests, whether they are women, gays, or even Republicans.

Does CSAIL have a policy on this?

Should it?

RMS jumped in with the following:

On Thu, Mar 15, 2012 at 12:22 AM, Richard Stallman wrote:
People are responsible for their opinions; criticizing and even condemning political opinions is a normal part of political discourse.

Here's my take on it:

There is a difference between criticizing a theory that somebody holds and criticizing something that a person *is*. The former is the foundation of academic discourse, and theories must be able to withstand scrutiny to be of value. The latter -- criticizing what somebody is, whether they haven't chosen it (in the case of phenotypic attributes: skin color, gender, etc.) or have chosen it, by culture or agency (religion, political orientation, etc.) -- is intolerance. In the context of the original poster's situation, it is OK to criticize a theory about how the government should be run, and to subject that theory to academic discourse about its relative merits or lack thereof. It's not OK to poke fun at somebody's *identity* as a person who follows a given ideology ("one of those people", in label-speak).

I really don't like the word "intolerance" however: the grand irony is that much of the time that this word is used, it is used to superciliously indicate that another person's views are quaint, and not broad enough to include one's own views. Whether or not that is true, unless one party is being harmed, tolerance must be extended in both directions, or a claim of intolerance is plain hypocrisy.

The word "intolerance" seems to have therefore lost a lot of its real meaning, because it is often used in this self-serving way. Is there a less worn-out word available for use than "intolerance"? (I guess this is why policies about this generally refer to harassment, not intolerance, because they deal with cases where one party is in fact harmed?)

Stephen Wolfram, Quantified Self and LifeScope

Stephen Wolfram's recent blog post that presents a visualization of a lot of his life data is opening a lot of eyes. I submitted an app to the original Android Developer Challenge back in 2008 that produced many of the same exact data plots. I called it LifeScope. It contains its own custom graph plotting library and a completely generic data handling backend that you should be able to plug a range of different data sources into.

Stephen Wolfram's blog post has inspired me to clean up my code and get the app out. Please leave a comment if you would find this app useful, and let me know what data sources in your life you would be interested in plotting.


On life, death and so-called "brain death"

Cross-posted from my reply to the following TED Conversation: "How does life/death manifest itself in the human brain? Is brain death the ultimate end stage of life?"

There is something to the fact that, to maintain a living state, the brain requires a pattern of oscillatory activity with the power distributed in certain frequency bands according to the type of activity that the brain is engaging in. (See Rhythms of the Brain by G. Buzsáki.) However, even after a massive epileptic seizure, which typically indicates a widespread state of electrical noise, the brain is usually able to recover these baseline rhythms.

I think in persistent vegetative state, the brain has very little normal electrical activity, but still, the activity is non-zero -- and the brain appears able to wake itself in some cases. There are stories of people waking up from PVS after several years. It's also curious that you can keep a person's body alive for a long time after their brain is declared "dead" as long as you keep blood flowing and oxygen and nutrients at the right levels. Personally I think that implies the organism couldn't really be declared dead to start with. I don't think it's possible to accurately declare an organism dead until rigor mortis sets in and its microbiome begins to consume it -- in my opinion, decay and the succumbing to entropy is the only true sign of death -- and these forces are set in motion very quickly once an organism "actually dies".

Note that recent research has shown that administering an intravenous dose of Ritalin to a comatose mouse can cause the mouse to wake up almost instantly. They have yet to start human trials, but this may hold real hope for "rebooting the brain".

How long should we keep a PVS patient alive for though? Is it worth 20 years of stress on the family and untold cost of life support? I don't know, but I would say that we need a better understanding of the types of baseline electrical situations from which the brain is able to reboot before we can authoritatively say we know that a patient is actually "brain dead", i.e. beyond the chance of recovery.


KNIME -- the Swiss-army knife of data workflow

I just discovered the following Swiss-army knife for handling data workflow. I have wanted something like this to exist for years, and was getting close to starting my own project with almost identical goals, but I guess I don't have to now:

KNIME lets you build a data analysis pipeline, complete with data normalization and filtering, inference/classification and visualization steps. It caches data at each node in the workflow (so changes to the pipeline only result in the minimum necessary recalculation), and keeps track of which experimental variables produced which results. It intelligently makes use of multiple cores on your machine wherever possible. It incorporates the entire Weka machine learning framework. It lets you add your own visualizers for different data types. It cross-links the highlighting of data points between different tables and views, so that if you select a data point in one view, it selects it in all other views. It reads and writes a large number of different data formats and can read from / write to a live database. You can call out to R at any point if you have existing R code you need to run on a piece of data.

In other words, KNIME basically does everything that anybody who works with data does every day, and keeps everything tied together in a nice workflow framework with built-in data visualization, smart caching, smart parallel job launching, etc.