The Multicore Dilemma (in the big data era) is worse than you think

The arrival of the big data era almost exactly coincided with the plateau in per-core CPU speeds around 2004 and the beginning of the multicore era. Yet as recently as 2010, the National Research Council found that "There is no known alternative for sustaining growth in computing performance; however, no compelling programming paradigms for general parallel systems have yet emerged." [Fuller & Millett (Eds.), The Future of Computing Performance: Game Over or Next Level? Committee on Sustaining Growth in Computing Performance, National Research Council, The National Academies Press, 2011]. Meanwhile, as every field of scientific research and every major business endeavor begins to generate more data than can easily be analyzed on a single CPU, it has become evident that this "multicore dilemma" is impeding progress across science and business alike. The multicore dilemma therefore presents the biggest near-term challenge to progress in computing.

Processor performance from 1986 to 2008 as measured by the benchmark suite SPECint2000, and consensus targets from the International Technology Roadmap for Semiconductors for 2009 to 2020. The vertical scale is logarithmic, so a straight-line trend indicates exponential growth. A sharp change in the growth rate can be seen around 2004. Before 2004, processor performance was growing by a factor of about 100 per decade; since 2004, processor performance has been growing, and is forecasted to grow, by a factor of only about 2 per decade. Source: [Fuller & Millett 2011]

The coining of the term "big data" reflects the fact that the exponential growth in data generated per year, underway for many decades, finally crossed a threshold at which people sat up and started paying attention. This happened across a wide range of fields over a short period of time, including at the leading edge of most fields of scientific research, where many scientists (who have no training in parallel and distributed systems) now have to spend a significant proportion of their time figuring out how to parallelize their analysis code. The difficulty of parallelizing code is impeding progress in all major fields of science, and the seriousness of this problem cannot be overstated.

DNA Sequencing costs: Data from the NHGRI Large-Scale Genome Sequencing Program

Ultimately, no matter what kinds of parallel computing paradigms may be created, if they require conscious engineering to use, they will impose cognitive overhead on programmers and demand significant knowledge and expertise that most non-specialist programmers (e.g. medical researchers), who must write parallel code just to get their jobs done, do not possess. In the long term, the only sustainable approach to parallelization is one that requires close to zero programmer effort: “Write once, parallelize anywhere”.

Figuring out what to parallelize, and how to parallelize it, is not a problem we should be assigning to humans. The lead authors of the NRC report further observe that “The intellectual keystone of this endeavor is rethinking programming models so that programmers can express application parallelism naturally. This will let parallel software be developed for diverse systems rather than specific configurations, and let system software deal with balancing computation and minimizing communication among multiple computational units.” [Fuller & Millett 2011] Unless a better model for automatic parallelization is found, great challenges lie ahead in the big data era.

The multicore dilemma meets big data: incidental casualties

There are several important ramifications of the confluence of "the end of Moore's Law" (as it is traditionally interpreted) and the beginning of the big data era:

  1. We are facing an impending software train-wreck. Concurrency introduces whole new classes of potential bugs, and the human brain is notoriously bad at writing safe parallel code. As per-core speeds plateau, we are not just headed towards a plateau in software speed, but an era of significantly buggier software.
  2. We are facing an impending software morass. Parallel code, using current parallelization paradigms, is not just harder to write, it is more complex, and therefore harder to understand. It is very easy to get mired in a spaghetti of parallel code. All current platforms, libraries and techniques for parallel processing introduce significant issues:
      1. Parallel programming requires additional programmer effort. The data dependencies inherent in a problem must be determined manually by the programmer in order to design a parallelization strategy, and the implementation must be shoehorned into some parallelization paradigm appropriate for a specific target architecture.
      2. Parallel programming makes debugging, maintaining and extending code significantly more difficult. It produces boilerplate code, increases the likelihood of introducing hard-to-trace bugs, and increases the complexity of program structure.
      3. Parallel programming can result in longer development times. For large-scale projects, reining in the added complexity of parallelization can be intractable, leading either to the failure of the project or to the abandonment of parallelization.
      4. Parallel programming can incur significant extra cost. Programmers who have a solid grasp of parallel programming are rare and expensive.
  3. We urgently need automatic parallelization for “mere mortal” programmers. We urgently need a programming language for the masses that does not require a background in concurrency or parallel and distributed systems. Increasingly, programmers of all skill levels, and from all fields, are required to parallelize their code. Many programmers with little or no formal CS background (e.g. biomedical researchers) are writing code that targets parallel architectures, for example using a series of duct-taped Bash scripts to launch R and MATLAB jobs in a work queue on an NIH-funded compute cluster. It will soon no longer be possible to do biology without doing computational biology, or physics without doing computational physics, and dataset sizes in many fields of science long ago outstripped what can be quickly computed on one processor. The NRC report gave the following advice: "Recommendation: Invest in research and development of programming methods that will enable efficient use of parallel systems not only by parallel systems experts but also by typical programmers." [Fuller & Millett 2011]
  4. After the big data era will come “the messy data era”. Data types are becoming increasingly diverse, increasingly varied in numbers and types of attributes, increasingly sparse, increasingly cross-linked with other disparate datasets, and increasingly social in nature. To cope, we need a smarter compiler that will automatically parallelize our code and then get out of the way, relieving us of the cognitive burden of figuring out how to parallelize it and allowing us to focus on the specific problem we are trying to gain insight into.
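The "additional programmer effort" point above can be made concrete with a minimal sketch (Python is used purely for illustration; the function names are hypothetical). The computation itself is one line; everything else is hand-written machinery for partitioning the input, spawning and joining workers, and merging partial results. (Because of Python's global interpreter lock, the threaded version gains no real speedup; the point here is the boilerplate, not the performance.)

```python
import threading

def sum_of_squares(data):
    # The entirety of the programmer's *intent*:
    return sum(x * x for x in data)

def parallel_sum_of_squares(data, workers=4):
    # ...and the machinery required to parallelize it by hand.
    # None of this machinery is part of the problem being solved.
    data = list(data)
    chunk = (len(data) + workers - 1) // workers
    partials = [0] * workers

    def work(i):
        # Each worker handles one manually chosen slice of the input.
        partials[i] = sum_of_squares(data[i * chunk:(i + 1) * chunk])

    threads = [threading.Thread(target=work, args=(i,))
               for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Manual merge of partial results.
    return sum(partials)

assert parallel_sum_of_squares(range(1000)) == sum_of_squares(range(1000))
```

Every decision embodied here (chunk size, number of workers, how to merge) is exactly the kind of decision a sufficiently smart compiler would make for us.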

The solution to the multicore dilemma: “the sufficiently smart compiler”

It is clear that the multicore dilemma is one of the most pressing issues directly or indirectly impacting all of scientific research today, and that a robust solution must be found as quickly as possible. Paul Graham recently wrote:

"Bring Back Moore's Law: The last 10 years have reminded us what Moore's Law actually says. Till about 2002 you could safely misinterpret it as promising that clock speeds would double every 18 months. Actually what it says is that circuit densities will double every 18 months. It used to seem pedantic to point that out. Not any more. Intel can no longer give us faster CPUs, just more of them.

"This Moore's Law is not as good as the old one. Moore's Law used to mean that if your software was slow, all you had to do was wait, and the inexorable progress of hardware would solve your problems. Now if your software is slow you have to rewrite it to do more things in parallel, which is a lot more work than waiting.

"It would be great if a startup could give us something of the old Moore's Law back, by writing software that could make a large number of CPUs look to the developer like one very fast CPU. There are several ways to approach this problem. The most ambitious is to try to do it automatically: to write a compiler that will parallelize our code for us. There's a name for this compiler, the sufficiently smart compiler, and it is a byword for impossibility. But is it really impossible? Is there no configuration of the bits in memory of a present day computer that is this compiler? If you really think so, you should try to prove it, because that would be an interesting result. And if it's not impossible but simply very hard, it might be worth trying to write it. The expected value would be high even if the chance of succeeding was low."

[Paul Graham, Frighteningly Ambitious Startup Ideas #6, March 2012]

It has furthermore been observed that “exploiting large-scale parallel hardware will be essential for improving an application’s performance or its capabilities in terms of execution speed and power consumption. The challenge for compiler research is how to enable the exploitation of the [processing] power of the target machine, including its parallelism, without undue programmer effort.” [Mary Hall, David Padua, and Keshav Pingali, 2009, Compiler research: The next 50 years, Communications of the ACM 52(2): 60-67.]

The “sufficiently smart compiler” cannot actually exist (for modern imperative programming languages)

Functional programming languages are implicitly parallel (automatically parallelizable), because calls to pure functions have no side effects and can therefore be evaluated in any order. However, most “mere mortal” programmers find it hard to be productive in functional programming languages, or are unwilling to use them.
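A minimal illustration of why purity enables parallelism (sketched in Python rather than a functional language, purely for concreteness): because the mapped function depends only on its argument and has no side effects, sequential and parallel evaluation are guaranteed to agree, so a runtime is free to reorder or overlap the calls.

```python
from concurrent.futures import ThreadPoolExecutor

def collatz_step(x):
    # A pure function: its result depends only on its argument and it
    # touches no shared state, so evaluations over a list are
    # independent and may run in any order, or all at once.
    return x // 2 if x % 2 == 0 else 3 * x + 1

values = list(range(1, 100))

sequential = [collatz_step(x) for x in values]
with ThreadPoolExecutor(max_workers=4) as ex:
    parallel = list(ex.map(collatz_step, values))

# Purity guarantees the two evaluation strategies agree.
assert sequential == parallel
```

An impure function (one that, say, appended to a shared list) would break this guarantee, which is exactly why compilers for imperative languages cannot assume it.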

Many attempts have been made to automatically parallelize imperative programming languages, with varying success. With imperative languages, however, the compiler must be overly cautious in applying parallelization strategies: safely parallelizing imperative code is not just difficult but uncomputable in the general case, because the data dependency graph of an imperative program (the graph of which values are used to compute which other values) cannot, in general, be known until the code is actually run. A static analyzer can often only guess at the origin or identity of a specific value without executing the code. The sufficiently smart compiler therefore cannot exist for modern imperative programming languages.
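A small sketch (in Python, with hypothetical names) of why the dependency graph is a run-time object: whether the iterations of this scatter loop conflict depends on the *contents* of the index array, not on the program text, so a static analyzer cannot in general decide whether the loop is safe to parallelize.

```python
def scatter(a, idx, vals):
    # Whether two iterations conflict depends on whether idx contains
    # duplicate entries -- a fact about run-time data, invisible to
    # static analysis of the loop itself.
    for i in range(len(idx)):
        a[idx[i]] = vals[i]
    return a

# No duplicate indices: all iterations are independent, and the loop
# could safely run in parallel.
assert scatter([0] * 4, [0, 1, 2, 3], [10, 20, 30, 40]) == [10, 20, 30, 40]

# Index 2 appears twice: iterations 1 and 3 write the same location,
# and sequential semantics require the later write to win, so a
# compiler that cannot rule this case out must keep the loop serial.
assert scatter([0] * 4, [0, 2, 1, 2], [10, 20, 30, 40]) == [10, 30, 40, 0]
```

The same loop is either embarrassingly parallel or strictly serial depending on data that may not exist until the program is already running.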

We need a new programming model that lets programmers work in the maximal implicitly parallel subset of the imperative paradigm while allowing the compiler to safely and automatically parallelize the code.
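As a rough sketch of what such an implicitly parallel subset might look like (Python used purely for illustration; the structure, not the syntax, is the point): ordinary-looking code whose shape is a pure per-element map followed by an associative merge. A compiler that could recognize this shape could distribute the map and tree-reduce the merge without any annotations from the programmer.

```python
from functools import reduce

def word_count(docs):
    def count(doc):
        # Pure and per-element: each document's counts depend only on
        # that document, so the map over docs is implicitly parallel.
        c = {}
        for w in doc.split():
            c[w] = c.get(w, 0) + 1
        return c

    def merge(a, b):
        # Associative merge: partial results can be combined in any
        # grouping, so the reduction can be parallelized as a tree.
        out = dict(a)
        for w, n in b.items():
            out[w] = out.get(w, 0) + n
        return out

    return reduce(merge, map(count, docs), {})

assert word_count(["big data", "big code"]) == {"big": 2, "data": 1, "code": 1}
```

The programmer wrote straightforward sequential code; the parallelism is a property the compiler could discover from the data dependency structure, rather than something the programmer had to engineer.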