The brain is a hierarchical learning system, and knowledge representation is inherently hierarchical
Strengths of hierarchical learning
Hierarchical reasoning systems can be immensely dexterous, just as an articulated arm is dexterous: each joint in an arm has only one or two degrees of freedom, but these degrees of freedom compound, together yielding a remarkably rich range of motion. In a hierarchical reasoning system, global, mid-level and local reasoning engines collaborate in feedback with each other to converge on a single hypothesis or plan of action that is consistent at multiple levels of abstraction. Each level only needs to absorb a small amount of noise, yet the overall system can be highly resilient due to the synergistic combining of degrees of freedom. Note that the feedback between the layers of hierarchy can be either excitatory or inhibitory, sensitizing or desensitizing other levels towards certain hypotheses based on the current best-guess set of hypotheses at each level. (This is effectively a manifestation of Bayesian priors.)
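The excitatory/inhibitory feedback between levels can be sketched as a toy two-level Bayesian loop. Everything here -- the contexts, percepts and probability values -- is invented for illustration; this is a sketch of the idea, not a model of cortex:

```python
import numpy as np

# Toy two-level hierarchy: a "global" level holds beliefs over contexts,
# a "local" level holds beliefs over percepts. Each level's posterior acts
# as a prior on the other (top-down excitation/inhibition of hypotheses).
contexts = ["indoors", "outdoors"]
percepts = ["lamp", "sun"]

# P(percept | context): how strongly each context predicts each percept.
likelihood = np.array([[0.9, 0.1],   # indoors  -> mostly lamps
                       [0.2, 0.8]])  # outdoors -> mostly sun

context_belief = np.array([0.5, 0.5])    # global prior
percept_evidence = np.array([0.3, 0.7])  # noisy bottom-up signal

for _ in range(5):
    # Top-down: the context belief sets the prior over percepts...
    percept_prior = context_belief @ likelihood
    percept_belief = percept_prior * percept_evidence
    percept_belief /= percept_belief.sum()
    # Bottom-up: the percept belief re-weights the context belief.
    context_belief = context_belief * (likelihood @ percept_belief)
    context_belief /= context_belief.sum()

print(context_belief)  # belief tilts progressively toward "outdoors"
```

Each pass tightens the agreement between levels: the bottom-up evidence favors "sun", which sensitizes the global level to "outdoors", which in turn sharpens the prior over percepts.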
Also, consistent false positives and false negatives at individual levels can actually help improve performance of a hierarchical reasoning system, because each level can learn to work with systematic quirks of higher and lower levels. A hierarchical learning system is more powerful than the sum of its parts, because error correction is innate.
Note that modularity and hierarchy are ubiquitous across all of biology, at all levels of complexity, and this appears to be a direct result of trying to minimize communication cost. Biology is particularly good at producing a massive fan-out of emergent complexity at every layer of hierarchy, and then packaging up that complexity inside a module, and presenting only a small (but rich) "API" to the outer layers of complexity. This can be observed in the modularity of proteins, organelles, cells, organs and organisms.
Learning is a process of information compression
It is interesting to note that hierarchical learning is related to progressive encoding, as found in JPEG and other image compression algorithms, where an image is encoded at low resolution first, and then progressively refined by adding the detail that remains as the difference between lower-order approximations and the original image. Progressive encoding isn't just useful for letting you see the overall content of an image before the complete image has loaded -- it also increases compression ratios by decreasing local dynamic range.
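A minimal sketch of the coarse-plus-residual idea, with an invented 1-D "signal" standing in for image data: the residual left after subtracting the coarse layer has a much smaller dynamic range than the original, which is why progressive encoding can improve compression ratios:

```python
import numpy as np

# Illustrative 1-D signal with large-scale structure plus small detail.
signal = np.array([10, 12, 11, 13, 50, 52, 51, 53], dtype=float)

# Coarse pass: average each pair of samples (half resolution).
coarse = signal.reshape(-1, 2).mean(axis=1)

# Refinement pass: residual between the original and the upsampled
# coarse layer. Coarse + residual reconstructs the signal exactly.
residual = signal - np.repeat(coarse, 2)

# The residual's dynamic range is far smaller than the signal's,
# so this layer is cheaper to encode.
print(np.ptp(residual), np.ptp(signal))  # 2.0 43.0
```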
In fact, in general, learning is a process of information compression -- Grandmaster chess players never evaluate all possible moves, they compress a large number of move sequences and board positions into a much smaller number of higher-order concepts, patterns and strategies -- so it would make sense that learning is innately hierarchical.
Error correction improves accuracy of inference
As noted, error correction is innate in hierarchical systems. In fact, the principles of error correction, as defined in information theory, can be directly applied to machine learning. In my own tests, adding error correction to the output codes of a handwriting recognition system decreased error rates by a factor of three, with no other changes to the system.
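The output-code idea can be sketched as follows. The codewords below are invented for illustration; they are spaced at a minimum Hamming distance of 4, so nearest-codeword decoding corrects any single flipped output bit:

```python
import numpy as np

# Error-correcting output codes (ECOC) for a toy 4-class problem.
# Instead of one output per class, the classifier emits 7 bits, and
# each class is assigned a codeword well separated from the others.
codewords = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 1, 0],
])

def decode(output_bits):
    # Pick the class whose codeword is nearest in Hamming distance.
    distances = (codewords != output_bits).sum(axis=1)
    return int(distances.argmin())

# A noisy output for class 2, with one bit flipped (index 3),
# still decodes to the right class.
noisy = np.array([1, 0, 1, 0, 1, 0, 1])
print(decode(noisy))  # 2
```

The redundancy in the 7-bit code is exactly what buys the error tolerance: the extra bits carry no new information, only separation between codewords.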
However, adding error correction to a system typically implies adding redundancy to decrease entropy, by increasing the minimum Hamming distance between codewords etc. This principle lies in direct tension with the fact that learning is a process of information compression, because information compression, as we know it, typically deals with removing redundancy.
A further conundrum is presented by the fact that traditional sequence compression, where redundancy is minimized and entropy maximized, dramatically increases the brittleness of a data stream: flipping bits in a zipped file is much more likely to render the original file unreadable than flipping bits in the uncompressed file. However, biology seems to welcome "data corruption", as evidenced by how resilient the genome is to mutation (mutation actually helps species adapt over time), and as evidenced by how well the brain works with uncertainty.
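The brittleness claim is easy to demonstrate with a stock sequence compressor such as zlib (used here purely as an illustration):

```python
import zlib

text = b"the quick brown fox jumps over the lazy dog " * 100

# Flip one bit in the *uncompressed* text: a single-character blemish,
# and the rest of the text remains perfectly readable.
raw = bytearray(text)
raw[len(raw) // 2] ^= 0x01

# Flip one bit in the *compressed* stream: the decoder almost always
# rejects the whole stream (structural damage or checksum mismatch).
compressed = bytearray(zlib.compress(text))
compressed[len(compressed) // 2] ^= 0x01
try:
    zlib.decompress(bytes(compressed))
    print("survived")
except zlib.error:
    print("corrupted")  # the typical outcome
```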
The CS approach to information compression increases brittleness
The most interesting theoretical approach to unifying the two apparently opposing forces of error correction and information compression is Kolmogorov complexity, or "algorithmic information theory", which observes that it may take significantly less space to specify a program that generates a given sequence or structure than to represent it directly (or to sequence-compress it). Algorithmic compression may be used by the brain to dramatically increase compression ratios of structure, making room for redundancy. (The genome certainly relies on algorithmic compression: there are roughly 30,000 immensely complex cells in your body for every base pair of DNA in your genome.)
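The core intuition of algorithmic compression can be sketched in a few lines (the generating expression is an arbitrary example):

```python
# A long, highly structured sequence can be replaced by a much shorter
# program that regenerates it -- the essence of algorithmic compression.
sequence = [n * n % 7 for n in range(100_000)]   # 100,000 values...
program = "[n * n % 7 for n in range(100_000)]"  # ...described in 35 chars

assert eval(program) == sequence
print(len(program), len(sequence))  # 35 100000
```

No sequence compressor operating on the raw values can match this ratio, because the program captures the generative rule rather than the surface statistics.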
The criticality of feedback in the learning process
Feedback (and the ability to respond to feedback by updating a model or by changing future behavior) is the single most critical element of a learning system -- in fact, without feedback, it is impossible to learn anything. However, the brain probably contains trillions of nested feedback loops, and the emergent behavior of a system incorporating even just a few linked feedback loops can be hard to characterize. If we hope to build a brain-like system, it is critical to understand how the vast number of interacting feedback loops in the brain work together harmoniously at different scales. We have a lot of work to do to understand the global and local behavioral rules that together lead to the brain's emergent properties.
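At the smallest scale, the "learning requires feedback" claim can be sketched as a single feedback loop: an estimate is repeatedly corrected by the error between prediction and outcome (the target value and gain here are arbitrary):

```python
# One feedback loop: predict, observe the error, update the model.
# Without the error signal there would be no basis for any update.
target = 7.3     # the quantity being learned
estimate = 0.0   # the model's initial guess
for _ in range(50):
    error = target - estimate  # feedback signal
    estimate += 0.2 * error    # respond by updating the model

print(round(estimate, 3))  # 7.3
```

The brain's difficulty is not this loop but the composition of astronomical numbers of such loops, where each loop's "target" is itself the moving output of other loops.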
How to build a brain
Building a computational structure with multiscale, feedback-based, predictive properties similar to those in the brain is critical to creating machine intelligence that will be useful in the human realm. Until we figure out how to do this, we're stuck with machine learning that amounts to little more than fitting arbitrary function approximators.
Jeff Hawkins' HTM framework looks a lot more like a big Boolean logic network than the soft, fuzzy Bayesian belief network present in the brain. The basic ideas behind HTM are sound, but we need to replace HTM's regular, binarized, absolute-coordinate grid system with something more amorphous, reconfigurable and fuzzy, and we need to propagate Bayesian beliefs rather than binary signals. Building such a system so that it has the desired behavior will be a hard engineering challenge, but the resulting system should be, ironically, much closer to the principles Jeff describes in his own book.
Most importantly, however, we will have built something that functions a lot more like the human brain than most existing machine learning algorithms -- something that, through having a cognitive "impedance" correctly matched to the human brain, will more naturally interface with and extend our own intelligence.