On hierarchical learning and building a brain

The brain is a hierarchical learning system, and knowledge representation is inherently hierarchical

Many systems in the human brain are structured hierarchically, with feedback loops between the levels of hierarchy. Hierarchically structuring a system creates a far more compact and flexible recognition engine, controller or model than alternatives. Knowledge and learning are inherently hierarchical, with generalizations as higher-level constructs and specifics as lower-level constructs. Assimilating new knowledge often requires breaking old hierarchies (undergoing a paradigm shift) and restructuring existing knowledge in terms of new generalizations, so that the new knowledge can be properly incorporated. The hierarchical structuring of reasoning may not be surprising given that the wiring of the brain itself is highly hierarchical in structure.


Strengths of hierarchical learning

Hierarchical reasoning systems can be immensely dextrous, just as an articulated arm is dextrous: each joint in an arm has only one or two degrees of freedom, but these degrees of freedom compound, overall yielding an immensely rich range of motion. In a hierarchical reasoning system, global, mid-level and local reasoning engines collaborate in feedback with each other to converge on a single hypothesis or plan of action that is consistent at multiple levels of abstraction. Each level only needs to absorb a small amount of noise, yet the overall system can be immensely resilient due to the synergistic combining of degrees of freedom. Note that the feedback between the layers of hierarchy can be either excitatory or inhibitory, sensitizing or desensitizing other levels towards certain hypotheses based on the current best-guess set of hypotheses at the current level. (This is effectively a manifestation of Bayesian priors.)
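
To make the Bayesian-prior reading of feedback concrete, here is a minimal sketch, not a model of any real neural circuit. It assumes a hypothetical letter-recognition level that receives ambiguous bottom-up evidence, and a hypothetical word level whose current best guess is fed back down as a prior. The names and all probabilities are invented for illustration.

```python
def posterior(prior, likelihood):
    """Combine a top-down prior with bottom-up evidence using Bayes' rule."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Bottom-up evidence from a smudged glyph: it slightly favors "c" over "e".
# (Illustrative numbers only.)
letter_likelihood = {"c": 0.55, "e": 0.45}

# No feedback: the letter level has no reason to prefer either hypothesis.
flat_prior = {"c": 0.5, "e": 0.5}

# Feedback from a hypothetical word level that currently believes the word
# is "the": it sensitizes the letter level towards "e" and desensitizes "c".
word_level_prior = {"c": 0.1, "e": 0.9}

without_feedback = posterior(flat_prior, letter_likelihood)
with_feedback = posterior(word_level_prior, letter_likelihood)

print("without feedback:", max(without_feedback, key=without_feedback.get))
print("with feedback:   ", max(with_feedback, key=with_feedback.get))
```

On its own, the letter level picks "c". With the higher level's belief applied as a prior, the same evidence resolves to "e", which is the excitatory and inhibitory sensitization described above in its simplest form.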

Also, consistent false positives and false negatives at individual levels can actually help improve performance of a hierarchical reasoning system, because each level can learn to work with systematic quirks of higher and lower levels. A hierarchical learning system is more powerful than the sum of its parts, because error correction is innate.

Note that modularity and hierarchy are ubiquitous across all of biology, at all levels of complexity, and this appears to be a direct result of trying to minimize communication cost. Biology is particularly good at producing a massive fan-out of emergent complexity at every layer of hierarchy, and then packaging up that complexity inside a module, and presenting only a small (but rich) "API" to the outer layers of complexity. This can be observed in the modularity of proteins, organelles, cells, organs and organisms.

Learning is a process of information compression

Hierarchical learning is related to progressive encoding, as found in JPEG and other image compression algorithms, where an image is encoded in low resolution first and then progressively refined by adding the detail that remains as the difference between lower-order approximations and the original image. Progressive encoding isn't just useful for letting you see the overall content of an image before the complete image has loaded -- it also increases compression ratios by decreasing local dynamic range.
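
The dynamic-range point can be sketched in a few lines. This is not the JPEG algorithm, just a one-level coarse-plus-residual split of a 1-D signal. The function names and sample values are made up for illustration, and the signal is assumed to have even length.

```python
def encode(signal):
    """Split a signal into a half-resolution approximation plus residuals."""
    # Coarse level: one (integer) average per pair of neighboring samples.
    coarse = [(signal[i] + signal[i + 1]) // 2 for i in range(0, len(signal), 2)]
    # Residual: the detail left over once the coarse level is subtracted.
    residual = [signal[i] - coarse[i // 2] for i in range(len(signal))]
    return coarse, residual

def decode(coarse, residual):
    """Refine the coarse approximation back into the original signal."""
    return [coarse[i // 2] + residual[i] for i in range(len(residual))]

signal = [100, 102, 105, 104, 200, 203, 198, 201]
coarse, residual = encode(signal)

signal_range = max(signal) - min(signal)
residual_range = max(residual) - min(residual)
print("coarse:  ", coarse)
print("residual:", residual)
print("ranges:  ", signal_range, "vs", residual_range)
```

Here the original samples span a range of 103, while the residuals span a range of only 3. The residuals can therefore be stored in far fewer bits per sample, and the coarse level can be split again in the same way to add further levels to the hierarchy.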

In fact, learning in general is a process of information compression. Grandmaster chess players never evaluate all possible moves; they compress a large number of move sequences and board positions into a much smaller number of higher-order concepts, patterns and strategies. It would therefore make sense that learning is innately hierarchical.

Error correction improves accuracy of inference

As noted, error correction is innate in hierarchical systems. In fact, the principles of error correction, as defined in information theory, can be directly applied to machine learning. In my own tests, adding error correction to the output codes of a handwriting recognition system can decrease error rates by a factor of three, with no other changes to the system.
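
The general idea of error-correcting output codes can be sketched as follows. This is not the handwriting system or the codes used in the tests mentioned above. The four-class codebook is hypothetical, chosen so that every pair of codewords differs in at least four bit positions. Each bit stands for the output of one binary classifier.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical codebook: one 6-bit codeword per class, with a minimum
# pairwise Hamming distance of 4.
CODEBOOK = {
    "A": (0, 0, 0, 0, 0, 0),
    "B": (1, 1, 1, 1, 0, 0),
    "C": (1, 1, 0, 0, 1, 1),
    "D": (0, 0, 1, 1, 1, 1),
}

def decode(bits):
    """Pick the class whose codeword is nearest to the predicted bits."""
    return min(CODEBOOK, key=lambda label: hamming(CODEBOOK[label], bits))

# The classifiers meant to output the codeword for "C", but the last one erred.
noisy_output = (1, 1, 0, 0, 1, 0)
print(decode(noisy_output))
```

With a minimum distance of 4, any single wrong bit still decodes to the intended class. A plain 2-bit encoding of the same four classes would turn every wrong bit into a wrong answer.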

However, adding error correction to a system typically implies adding redundancy to decrease entropy, for example by increasing the minimum Hamming distance between codewords. This principle lies in direct tension with the fact that learning is a process of information compression, because information compression, as we know it, typically deals with removing redundancy.

A further conundrum is presented by the fact that traditional sequence compression, where redundancy is minimized and entropy maximized, dramatically increases the brittleness of a data stream: flipping bits in a zipped file is much more likely to render the original file unreadable than flipping bits in the uncompressed file. However, biology seems to welcome "data corruption", as evidenced by how resilient the genome is to mutation (mutation actually helps species adapt over time), and as evidenced by how well the brain works with uncertainty.
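
The brittleness claim is easy to check with Python's standard zlib module. The sketch below uses an arbitrary sample text and a hypothetical flip_bit helper. It flips one bit in the raw data, and then tries every possible single-bit flip in the compressed stream.

```python
import zlib

def flip_bit(data, bit_index):
    """Return a copy of data with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)

text = b"the quick brown fox jumps over the lazy dog. " * 20
compressed = zlib.compress(text)

# Uncompressed: a flipped bit corrupts one byte and leaves the rest readable.
damaged = flip_bit(text, 100)
raw_bytes_changed = sum(a != b for a, b in zip(text, damaged))

# Compressed: try each single-bit flip and count how many make the original
# text unrecoverable (a decompression error or different output).
total_bits = len(compressed) * 8
unrecoverable = 0
for i in range(total_bits):
    try:
        ok = zlib.decompress(flip_bit(compressed, i)) == text
    except zlib.error:
        ok = False
    if not ok:
        unrecoverable += 1

print("raw bytes changed by one flip:", raw_bytes_changed)
print("unrecoverable flips:", unrecoverable, "of", total_bits)
```

One flipped bit in the raw text damages a single character. In the compressed stream, nearly every position is fatal, in part because zlib also verifies a checksum over the decompressed data.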

The CS approach to information compression increases brittleness

The most interesting theoretical approach to unifying the two apparently opposing forces of error correction and information compression is Kolmogorov complexity, or "algorithmic information theory", which measures the complexity of a sequence or structure by the length of the shortest program that generates it. Describing such a program may require significantly less space than directly representing the sequence (or sequence-compressing it). Algorithmic compression may be used by the brain to dramatically increase compression ratios of structure, making room for redundancy. (Algorithmic compression is almost certainly used in the genome: by current estimates there are on the order of 10,000 cells in your body, each immensely complex, for every base pair of DNA in your genome.)
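
The gap between sequence compression and algorithmic compression can be illustrated with a toy generator. Kolmogorov complexity itself is uncomputable, so the sketch only shows an upper bound: the length of one particular program that produces the sequence. The generator is a textbook linear congruential generator chosen for illustration, and all names are hypothetical.

```python
import zlib

# The "program": a tiny generator, kept as source text so its size in bytes
# can be measured directly.
GENERATOR_SOURCE = '''
def generate(n, state=12345):
    out = bytearray()
    for _ in range(n):
        state = (1103515245 * state + 12345) % 2**31
        out.append((state >> 16) & 0xFF)
    return bytes(out)
'''

namespace = {}
exec(GENERATOR_SOURCE, namespace)
sequence = namespace["generate"](100_000)

# Sequence compression sees no repeated substrings to exploit here.
sequence_compressed_size = len(zlib.compress(sequence, 9))
# Algorithmic compression: store the program instead of its output.
program_size = len(GENERATOR_SOURCE.encode())

print("raw sequence bytes:   ", len(sequence))
print("zlib-compressed bytes:", sequence_compressed_size)
print("program bytes:        ", program_size)
```

The sequence looks close to random to zlib, yet a program of a couple of hundred bytes regenerates it exactly. The space saved this way is what could, in principle, be spent on redundancy.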

The criticality of feedback in the learning process

Feedback (and the ability to respond to feedback by updating a model or by changing future behavior) is the single most critical element of a learning system -- in fact, without feedback, it is impossible to learn anything. However, the brain consists of probably trillions of nested feedback loops, and the emergent behavior of a system incorporating even just a few linked feedback loops can be hard to characterize. It is critical to understand how the vast number of interacting feedback loops in the brain work together harmoniously at different scales if we hope to build a brain-like system. We have a lot of work to do to understand the global and local behavioral rules in the brain that together lead to its emergent properties.

The use of feedback in machine learning has so far mostly been limited to the process of minimizing training-set error during training (through backpropagation or an analog). Time series feedback loops in recurrent networks are also immensely powerful though, and these can be trained using backpropagation through time. Recurrent networks and other time series based models can be used for temporal prediction (see "Memory-prediction framework" on Wikipedia). Prediction is a fundamental property of intelligence: the brain is constantly simulating the world around it on a subconscious level, comparing the observed to the expected, and then updating the predictive model to minimize the error and taking corrective action based on unexpected contingencies. Temporal prediction is a core concept in Jeff Hawkins' book On Intelligence. However, Jeff's HTM framework is far too rigid and discrete in its implementation of these ideas to be generally applicable to substantial real-world problems.
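
The predict, compare and update loop can be shown with something far simpler than a recurrent network or HTM. The sketch below swaps in an online least-mean-squares linear predictor that guesses the next sample of a sine wave from the previous two. The signal, learning rate and function name are assumptions made for illustration.

```python
import math

def run_predictor(signal, learning_rate=0.5):
    """Online loop: predict the next sample, observe it, update on the error."""
    weights = [0.0, 0.0]  # model: next ~ w0 * current + w1 * previous
    errors = []
    for t in range(1, len(signal) - 1):
        inputs = (signal[t], signal[t - 1])
        predicted = weights[0] * inputs[0] + weights[1] * inputs[1]
        error = signal[t + 1] - predicted  # observed minus expected
        # Least-mean-squares update: nudge each weight to shrink this error.
        weights[0] += learning_rate * error * inputs[0]
        weights[1] += learning_rate * error * inputs[1]
        errors.append(abs(error))
    return weights, errors

signal = [math.sin(0.3 * t) for t in range(2000)]
weights, errors = run_predictor(signal)

early = sum(errors[:100]) / 100   # mean surprise while the model is naive
late = sum(errors[-100:]) / 100   # mean surprise after many updates
print("early error:", early)
print("late error: ", late)
```

The prediction error is the only teaching signal: the model is never told the rule that generates the sine wave, yet its surprise shrinks as the loop runs.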

How to build a brain

Building a computational structure with multiscale, feedback-based, predictive properties similar to those in the brain is critical to creating machine intelligence that will be useful in the human realm. Until we figure out how to do this, we're stuck with machine learning amounting to nothing more than the process of learning arbitrary function approximators.

Jeff Hawkins' HTM framework looks a lot more like a big Boolean logic network than the soft, fuzzy Bayesian belief network present in the brain. The basic ideas behind HTM are sound, but we need to replace HTM's regular, binarized, absolute-coordinate grid system with something more amorphous, reconfigurable and fuzzy, and we need to propagate Bayesian beliefs rather than binary signals. Building such a system so that it has the desired behavior will be a hard engineering challenge, but the resulting system should be, ironically, much closer to the principles Jeff describes in his own book.

Most importantly, however, we will have built something that functions a lot more like the human brain than most existing machine learning algorithms -- something that, through having a cognitive "impedance" correctly matched to the human brain, will more naturally interface with and extend our own intelligence.