Complementary Learning Systems
Our starting point is the seminal work by James McClelland, Bruce McNaughton, and Randall O'Reilly on the CLS view of memory (McClelland, McNaughton, & O'Reilly, 1995; Norman & O'Reilly, 2003; O'Reilly & Norman, 2002). This view divides the cognitive architecture for memory into two systems. First, there is a high-fidelity "surface" system, located primarily in the hippocampus and associated regions of the medial temporal lobe. It keeps detailed records of experiences as they transpire. This system has a sizable storage capacity but minimal learning and conceptualization abilities; it doesn't try to make sense of patterns within these records. Second, there is a "deep" system that is primarily located in the neocortex. This is a high-intelligence, high-abstraction system that is specialized for extracting statistical regularities, generalizations, and patterns from the data. Importantly, for the deep system to do its work, it needs to be presented with a large number of examples from the domain of interest. The surface system is well positioned to provide these examples because it stores detailed records of experience in its high-fidelity library. The CLS framework proposes complementary interactions between the two systems, with the surface system feeding examples from its sizable library in an ongoing way to the deep system and thereby facilitating high-level pattern learning.
Importantly, the idea of repeated presentation plays a central role in the CLS architecture. This is the idea that certain forms of learning, especially learning of patterns and generalizations across a set of examples, are substantially facilitated by presenting these examples multiple times and in multiple ways. We will discuss the computational advantages of repeated presentation as we proceed.
The CLS picture is widely accepted by memory researchers (Frankland & Bontempi, 2005; Moscovitch, Nadel, Winocur, Gilboa, & Rosenbaum, 2006; Squire & Bayley, 2007) and is supported by multiple lines of evidence. First, there is excellent neurobiological evidence for a functional dissociation between the hippocampal and neocortical systems proposed in the CLS model. The hippocampus exhibits a number of "design features" that enable it to play the role of the high-fidelity surface learner. Regions of the hippocampus, especially region CA3, have reduced cellular density, regular latticed arrangement of neurons, and lower levels of neuronal firing compared to the neocortex (Barnes, McNaughton, Mizumori, Leonard, & Lin, 1990; O'Reilly & Norman, 2002). These features are well suited for creating sharply demarcated memory representations in which even highly similar stimuli get distinct representations.
The hippocampus is also unique in being highly plastic (Marr, 1971; McNaughton & Morris, 1987); that is, it exhibits a remarkably rapid learning rate, which is essential if it is to store in real time high-fidelity representations of experience. Perhaps the best known and most extensively studied example of rapid plasticity in the brain is the phenomenon of long-term potentiation (Bliss & Collingridge, 1993), which occurs in multiple subregions of the hippocampus and associated structures. This is a Hebbian learning process (fire together, wire together) in which groups of temporally co-active neurons exhibit persistent strengthening of their synaptic connections lasting for weeks to months. Long-term potentiation is thought to explain the ability of the hippocampus to rapidly learn arbitrary associations between two stimuli or between stimuli and spatiotemporal context, including various instances of "one shot" learning in which associations are formed within a single training episode (Nakazawa et al., 2003).
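The Hebbian rule itself is simple enough to sketch. Below is a minimal, purely illustrative one-shot association in Python (a caricature of the computation, not a biophysical model of long-term potentiation; the stimulus and context patterns are made up):

```python
import numpy as np

# "Fire together, wire together": a single co-activation of a stimulus
# pattern and a context pattern strengthens exactly the synapses joining
# their jointly active units, so cuing with the stimulus later
# reactivates the context. Patterns here are arbitrary binary vectors.
stimulus = np.array([1., 0., 1., 0., 1., 0., 0., 1.])
context  = np.array([0., 1., 1., 0., 0., 1., 1., 0.])

# One training episode: weight grows only where presynaptic (stimulus)
# and postsynaptic (context) units fire together.
W = np.outer(context, stimulus)

retrieved = W @ stimulus              # cue with the stimulus...
print((retrieved > 0).astype(float))  # ...recovers the context pattern
```

A single outer-product update suffices here, which is the sense in which Hebbian learning supports "one shot" association; the error-correction learning discussed below instead requires many small steps.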
With respect to all of these design features, the neocortex is almost precisely reversed (O'Reilly & Norman, 2002). It has high cell density, irregular neuronal arrangements, and rapid firing rates, all of which produce overlapping representations for similar stimuli. In addition, rather than rapid Hebbian learning, the neocortical system uses various forms of error-correction learning of the sort we saw in Chapter 2. This kind of learning is slow and iterative, and we will discuss it in more detail in a moment.
In short then, consistent with the CLS model, the neurobiological evidence supports specialization in the hippocampus and neocortex: The hippocampus is optimized for the separation of representations and retention of detail while the neocortex is optimized for integration of representations and forming abstractions from the details.
Turning now to a second line of evidence supporting the CLS model, a key postulate of the model is that the hippocampus repeatedly presents high-fidelity records of experience to the cortical deep learning system. There is compelling neurobiological evidence that this process in fact occurs (O'Neill, Pleydell-Bouverie, Dupret, & Csicsvari, 2010). A vivid illustration comes from studies (Davidson, Kloosterman, & Wilson, 2009; Lee & Wilson, 2002) of the firing patterns of so-called hippocampal place cells that fire selectively to certain locations as an animal explores its surroundings. As a consequence, when the animal travels a certain trajectory, these place cells fire in a distinctive sequence. During subsequent periods of quiet rest or slow wave sleep, these same place cells are repeatedly reactivated in brief burst patterns (Buzsaki, 1989; Girardeau & Zugaro, 2011) with concurrent increased hippocampal-cortical communication. Critically, the order of place cell activation during each burst is the same as during previous rounds of exploration (although the firing rate is dramatically speeded up, with a "virtual speed" of roughly 8 meters per second; Davidson et al., 2009). This supports the idea that the hippocampus is repeatedly presenting trajectories from previous rounds of exploration in a way that would facilitate cortical deep learning of the gist, in this case, abstract spatial relations.
A third line of evidence for the CLS model comes from human lesion studies. It has long been known that lesions to the hippocampus produce retrograde amnesia for declarative memories, especially for memory of autobiographical episodes. Interestingly, amnesia is often time-limited with memories from the more remote past spared (Squire & Alvarez, 1995; Squire & Bayley, 2007). The CLS framework nicely explains this pattern. The neocortical system stores generalizations and statistical regularities from hippocampal inputs, resulting in partial redundancy and overlap in the mnemonic contents of the two systems. The formation of neocortical memory traces, however, is slow and iterative, and thus requires extensive time for consolidation. This explains why there is preferential sparing of remote memories with hippocampal damage; only those memories that have had sufficient time for neocortical stabilization and consolidation are spared. If the neocortex has not had time to extract patterns from hippocampal memories of relatively recent events, then there will be complete amnesia for these events.
A fourth line of support for the CLS model comes from considering its computational rationale; there are excellent computational reasons for why a two-tiered learning architecture makes sense. It might seem initially strange that learning can be facilitated by presenting memories of prior experiences again and again. Rather, it seems more plausible that once an experience has occurred, the learner should extract whatever lessons it affords and then move on; little is to be gained by replaying a memory of that experience, let alone replaying it repeatedly. Perhaps McClelland, McNaughton, and O'Reilly's most penetrating insight is that there are certain learning contexts—in particular, contexts involving the learning of abstract patterns—in which repeated presentation of memories dramatically enhances learning.
McClelland, McNaughton, and O'Reilly model this kind of high-level "pattern learning" using an artificial neural network consisting of an input layer, an output layer, and multiple hidden layers (McClelland et al., 1995). The network is initialized with random weights between the nodes. Training examples consist of specific input-output pairings; if the network fails to generate the correct predicted output upon presentation of the input, the weights of the network are adjusted in small increments in a direction that reduces the discrepancy. Over time, the network faithfully reproduces the input-output mappings with which it was trained.
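A toy version of this training loop can be sketched as follows (the network size, learning rate, and the four input-output pairings are our own illustrative choices, not the parameters of McClelland et al.'s simulation):

```python
import numpy as np

# A tiny network with one hidden layer, trained by error correction:
# weights start random, and each pass nudges them slightly in the
# direction that reduces the output discrepancy.
rng = np.random.default_rng(0)

# Four arbitrary input-output pairings (here, the OR mapping).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [1.]])

W1 = rng.normal(0, 0.5, (2, 4))   # input -> hidden weights
W2 = rng.normal(0, 0.5, (4, 1))   # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5  # each example adjusts the weights only a small amount
for epoch in range(5000):
    H = sigmoid(X @ W1)           # hidden-layer activations
    out = sigmoid(H @ W2)         # network's predicted outputs
    err = out - Y                 # discrepancy from the targets
    # Propagate the error backward and take a small corrective step.
    d_out = err * out * (1 - out)
    d_hid = (d_out @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ d_out
    W1 -= lr * X.T @ d_hid

print(np.round(out.ravel()))      # outputs approach the trained targets
```

After many iterations, the network's outputs for the trained inputs converge toward their targets, which is all that "faithfully reproducing the input-output mappings" amounts to here.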
Artificial neural networks can uncover hidden patterns of similarity among inputs and generalize learning to new unseen inputs, and there is evidence that they do this in ways that closely resemble human learners (Elman, 1998; Rumelhart, McClelland, & PDP Research Group, 1986; White, 1989). Patterns of similarity are encoded in the weights in the hidden layers; inputs that are similar share similar configurations of weightings. For example, McClelland et al. (1995) discuss a network that learns about living things (plant, animal, pine, oak, robin, sunfish, etc.) and their properties (is big, can fly, has gills, etc.). After training, the network weightings reflected that oak is similar to pine, and that both are quite different from a canary or a robin.
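How hidden-layer weights come to encode similarity can be illustrated with a stripped-down version of this setup (the items and properties below are simplified stand-ins for the original living-things training set, and a linear network is used for brevity):

```python
import numpy as np

# Items are one-hot inputs; targets are binary property vectors. Oak and
# pine share a property profile; robin and canary share a different one.
rng = np.random.default_rng(0)

items = ["oak", "pine", "robin", "canary"]
#              bark leaves fly wings living
Y = np.array([[1.,  1.,   0., 0.,  1.],   # oak
              [1.,  1.,   0., 0.,  1.],   # pine
              [0.,  0.,   1., 1.,  1.],   # robin
              [0.,  0.,   1., 1.,  1.]])  # canary
X = np.eye(4)                      # one-hot input per item

W1 = rng.normal(0, 0.01, (4, 3))   # small init: structure must be learned
W2 = rng.normal(0, 0.01, (3, 5))

lr = 0.1
for _ in range(3000):              # slow, iterative error correction
    H = X @ W1                     # hidden representation of each item
    err = H @ W2 - Y               # error on the property targets
    gW2 = H.T @ err
    gW1 = err @ W2.T
    W2 -= lr * gW2
    W1 -= lr * gW1

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

hidden = {name: (X @ W1)[i] for i, name in enumerate(items)}
print(cos(hidden["oak"], hidden["pine"]))   # high: shared property profile
print(cos(hidden["oak"], hidden["robin"]))  # lower: different profile
```

Nothing in the one-hot inputs marks oak and pine as alike; the similarity of their hidden representations emerges purely from the overlap in their trained outputs.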
Training of an artificial neural net must proceed in a slow and iterative way. Weights are adjusted a small amount at a time. Large adjustments would "overfit" the network to the current example, whereas the goal is for the network to uncover patterns of similarity that hold across a range of examples. In addition, McClelland et al. (1995) discuss a second critical feature of artificial neural net training: Learning should be interleaved.
Consider two ways of training a neural network to learn 10 examples (i.e., 10 input-output pairings). The first presents blocks of each example serially, say 40 times for the first example, 40 for the second, and so on. The second presents the examples in a fairly random interspersed fashion: Two or three presentations of one example are followed by another and then by another until each of the 10 examples is presented 40 times. The first scheme, the serial scheme, ultimately fails because it generates the problem of "catastrophic interference" (McClelland et al., 1995; Spivey & Mirman, 2001). The network learns the current example, but then the weights are overwritten to learn the second example, and so on. In contrast, the interleaved presentation allows learning of all the examples and will result in optimal generalization to new unseen examples. In effect, interleaved presentation allows the network to "see" a number of examples in close proximity and thus identify the hidden patterns that the examples have in common.
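The contrast between the two schemes is easy to demonstrate. The sketch below (a simplified linear network with made-up random pairings, not the original simulations) trains on 10 examples, 40 presentations each, under both orderings and then measures recall of the full set:

```python
import numpy as np

# Ten input-output pairings in a low-dimensional space, so that their
# representations overlap and blocked training can interfere.
rng = np.random.default_rng(1)
n_items, dim = 10, 5
X = rng.normal(size=(n_items, dim))
Y = rng.normal(size=(n_items, dim))

def train(order, lr=0.05):
    W = np.zeros((dim, dim))
    for i in order:  # one small error-correcting update per presentation
        W += lr * np.outer(X[i], Y[i] - X[i] @ W)
    return W

reps = 40
# Serial scheme: 40 presentations of item 0, then 40 of item 1, and so on.
blocked = [i for i in range(n_items) for _ in range(reps)]
# Interleaved scheme: cycle through all 10 items, 40 times over.
interleaved = [i for _ in range(reps) for i in range(n_items)]

def recall_error(W):
    return float(np.mean((X @ W - Y) ** 2))

print("blocked error:    ", round(recall_error(train(blocked)), 3))
print("interleaved error:", round(recall_error(train(interleaved)), 3))
```

Blocked training leaves the network fitting mainly the final example, so its error over the whole set is high; interleaved training retains all 10 pairings, which is the catastrophic-interference contrast in miniature.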
McClelland et al. (1995) locate the computational rationale of the CLS architecture in the need to produce iterative, interleaved trains of examples for the purposes of neural net learning. This can most readily be achieved with a two-tier architecture: One system specializes in storing high-fidelity examples and thereafter repeatedly presents them in an interleaved fashion, and the other system specializes in slow iterative learning of hidden abstract patterns in the example set.
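The division of labor can be caricatured in a few lines of code (all class and method names here are illustrative, not terminology from the CLS papers):

```python
import random

# A fast, high-fidelity store records each episode in one shot, then
# replays its library in random interleaved order to a slow learner.
class Hippocampus:
    def __init__(self):
        self.episodes = []              # detailed record of every experience

    def store(self, episode):
        self.episodes.append(episode)   # one-shot storage, no gradual training

    def replay(self, n):
        rng = random.Random(0)
        # Interleaved presentation: episodes sampled in random order, so
        # the slow learner sees different examples in close proximity.
        return [rng.choice(self.episodes) for _ in range(n)]

class Neocortex:
    def __init__(self):
        self.counts = {}                # stands in for slowly learned statistics

    def learn(self, episode):
        # Slow, incremental extraction of regularities, reduced here to
        # tallying feature frequencies across many replayed episodes.
        for feature in episode:
            self.counts[feature] = self.counts.get(feature, 0) + 1

hippocampus, neocortex = Hippocampus(), Neocortex()
for episode in [("oak", "has_bark"), ("pine", "has_bark"), ("robin", "can_fly")]:
    hippocampus.store(episode)          # recorded immediately and in full

for episode in hippocampus.replay(300):
    neocortex.learn(episode)            # pattern extraction needs many passes

print(neocortex.counts)
```

Each system does what the other cannot: the store never generalizes, and the learner never sees an experience directly, only the interleaved stream replayed from the library.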