Following up on my notion that the brain should code using symbol sequences spelt in a Latin alphabet on a bus of 10^2-10^3 axons, and that the natural matching computer implementation is recognition by inner products on unit-length vectors in a high-dimensional space:
The inner product has the interesting property that "don't-care" fields fall out almost automatically: a zero field in the template vector means the corresponding value in the state vector has no influence on the match.
This suggests that if one wanted to return to the bitworld as a concession to computer efficiency, counting bits in the AND of the two vectors might be more to the point than looking at the XOR of the two vectors (Gray coding?).
Other ideas might be working in logspace, representing intensity as a log rather than an absolute value so that additions substitute for multiplications, or using multiple single-bit threshold properties, representing say "signal > 2^1; signal > 2^2; signal > 2^3..." and again just counting bits. This would allow a random mixture of precisions across the different letters in the word, while still retaining the efficiency of a simple count of bits set in the AND of the two vectors:

/* Count the bits set in the AND of two byte vectors of length max;  */
/* bits_in_byte[] is a 256-entry popcount-per-byte lookup table.     */
int bits_set = 0;
int i;
for (i = max; --i >= 0; ) {
    bits_set += bits_in_byte[ input[i] & template[i] ];
}
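
The multi-threshold idea can be sketched in the same style; this is purely an illustration, and the function name and the eight-threshold choice are mine:

/* Illustrative only: thermometer-code a magnitude as the bit pattern */
/* "signal > 2^1", "signal > 2^2", ... so letters of differing        */
/* precision still match by a plain popcount of an AND.               */
unsigned char thermometer_bits( unsigned signal ) {
    unsigned char bits = 0;
    int k;
    for (k = 0; k < 8; ++k) {
        if (signal > (1u << (k + 1)))
            bits |= (unsigned char)(1 << k);
    }
    return bits;
}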

Given a large number (10^8, say) of such template vectors, of say 128 bits each, how does one efficiently find the one(s) best matching a given input vector? Does it help if we assume half the bits are always set? How much does the address space shrink if we enforce that? Sounds like an excuse for a Thinking Machines architecture :)
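
Before worrying about clever indexing or exotic hardware, the baseline is just a linear scan; a sketch, with sizes and names of my own choosing (128-bit vectors as 16 bytes, and the same bits_in_byte table as above):

#define VEC_BYTES 16                    /* 128-bit vectors              */

extern int bits_in_byte[256];           /* popcount-per-byte table      */

/* Brute-force scan: return the index of the template whose AND with  */
/* the input has the most bits set.  O(n_templates * VEC_BYTES).      */
long best_match( const unsigned char *input,
                 const unsigned char *templates,   /* n_templates rows */
                 long                 n_templates )
{
    long best = 0, t;
    int  best_score = -1;
    for (t = 0; t < n_templates; ++t) {
        const unsigned char *tmpl = templates + t * VEC_BYTES;
        int score = 0, i;
        for (i = 0; i < VEC_BYTES; ++i)
            score += bits_in_byte[ input[i] & tmpl[i] ];
        if (score > best_score) { best_score = score; best = t; }
    }
    return best;
}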

-- After a swimthink --

Take it from the top:
Humans know 50K words, 100K chess positions... human expert vocabularies tend to be 10^4-10^7 chunks in size, depending on just who is counting and how and such.

Neural fanout tops out at about 10^4, and brain depth is something like 30 neurons, so we don't get a lot of slack to play with.

To learn a pattern reliably, a neuron must have access to logically sufficient information: If each chunk is represented by one neuron, the numbers don't work out: A single neuron can't "see" far enough to recognize an arbitrary chunk in the vocabulary.

Ergo, we conclude that encoding chunks as patterns of activation in a bus of fibers makes more sense than encoding them as one concept per fiber. Binary encoding usually -does- beat unary encoding :).

Once we've recognized any of our 10^8 or so items, we need to jam a result back in the same encoding if we're to close the state-cycle loop: This implies firing off a -set- of fibers. Looks like our old friend the sum-of-products PLA network, more or less, derived nearly as a matter of logical necessity.

We also need to ensure that one and only one result is being jammed on the output bus: This pretty much requires mutual inhibition between the different candidates trying to jam something on the bus.
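
In silicon cartoon form (and it really is only a cartoon: all the names are mine, and the mutual inhibition is collapsed down to a bare pick-the-single-best-scorer), one cycle of that sum-of-products-plus-inhibition machine might look like:

#include <string.h>                     /* memcpy */

/* One state-cycle step: the AND plane scores every template against  */
/* the input bus (best_match above), "mutual inhibition" reduces to   */
/* letting only the single best scorer through, and the OR plane jams */
/* that winner's whole output vector onto the bus for the next cycle. */
void one_cycle( const unsigned char *bus_in,
                unsigned char       *bus_out,
                const unsigned char *templates,    /* AND plane */
                const unsigned char *outputs,      /* OR  plane */
                long                 n_templates )
{
    long winner = best_match( bus_in, templates, n_templates );
    memcpy( bus_out, outputs + winner * VEC_BYTES, VEC_BYTES );
}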

And -that- gives us an architecture that will classify just about -anything- as one particular thing, which sounds a lot like what the brain does, and what AI systems tend not to do very well. Abstractly, we can borrow that nice state-space analytical paradigm, in which we are effectively dividing input space up into something like a Voronoi (?) diagram, according to which template point is closest to the input point -- which has the nice property that it is always possible to pick a closest point, breaking ties arbitrarily -- and then generating as result a corresponding new state.

Since we're insensitive to small changes of various sorts in the input vectors, we're "generalizing" to some extent; Alternately, thinking of the input vector as being sucked down to one of N fixpoints, we're doing deduction, suppressing noise and filling in missing input vector parts as we move the input vector to the closest template vector. In the right circumstances, this looks like "associative memory", recovering any single missing component of the input vector from the remaining parts, in other circumstances it looks like noise suppression by restoring a signal after a small perturbation.
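
The "sucked down to a fixpoint" picture is then just iterating that cycle until the state vector stops moving; a sketch (for pure cleanup/associative memory, the outputs would simply be the templates themselves):

/* Keep re-classifying the state vector until it stops changing:      */
/* noise gets squelched and missing fields get filled in as the       */
/* vector is pulled down to the nearest template's output.  The       */
/* iteration cap is arbitrary, in case the dynamics cycle forever.    */
void settle( unsigned char       *state,
             const unsigned char *templates,
             const unsigned char *outputs,
             long                 n_templates )
{
    unsigned char next[VEC_BYTES];
    int k;
    for (k = 0; k < 100; ++k) {
        one_cycle( state, next, templates, outputs, n_templates );
        if (memcmp( state, next, VEC_BYTES ) == 0)
            break;                                 /* reached a fixpoint */
        memcpy( state, next, VEC_BYTES );
    }
}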

It's logically (combinatorially?) necessary to deal with relationships between chunks, not just chunks in isolation: Abstractly, to assert/recognize tuples of chunks. This can be implemented by extending the vector, or by dealing with time-sequences of chunks. Extending the vector seems to require duplicating the machinery for each concept once for each tuple slot it can appear in, and learning its separate avatars independently; This seems inelegant to me, but in a parallel machine, learning it twice should take no more time than learning it once, so maybe this isn't actually bad. Time-sequences may not actually simplify things... *fuzzypeer*. Analytically, it seems much simpler to just think of a longer state vector containing multiple chunks.

That's about as far as we've gotten at the moment: The need for the "sum-of-" part of the "sum-of-products" structure is today's teeny step forward, along with focussing on the need for mutual inhibition.

How does the re-encoding manifest in the statespace abstraction? *peer* Looks like it just means each template point has a pointer to some other point (might be in same space or a different space). We can think of learning as moving around the template vectors, then, together with moving around the associated output vectors. Nothing new. Hrm. We need to account for the creation and destruction of points, too.

How does the extended vector manifest in the abstraction? *peer* Well, if the "memory" part of the purely input matching process is to function, it would appear that we need to publish the template vector matched, not just the output vector.

It seems silly, but we seem to need the usual PLA latch and separate input and output busses: We have to have a stable input vector all the time we're letting a stable output vector form through mutual inhibition. If one looks at the brain as an open flow-through from sensory to motor cortex, with no real feedback (which I often do), this is automatic. Otherwise, one appears to need something like a two-bus loop with a two-phase clock... that would be weirdly wonderful to see actually happening in the brain!
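
The latch-and-two-bus requirement is just double buffering; made explicit (again, purely a sketch building on the toy functions above):

/* Two busses, two phases: phase 1 holds bus_in steady while a stable */
/* bus_out forms via the winner-take-all match; phase 2 latches the   */
/* result back onto bus_in for the next go-round.                     */
unsigned char bus_in[VEC_BYTES], bus_out[VEC_BYTES];

void clock_tick( const unsigned char *templates,
                 const unsigned char *outputs,
                 long                 n_templates )
{
    one_cycle( bus_in, bus_out, templates, outputs, n_templates );  /* phase 1 */
    memcpy(    bus_in, bus_out, VEC_BYTES );                        /* phase 2 */
}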

Can we see far enough through the mist to somewhat model things vaguely like classic logic deduction rules? This basically means introducing something like free variables into our system: The ability to bind them from one tuple and use that to construct a new tuple.

*peer*...

Somehow we need to be able to deal with "John loves Mary" both as a brand-new datum we've never heard before, and also as a memorized fact subject to recall by feeding in "John loves ----" and getting back Mary.

HRM! That looks like a very constructive problem to focus on. We also need to be able to deal with "John loves Sue" being asserted when we had previously believed that "John loves Mary": In the extended-vector paradigm, that seems to be a time/way of introducing a new point. (I'd like new chunks to appear much the same way new tuples appear, if possible...?)
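
The "John loves ----" case ties back to the don't-care property at the top: zero out the slot you want filled in, match, and read that slot off the winner. A sketch, with a three-slot (relation, arg1, arg2) tuple layout that is purely my own invention:

#define SLOTS       3
#define TUPLE_BYTES (SLOTS * VEC_BYTES)

/* Recall by partial match: the query has its arg2 slot zeroed, so    */
/* that slot contributes nothing to the AND-popcount; the stored      */
/* "John loves Mary" tuple can still win, and the answer ("Mary") is  */
/* read off the winner's own arg2 field.                              */
const unsigned char *recall_arg2( const unsigned char *query,   /* arg2 zeroed */
                                  const unsigned char *tuples,
                                  long                 n_tuples )
{
    long winner = 0, t;
    int  best_score = -1;
    for (t = 0; t < n_tuples; ++t) {
        const unsigned char *tup = tuples + t * TUPLE_BYTES;
        int score = 0, i;
        for (i = 0; i < TUPLE_BYTES; ++i)
            score += bits_in_byte[ query[i] & tup[i] ];
        if (score > best_score) { best_score = score; winner = t; }
    }
    return tuples + winner * TUPLE_BYTES + 2 * VEC_BYTES;       /* the arg2 slot */
}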

Another useful point to focus on: The logical formulation "John loves Mary" seems perhaps a bit deceptive... the verb role may be quite different than the noun role. ?

We also need to be able to believe both "John loves Mary" and "John loves Sue" as being simultaneously true sometimes, and mutually exclusive other times.

Lurking in the shadows here is the fact that memories can be strong or weak, and beliefs likewise. We may have a dim recollection that someone maybe said John loves Sue, or we may have watched them mating like mink for hours on end... how does confidence in the accuracy of a memory manifest in the statespace model? One obvious try would be to introduce a "radius" associated with a template vector. But that doesn't seem right: Being more confident of a memory doesn't mean that we're increasing the fuzziness with which we identify John or such...??? In the neural model, it would most naturally seem to be either a property of how effectively we squelch competing assertions, or of how 'strongly' we assert our output vector... strength of assertion being something we haven't really introduced. We make deductions based on how confident we are of a fact, whereas we don't usually seem too aware of what the competing facts were that didn't get asserted. ? It would also not be right to make it a scalar property of the "love" chunk: We're scaling our confidence, not how much we think John loves her. We might be very sure indeed that he was only mildly in love with her.

We seem to be stuck just sticking a scalar intensity on the fact vector.

Hrm. How -do- we answer a question like "Who does Don love?" when the answer is "Mary, Sue, Peggy, um, Toni, Larissa..."? Humans aren't extremely -good- at answering such questions, but they -do- manage, and (significantly?) it doesn't seem to be some parallel process that produces all N answers in a constant-time KER-CHUNK computation, but something more like jittering the query and waiting to see how many different answers happen to drop out, one by one, following a law of diminishing returns: "... oh, and Lara, how could I possibly have forgotten her??!"

That seems to fit well with the model of memory storage as N discrete tuples/vectors, in a mechanism that really only delivers one at a time, in a process quite like the closest-point query model, in that a given query tends to just give one or two answers, and to get more answers, one must jitter the query vector a bit to get closer to some other point -- "associative memory", what an emptily apt phrase :).

It seems to especially fit well with the way we answer queries like "Can you think of anyone who has been to Paris and is in love with someone with a Prince Albert?"... we just throw random, somewhat relevant stimuli in, waiting for the mapping to pop out something matching the query, chaining along via largely irrelevant associations... there doesn't seem to be any evidence at all that the associations one -needs- are always magically efficiently indexed, just that associations -are- what is stored, and that the ones one uses frequently wind up efficiently indexed.
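
That jitter-and-see-what-drops-out process is easy to caricature; a sketch, with rand() standing in for whatever noise source the real machinery uses, and the bit-flip count pulled out of thin air:

#include <stdlib.h>                     /* rand */

/* One probe: flip a few random bits of the query and rerun the       */
/* winner-take-all match.  Repeated probes drop answers out one at a  */
/* time, with diminishing returns, rather than all N in one KER-CHUNK.*/
long probe_with_jitter( const unsigned char *query,
                        const unsigned char *templates,
                        long                 n_templates )
{
    unsigned char jittered[VEC_BYTES];
    int i;
    memcpy( jittered, query, VEC_BYTES );
    for (i = 0; i < 4; ++i)             /* flip a handful of random bits */
        jittered[ rand() % VEC_BYTES ] ^= (unsigned char)(1 << (rand() % 8));
    return best_match( jittered, templates, n_templates );
}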

It seems that we may need the -radius- concept anyhow, however: This seems to correspond to the notion of a general rule with specific exceptions: Numbers have multiplicative inverses, -except- for zero. Zero is a small, precise exception in a big receptive field. We don't seem to do anything like separately learning that numbers greater than zero have inverses, and also that those less than zero do: We just cleanly punch one exception through a single generalization. Unless we're trying to solve the problem at the wrong level of abstraction, that seems to suggest that the template points aren't matched to the input just by a nearest-neighbor match, but by an actual comparison to a receptive field which does -not- have to be mutually exclusive with other template points, but rather which can overlap them arbitrarily: One has something more like a "confidence fall-off" function for each template point, and beyond some point the query just doesn't match it to speak of: The appropriate model is less "which point are we closest to?" than "which spheres are we inside?". We can make them ellipsoids by allowing a different radius along each axis, of course... :).
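
In the continuous statespace picture, the which-spheres-are-we-inside test is just an ellipsoid membership check with one radius per axis; a sketch (dimension count and names arbitrary):

#define DIMS 128

/* Receptive-field test: a query "matches to speak of" only if it     */
/* falls inside this template's ellipsoid.  Fields may overlap, so a  */
/* small, precise exception (zero) can sit inside a big general rule. */
int inside_field( const float *query,
                  const float *center,
                  const float *radius )            /* one radius per axis */
{
    float d = 0.0f;
    int i;
    for (i = 0; i < DIMS; ++i) {
        float z = (query[i] - center[i]) / radius[i];
        d += z * z;
    }
    return d <= 1.0f;
}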