5 Comments

Merging patterns together based on similarity in the data is not what the brain does.

How does the brain classify "dog bark" as a concept when the auditory data of the bark has nothing in common with the visual data patterns of what a dog looks like when it barks? The brain will classify these auditory and visual patterns together as the same concept, even though there are no shared patterns in the data. (I do know the answer to this.) But data similarity is not the right answer.

"Pattern" is a very broad concept, which accordingly can have different implementations.

If a visual scene description is combined (by a “data fusion” process) with audio information into one "data frame", then a pattern can include both the dog's visual representation and its audio one.
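
For illustration, a minimal sketch (Python, with invented field names) of what such a fused data frame could look like:

```python
# Hypothetical "data frame" fusing visual and audio features from one moment.
# Field names and values are invented for illustration only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FusedFrame:
    timestamp: float                 # when the frame was captured
    visual_features: List[float]     # e.g. output of a vision front end
    audio_features: List[float]      # e.g. output of an audio front end

# A single pattern can now span both modalities because they live in one record.
frame = FusedFrame(
    timestamp=12.5,
    visual_features=[0.9, 0.1, 0.7],   # "dog-shaped" visual evidence (made up)
    audio_features=[0.8, 0.2],         # "bark-like" audio evidence (made up)
)
print(frame)
```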

As for the natural brain, in my opinion, in different situations and at different phases of information processing, different brain structures use all known methods of information processing - plus some as yet unknown :-)

Sure, we can create algorithms that cluster audio and visual data patterns to define "sameness" by putting them "in the same frame," as you say. But the important question remains: how are different patterns selected to be combined?

If I have a visual data pattern of 100010101111001 and an audio data pattern of 001001011000101, why would the algorithm choose to combine them to form a concept of "sameness"?

If the patterns were 10001010 and 1001010, we could see some common "sameness" at work in the data itself: the 001010 pattern is shared. But this concept of "sameness" is not useful.

So my point was that matching sensory data together based on patterns in the data itself is not useful. At least with visual pixel data from a camera, we can match 2D patterns across different parts of the sensory field, and there are reasons to do that (to detect the motion of a shape across the field, for example). But looking for the same data patterns in two different modalities has no use, because how the data is encoded defines the patterns you might match; in that case you are matching based on sensor encoding, not on features of the external world.
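
To make that concrete, here is a small sketch (Python, with made-up bit strings) showing why raw bit-level overlap between two modalities is an artifact of the encoders rather than evidence of a shared concept:

```python
# Sketch: bit-level overlap between two sensor streams reflects the encoding,
# not the external world. The encodings below are invented for illustration.

def shared_substrings(a: str, b: str, min_len: int = 4) -> set:
    """Return substrings of length >= min_len that appear in both bit strings."""
    found = set()
    for i in range(len(a)):
        for j in range(i + min_len, len(a) + 1):
            if a[i:j] in b:
                found.add(a[i:j])
    return found

visual = "100010101111001"   # made-up pixel encoding of "dog barking"
audio  = "001001011000101"   # made-up waveform encoding of the bark

# Any overlap found here is an accident of the two encoders, so it cannot
# serve as evidence that the two patterns belong to the same concept.
print(shared_substrings(visual, audio))
```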

So, to be clear, here is how I'm thinking about all this.

The point of "combining data" is that the sensory input data of an AGI system needs to be many orders of magnitude larger in bandwidth than the effector output data. Our brain has far more sensory data from eyes, ears, and touch flowing into it than the very small outflow of effector control signals it produces to activate our muscles. There is an obvious data compression/reduction required to get from high-bandwidth sensory data to low-bandwidth effector data.

But at the same time, the point of the sensory data is to establish context for learning and context for behavior. It acts as the "address" in a big behavior-lookup system where different sensory patterns trigger different learned behaviors. We learn that when we are in the context of driving a car, our behavior is very different than when we are sitting at a table eating a meal. The sensory data from the environment defines the context of the behavior we produce.

But since there is a massive data reduction involved in the translation of sensory context data to effector output data, there must be a lossy data-combination process at work. Thousands of different sensory input patterns will generate the same single behavior output pattern. These thousands of different sensory input patterns must be "clustered" together to create the one output behavior (like extending my index finger).

So to create AGI, we must build this data clustering algorithm that decides that sensory input patterns X, Y, Z (plus 1000 more) are all clustered together to trigger "straighten index finger on right hand." How does this work? How does the brain do this clustering? Is it one algorithm or multiple?
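
A toy sketch of that many-to-one funnel (Python, with placeholder context names; in a real system the grouping would be learned, not hand-written):

```python
# Sketch of the many-to-one "funnel" described above: many distinct sensory
# contexts all map to one small effector command. The cluster labels and the
# command name are placeholders, not a claim about how the brain labels them.

effector_command = "straighten_right_index_finger"

# In a real system the keys would be high-dimensional sensory patterns and the
# grouping would be learned; here the clustering is hand-written to show shape.
context_to_behavior = {
    "finger_on_doorbell": effector_command,
    "pointing_at_map":    effector_command,
    "typing_letter_j":    effector_command,
    # ... thousands more contexts collapse onto the same output ...
}

def act(sensory_context: str) -> str:
    return context_to_behavior.get(sensory_context, "no_learned_behavior")

print(act("pointing_at_map"))
```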

Each neuron in the brain acts as a pattern-cluster "detector" in that it will fire when it sees any of the many upstream patterns activating its synapses. All these micro-level pattern detectors are connected in a large, complex network that forms one big macro-level clustering system, mapping raw sensory inputs to effector outputs in a massive funneling (clustering) effect.
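
A minimal sketch of such a micro-level detector (Python; the cosine-similarity threshold is just one assumed way to implement "fires on any of its stored patterns"):

```python
import numpy as np

# Sketch of a single threshold unit acting as a micro-level pattern detector:
# it fires when the current input is close enough to any input pattern its
# synaptic weights have adapted to. Weights and threshold are illustrative.

class PatternDetector:
    def __init__(self, stored_patterns, threshold=0.8):
        # One weight vector per pattern this unit has learned to respond to.
        self.patterns = [np.asarray(p, dtype=float) for p in stored_patterns]
        self.threshold = threshold

    def fires(self, x) -> bool:
        x = np.asarray(x, dtype=float)
        for p in self.patterns:
            # Cosine similarity between the input and a stored pattern.
            sim = x @ p / (np.linalg.norm(x) * np.linalg.norm(p) + 1e-9)
            if sim >= self.threshold:
                return True
        return False

unit = PatternDetector([[1, 0, 1, 1], [0, 1, 1, 0]])
print(unit.fires([1, 0, 1, 0.9]))   # close to the first stored pattern -> True
print(unit.fires([0, 0, 0, 1]))     # matches nothing well -> False
```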

In your perception concept gap, I see the issue as being that our algorithms are not clustering the data the same way the brain does, and as a result most of our AI systems act in strange ways.

So in this article, you talk about clustering by "patterns" to define "objects," and about the idea of defining "sameness" as "similar patterns." For example, we see a "square" as a pattern of edges oriented at 90 degrees forming a closed shape, and we can write code to try to identify "squares" in a visual image, or "moving squares" across multiple images. Using logic like this we can produce interesting, useful code, like a self-driving car that can spot a stop sign by its shape patterns and colors.

But this is not how the brain works. It doesn't cluster by similar patterns.

Since we have a learning brain, its mapping from sensory to effector data must be learned. So there is some sort of RL algorithm that figures out that mapping and evolves it over time. But mapping directly from high-dimensional raw sensory data to effector outputs using RL is totally impossible due to the curse of dimensionality. The learning space is so large that the amount of data required to train it, and the amount of time it takes to experience that data in real time, exceeds the age of the universe to learn what a child learns in a day. So though RL tells the mapping what to do, it can't on its own solve the data reduction problem.
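
A back-of-the-envelope sketch of that argument (Python; all numbers are assumptions for illustration):

```python
# Back-of-the-envelope sketch of the curse-of-dimensionality argument above.
# All numbers are illustrative assumptions, not measurements.

raw_pixels = 1_000_000               # assumed resolution, binary pixels
compressed_features = 100            # assumed perception front-end output size
frames_per_second = 10               # assumed rate of experience
age_of_universe_s = 4.3e17           # seconds, rough figure

raw_states_log2 = raw_pixels          # log2 of the number of distinct raw inputs
samples_in_universe_lifetime = frames_per_second * age_of_universe_s

print(f"distinct raw inputs: 2^{raw_states_log2}")
print(f"frames experienced in a universe lifetime: {samples_in_universe_lifetime:.1e}")
# Even ~4.3e18 frames (about 2^62) cannot begin to cover a 2^1,000,000 state
# space, which is why the reduction to ~100 features has to happen before RL
# ever sees the data.
```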

In more concrete terms, RL teaches us how to react to a cat, but it doesn't build the pattern-detecting network that can turn raw sensory data into the signal that drives learning, the signal that indicates there is a cat in the room. Millions of pixels must be reduced by some clustering algorithm to a single pixel with the meaning of "cat," for example. This is the perception problem, and the perception problem is not solved by RL. It's solved by using unsupervised learning on the raw data itself. The data tells us how to classify it, not an external reward signal.

And when we write code to translate raw pixel data into a "stop sign" signal, we have hard-coded our idea of a stop sign into thousands of lines of code that do the data reduction of turning millions of input pixels into a single bit telling us "yes, there is a stop sign in this image" or "no, there is not."

This is an old and long-standing problem in writing "smart" code. Even when we use generic learning systems to train a network how to react to its inputs, we seldom use the raw sensory data as inputs; we instead pre-process it into something simpler that fits the needs of whatever application we are trying to apply the "smarts" to.

But to solve AGI, we can't hard-code the perception system. When we try to do that, we get the problems you talk about in this series of posts. Important "objects" are not recognized, and then the self-driving car runs into the rhino in the middle of the road as if it weren't there at all. The lossy-compression perception system took the information about the rhino and threw it all out with the bath water.

So how do we do generic lossy compression and not throw the baby out with the bath water if we don't know what is "important" to the problem? Is it even possible? The answer is yes, it's possible, because I figured out how to do it.

Sensory data has large amounts of redundancy in it, and by using information maximization we can translate redundant data, with lots of correlations between data points, into a format with very low correlations and maximum information. This sort of preprocessing of the sensory data is required to make learning faster and easier. And when you do it correctly, it can explain how humans learn so fast.
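
One common reading of this kind of preprocessing is decorrelation, e.g. PCA whitening; a small sketch on synthetic data (not a claim about the exact method being described here):

```python
import numpy as np

# Sketch of redundancy reduction by decorrelation (PCA whitening), one common
# reading of "information maximizing" preprocessing. Data here is synthetic.

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 2))
mixing = np.array([[1.0, 0.9],       # strongly correlated "sensor" channels
                   [0.9, 1.0]])
x = latent @ mixing                  # redundant raw sensory data

x = x - x.mean(axis=0)
cov = np.cov(x, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
whitened = x @ eigvecs / np.sqrt(eigvals + 1e-9)

print(np.round(np.cov(x, rowvar=False), 2))         # large off-diagonal terms
print(np.round(np.cov(whitened, rowvar=False), 2))  # ~identity: correlations gone
```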

The key thing that most of the world of AGI work is missing is that they are not using temporal correlations to do this data reduction.

The brain defines an "object" as the set of all sensory patterns that are highly correlated ACROSS TIME. That is, sensory patterns that tend to happen close together in time need to be clustered together to define objects.

So how does the above idea help an AGI learn to recognize an object like a cat? When a cat walks into a room (two just happened to walk into my room), its presence in the area causes a stream of sensory patterns to flow into my brain. I catch sight of the cats moving in the room. I hear them meow. One cat just went to the food dish and made sounds crunching the dry food, and the dog tag on its collar clinked against the metal dish as it was eating. These are all sensory data patterns that happened because there was a cat in the room, and they all happened within seconds of each other, because the cat was in the room creating these sensory patterns.

The brain can merge all these different visual and auditory patterns into the same "cat detector" network to form the "cat" object detector. And it can all happen simply because these patterns tend to show up close together in time. When I see a cat, the odds of a "cat eating" sound pattern become high, for example. These different visual and auditory patterns are correlated ACROSS TIME.

So the correct way to define patterns is to cluster BY TIME, not to cluster by "similar data patterns." The brain doesn't need to learn to "normalize" data, say by rotation and scaling, as we do in visual graphics. That all happens AUTOMATICALLY if you just cluster by TIME. Patterns that show higher correlations in time are combined into the same clustering group to define "objects" or "features," or the output of the perception system.
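
A toy sketch of clustering by temporal co-occurrence (Python; the event stream, window size, and threshold are invented for illustration):

```python
from collections import defaultdict
from itertools import combinations

# Sketch of "clustering by time": patterns whose detectors fire within a short
# window of each other are grouped into the same object cluster.

events = [  # (time in seconds, pattern detector that fired) -- invented data
    (1.0, "cat_shape"), (1.4, "meow_sound"), (2.1, "cat_shape"),
    (2.3, "food_crunch"), (9.0, "doorbell"), (9.2, "door_shape"),
]
WINDOW = 2.0   # assumed co-occurrence window

cooccur = defaultdict(int)
for (t1, a), (t2, b) in combinations(events, 2):
    if a != b and abs(t1 - t2) <= WINDOW:
        cooccur[frozenset((a, b))] += 1

# Union-find style merge: any pair that co-occurred often enough joins a cluster.
parent = {}
def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

for pair, count in cooccur.items():
    if count >= 1:                      # co-occurrence threshold (assumed)
        a, b = tuple(pair)
        parent[find(a)] = find(b)

clusters = defaultdict(set)
for _, name in events:
    clusters[find(name)].add(name)
print(list(clusters.values()))
# -> one cluster {cat_shape, meow_sound, food_crunch}, another {doorbell, door_shape}
```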

Clustering by time is all you need to create perception, which is the problem of producing a useful data reduction system that translates high-volume raw sensory data into compressed, small-bandwidth context feature patterns to drive behavior learning.

If you cluster by time correctly, there will be no concept gap, because that is how the brain defines objects (and concepts, and things). Whatever you want to call it, it's how the brain does perception clustering.

Everything is relative in perception. Sizes are relative, brightness is relative, quantities (counts) are relative, positions are relative, etc. There is no way to memorize absolute values. So the best way to represent a pattern is not a "normalized fragment" but a combination of relative values. Then absolute sizes, brightness, total quantities, etc. can all change, but the relative values of the parts of an object will remain the same for the same type of object. To capture relative values, represent them as ratios. Ratios can be re-recognized because they are discrete / symbolic, provided you use a particular resolution, which ideally is determined by the Just Noticeable Difference as discovered by Weber and Fechner in psychophysics.
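
A small sketch of that idea (Python; the 2% Weber fraction is an assumed, illustrative value):

```python
import math

# Sketch of representing a relative value as a discrete symbol: quantize the
# log of a ratio in steps of the Weber fraction (Just Noticeable Difference).
# The 2% Weber fraction below is an assumed, illustrative value.

WEBER_FRACTION = 0.02

def ratio_symbol(a: float, b: float) -> int:
    """Discrete code for the ratio a/b; equal codes mean 'perceptually the same ratio'."""
    return round(math.log(a / b) / math.log(1 + WEBER_FRACTION))

# The same object at two different absolute sizes yields the same code,
# because only the ratios between its parts are encoded.
small = (10.0, 15.0)    # part sizes of an object, arbitrary units
large = (40.0, 60.0)    # same object, 4x bigger
print(ratio_symbol(*small), ratio_symbol(*large))   # identical symbols
```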

Right. So normalization applies to raw sensory data; such normalization allows us to correctly compare fragments of raw data from different frames (which are affected by automatic adjustment by the camera device and so on). After that, the relative values (required for pattern detection) are calculated using the normalized "absolute" values.
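
A minimal sketch of that two-step idea on synthetic frames (Python; the z-score normalization is just one assumed way to cancel camera gain and offset):

```python
import numpy as np

# Sketch of the two-step idea above: first normalize raw frames so that camera
# auto-exposure differences cancel out, then compute relative values from the
# normalized data. The frame values are synthetic.

def normalize(frame: np.ndarray) -> np.ndarray:
    """Remove per-frame gain/offset so fragments from different frames compare."""
    return (frame - frame.mean()) / (frame.std() + 1e-9)

rng = np.random.default_rng(1)
scene = rng.uniform(0.2, 0.8, size=(4, 4))
frame_a = scene * 1.0            # baseline exposure
frame_b = scene * 1.7 + 0.1      # same scene, camera auto-adjusted

a, b = normalize(frame_a), normalize(frame_b)
print(np.allclose(a, b))                       # True: the exposure change is gone
print(np.allclose(a[0] - a[1], b[0] - b[1]))   # relative values now match across frames
```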
