The second of the three ways to overcome the perception-concept gap listed in AGI: PERCEPTION-CONCEPT GAP proposes looking for patterns in sensory data as a way to detect objects, including previously unknown ones, and thereby eliminate the perception-concept gap.
Sensory data reflecting a situation in the natural environment is usually a sequence of frames, where each frame is an array of pixels or voxels. If other data is available along with this, the approach remains the same.
The discussed approach assumes that the same object (or a type of object) forms similar patterns in different frames. Detection of a known pattern corresponds to the identification of a known object, and a new pattern indicates the presence of a hitherto unknown object.
Building algorithms for comparing patterns (during identification) and searching for unknown patterns requires solving several subproblems.
Firstly, differences in illumination, in distance to an object, and in its location and orientation produce different pixel brightness and color values from frame to frame. This means that it is not the raw sensory data that should be compared but normalized fragments of it (that is, fragments reduced to a standard resolution, orientation, brightness, etc.). Since the desired pattern is a frame fragment, a series of positions of the potential pattern must be tested against different frame fragments, which significantly increases the complexity of the process but is not a fundamental obstacle. Identification of a known object then amounts to a sufficient level of similarity between the test fragment and a remembered one. Constructing the similarity criterion is a separate task, which we will not touch upon now.
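A minimal sketch of this normalize-then-compare step might look as follows (Python; the fragment size, the zero-mean/unit-variance brightness normalization, the correlation-based similarity measure, and its threshold are all illustrative assumptions rather than anything prescribed above, and orientation normalization is omitted for brevity):

    import numpy as np

    PATCH = 32           # assumed standard resolution for normalized fragments
    SIM_THRESHOLD = 0.9  # assumed similarity level that counts as identification

    def normalize(fragment: np.ndarray) -> np.ndarray:
        """Reduce a grayscale frame fragment to a standard resolution and brightness."""
        h, w = fragment.shape
        # Crude nearest-neighbour resampling to PATCH x PATCH.
        rows = np.arange(PATCH) * h // PATCH
        cols = np.arange(PATCH) * w // PATCH
        patch = fragment[np.ix_(rows, cols)].astype(np.float64)
        # Remove brightness/contrast differences (zero mean, unit variance).
        patch -= patch.mean()
        std = patch.std()
        return patch / std if std > 0 else patch

    def similarity(a: np.ndarray, b: np.ndarray) -> float:
        """Normalized cross-correlation of two already-normalized patches."""
        return float((a * b).mean())

    def identify(fragment: np.ndarray, known_patterns: list) -> int | None:
        """Return the index of a sufficiently similar known pattern, if any."""
        probe = normalize(fragment)
        for i, pattern in enumerate(known_patterns):
            if similarity(probe, pattern) >= SIM_THRESHOLD:
                return i
        return None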
Secondly, when it comes to discovering new objects as patterns, you need to decide what is compared with what. At first glance, there are no particular problems: given two frames, we go through their normalized fragments and look for reasonably similar pairs that do not match the set of known objects. The process is computationally intensive, but, as with identification, this is not a fundamental obstacle (in particular, because the normalization of fragments can be chosen quite coarsely).
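Continuing the sketch above (and reusing its hypothetical normalize, similarity, and identify helpers), a brute-force version of this search could look like this; the fragment size, stride, and threshold are again arbitrary choices:

    def find_new_patterns(frame_a, frame_b, known_patterns,
                          size=32, stride=16, threshold=0.9):
        """Compare fragments of two frames and collect similar pairs
        that match no known pattern (candidate new objects)."""
        def fragments(frame):
            h, w = frame.shape
            for y in range(0, h - size + 1, stride):
                for x in range(0, w - size + 1, stride):
                    yield frame[y:y + size, x:x + size]

        candidates = []
        for fa in fragments(frame_a):
            if identify(fa, known_patterns) is not None:
                continue  # a known object, not a new pattern
            na = normalize(fa)
            for fb in fragments(frame_b):
                if similarity(na, normalize(fb)) >= threshold:
                    candidates.append(na)  # repeated unknown pattern found
                    break
        return candidates

The coarse stride and coarse normalization are what keep this quadratic fragment-by-fragment comparison tractable.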
A key aspect of finding new patterns is selecting pairs of frames for comparison. One of them is obviously the last one received. Which frame should we choose as its partner? The penultimate frame (or one of the other recent ones) is clearly not suitable, since it differs little from the last: such a choice would mean that almost everything in the pair of frames is the same, and nearly the whole frame would be perceived as an object of a new type. As a result, the set of patterns would simply become a significant part of the stored set of frames (rather than of the objects in them), which obviously does not correspond to the original goal. Masking known objects in both frames of the compared pair (excluding the part of the frame occupied by them) does not help either: the unknown object found by comparing two adjacent frames would then be everything except the mask, which is also unacceptable.
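The problem with adjacent frames can be made concrete with a trivial check (a sketch; the pixel tolerance is an arbitrary assumption):

    import numpy as np

    def frame_overlap(frame_a: np.ndarray, frame_b: np.ndarray,
                      tol: float = 0.01) -> float:
        """Fraction of pixels that are (almost) identical in two frames.
        For adjacent video frames this is typically close to 1.0, so a
        pairwise pattern search would report nearly the whole frame as
        one big repeated pattern instead of isolating objects."""
        a = frame_a.astype(np.float64) / 255.0
        b = frame_b.astype(np.float64) / 255.0
        return float((np.abs(a - b) < tol).mean())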
This means that, to find new patterns, pairs of frames should be selected that are sufficiently distant in time - then, if both frames contain the same object while the rest of their content differs, the goal of finding a pattern and generating a concept from it can be achieved. To implement such an algorithm, it is naturally necessary to remember quite a lot of frames, that is, to store a large amount of sensory data. This raises the resource requirements - a large amount of data must not only be stored but also scanned many times to detect patterns. But this, again, is not a fundamental problem.
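One way to organize such a memory is a bounded buffer from which temporally distant partners are drawn for comparison with the latest frame (a sketch; the capacity and the minimum temporal gap are illustrative assumptions):

    from collections import deque

    class FrameMemory:
        """Stores recent frames so the newest one can be compared
        against sufficiently old ones when searching for patterns."""

        def __init__(self, capacity=1000, min_gap=100):
            self.frames = deque(maxlen=capacity)  # oldest frames fall off
            self.min_gap = min_gap                # minimum distance in frames

        def add(self, frame):
            self.frames.append(frame)

        def distant_pairs(self):
            """Yield (latest, older) pairs separated by at least min_gap."""
            if len(self.frames) <= self.min_gap:
                return
            latest = self.frames[-1]
            for i in range(len(self.frames) - 1 - self.min_gap, -1, -1):
                yield latest, self.frames[i]

Each arriving frame would be passed to add(), after which every pair produced by distant_pairs() is fed to a search such as find_new_patterns() above; the bounded deque makes the storage cost explicit but does nothing to reduce the repeated scanning the text mentions.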
However, there is one more snag. The original formulation implicitly assumes that a new object is detected immediately upon its appearance, which allows an immediate response to the current situation (a rhinoceros comes into the field of view of a car autopilot, the autopilot detects it as something hitherto unknown and takes the necessary actions, avoiding an accident). But any algorithm that searches for patterns, that is, for repeating combinations of certain elements, fundamentally cannot provide this, since only the repeated appearance of the unknown object (the rhinoceros) in the field of view makes it possible to detect a pattern and thereby detect the object. At its first appearance, a potential new object is “invisible”, since there is no pattern yet, and the time interval between the frames that ensure detection of the pattern must, as we already know, be significant.
SUMMATION
Detection of new types of objects by searching for patterns is possible in principle, but it requires enormous computing resources and the storage of a large amount of sensory data.
Along with this, there is a fundamentally inescapable disadvantage - the impossibility of reacting immediately to the appearance of an unknown object in the field of view, since the object must appear at least twice, in different surroundings, for the pattern to be detected (this is what allows the surroundings to be separated from the desired object). The latter circumstance makes this approach unsuitable for systems where an immediate response to the appearance of unknown objects is a necessary condition.
Merging patterns together based on data similarity is not what the brain does.
How does the brain classify "dog bark" as a concept when the auditory data of the bark has nothing similar to the visual data patterns of what a dog looks like when it barks? The brain will classify these auditory and visual patterns together as the same concept, even though there are no shared patterns in the data. (I do know the answer to this.) But data similarity is not the right answer.
Everything is relative in perception. Sizes are relative, brightness is relative, quantities (counts) are relative, positions are relative, etc. There is no way to memorize absolute values. So the best way to represent a pattern is not a "normalized fragment" but a combination of relative values. Then absolute sizes, brightness, total quantities, etc. can all change, but the relative values of the parts of an object will remain the same for the same type of object. To capture relative values, represent them as ratios. Ratios can be re-recognized because they are discrete/symbolic, provided you use a particular resolution, which ideally is determined by the Just Noticeable Difference as discovered by Weber & Fechner in psychophysics.
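One way to make such ratios discrete and re-recognizable is to quantize them on a logarithmic scale whose step plays the role of a just-noticeable difference, so that two ratios map to the same symbol exactly when they differ by less than one JND (a sketch; the 8% Weber fraction is a rough illustrative value, and real JNDs differ per stimulus dimension):

    import math

    WEBER_FRACTION = 0.08  # assumed JND step: ~8% relative change

    def quantize_ratio(value_a: float, value_b: float) -> int:
        """Turn the ratio of two measurements (sizes, brightnesses,
        counts, ...) into a discrete symbol. Equal steps in log space
        correspond to equal *relative* steps in ratio space, which is
        the Weber-Fechner relation."""
        ratio = value_a / value_b
        return round(math.log(ratio) / math.log(1.0 + WEBER_FRACTION))

    # Example: the same object seen at two distances. Absolute sizes
    # differ, but the part-to-part ratio (and hence its symbol) does not.
    assert quantize_ratio(120.0, 40.0) == quantize_ratio(30.0, 10.0)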