The third and final way to overcome the perception-concept gap listed in AGI: PERCEPTION-CONCEPT GAP is to search for image features in video frames and use them to construct new concepts when unknown objects appear in the field of view. To illustrate it, we borrowed a picture from an article that mentions this approach:
Concepts is All You Need: A More Direct Path to AGI.
Peter Voss, Mlađan Jovanović
https://arxiv.org/abs/2309.01622v1
Using features instead of a raw pixel/voxel array seems rational for at least two reasons: the amount of data is significantly reduced, and features can be found using existing tools.
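The data-reduction argument can be made concrete with a toy sketch. Everything here is an assumption made for illustration: the synthetic binary frame, the helper names (make_frame, pixel, corner_features), and the deliberately crude corner rule, which only works for filled, axis-aligned rectangles. It is not the feature extractor used in the cited article.

```python
# Toy illustration only: frame contents, function names, and the corner rule
# are hypothetical and chosen to keep the example self-contained.

def make_frame(width, height, rectangles):
    """Return a binary frame (list of rows) with filled rectangles set to 1."""
    frame = [[0] * width for _ in range(height)]
    for x0, y0, w, h in rectangles:
        for y in range(y0, y0 + h):
            for x in range(x0, x0 + w):
                frame[y][x] = 1
    return frame


def pixel(frame, x, y):
    """Read a pixel, treating everything outside the frame as background."""
    if 0 <= y < len(frame) and 0 <= x < len(frame[0]):
        return frame[y][x]
    return 0


def corner_features(frame):
    """Return (x, y) of foreground pixels that have exactly one horizontal
    and exactly one vertical foreground neighbour (rectangle corners)."""
    corners = []
    for y, row in enumerate(frame):
        for x, value in enumerate(row):
            if value != 1:
                continue
            horizontal = pixel(frame, x - 1, y) + pixel(frame, x + 1, y)
            vertical = pixel(frame, x, y - 1) + pixel(frame, x, y + 1)
            if horizontal == 1 and vertical == 1:
                corners.append((x, y))
    return corners


frame = make_frame(64, 64, [(5, 5, 10, 8), (30, 40, 12, 6)])
features = corner_features(frame)

raw_values = len(frame) * len(frame[0])  # numbers in the pixel array
feature_values = 2 * len(features)       # two coordinates per feature
print(raw_values, feature_values)        # 4096 16
```

In this sketch, eight corner points replace 4096 pixel values. The feature list carries no indication of which corners belong to which object.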
For further analysis, it is helpful to remember that the essence of eliminating the perception-concept gap is to develop methods for detecting hitherto unknown objects, and this detection must be fast enough to allow an adequate response to the appearance of these objects. Obviously, no approach based solely on preliminary training in recognizing a predetermined set of objects can provide this.
Identifying all known objects in the current frame can reduce the amount of data to be analyzed (everything belonging to known objects is excluded). Still, it cannot tell us whether the remainder contains unknown objects, how many there are, or where they are located in the frame.
Returning to the approach of replacing the original data with a set of features, we find that the set of features, like the pixel/voxel array, describes the entire observed scene, which in the general case is a combination of several known objects, several unknown ones, and other things that play the role of the background. Clearly, treating everything observed except the recognized objects as a single new unknown object is not a solution. If the system does not have the knowledge to identify birds, dogs, cats, and fish (right part of the picture below), then their appearance in the frame would produce an object “bird+dog+cat+fish,” different from cat, dog, bird, and fish, both individually and in various combinations.
Converting the raw sensory data into a set of features does not solve the perception-concept gap problem - only the form in which sensory data is represented changes. This means that using features is not an alternative approach to solving our problem but only an alternative representation of sensory data. Having obtained a set of features, we return to the original problem of separating individual objects from the description of the observed scene (now a set of features). The conjugate (dual) formulation is the problem of assembling features into groups corresponding to individual objects - including unknown objects for which there are no samples for comparison.
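A naive sketch shows why assembling features into groups is a problem in its own right. It assumes that spatial proximity alone decides which features belong together. The function name group_by_proximity, the max_distance parameter, and the feature coordinates are hypothetical, and single-linkage grouping stands in for whatever method a real system would use.

```python
# Naive illustration: grouping unlabeled feature points by distance only.
# Names, coordinates, and thresholds are assumptions for this sketch.

def group_by_proximity(points, max_distance):
    """Single-linkage grouping: two points no farther apart than
    max_distance are placed (transitively) in the same group."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            if dx * dx + dy * dy <= max_distance ** 2:
                parent[find(i)] = find(j)

    groups = {}
    for i, point in enumerate(points):
        groups.setdefault(find(i), []).append(point)
    return list(groups.values())


# Feature points produced by two hypothetical objects, with no labels attached.
features = [(5, 5), (14, 5), (5, 12), (14, 12),
            (30, 40), (41, 40), (30, 45), (41, 45)]

print(len(group_by_proximity(features, 15)))  # 2: one group per object
print(len(group_by_proximity(features, 50)))  # 1: both objects merged
print(len(group_by_proximity(features, 3)))   # 8: every feature on its own
```

The same features yield one, two, or eight "objects" depending on a threshold that has to be chosen in advance. The merged case corresponds to the composite “bird+dog+cat+fish” object described above.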
At the same time, the positive effect - reducing the amount of data describing the observed scene - has, as usual, a downside. Some of the information present in the pixel representation of sensory data is inevitably lost. In particular, one should expect problems in detecting and identifying objects that have blurred contours and/or small angular dimensions if the set of features used does not include features designed specifically for these situations.
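A one-dimensional toy example illustrates the loss for blurred contours. It assumes a simple edge feature defined as an intensity jump between neighbouring samples that reaches a fixed threshold. The function edge_features, the intensity profiles, and the threshold value are hypothetical.

```python
# Toy illustration: a threshold-based edge feature applied to a crisp and a
# blurred intensity profile. All names and values are assumptions.

def edge_features(profile, threshold):
    """Return indices where the jump from the previous sample reaches threshold."""
    return [i for i in range(1, len(profile))
            if abs(profile[i] - profile[i - 1]) >= threshold]


sharp = [0, 0, 0, 100, 100, 100]    # object with a crisp contour
blurred = [0, 20, 40, 60, 80, 100]  # same overall contrast, smeared contour

print(edge_features(sharp, 50))    # [3]
print(edge_features(blurred, 50))  # []
```

Both profiles are clearly different in the raw data, but with this particular feature definition the blurred object leaves no trace in the feature set.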
Particular mention should be made of segmentation, which represents the observed picture as a set of segments. A segment is a type of feature, and segmentation does not exclude the use of other features. Unlike other kinds of features, segmentation does not lead to a radical reduction in the volume of transformed sensory data.
SUMMATION
Of the three initially designated ways to overcome the perception-concept gap, our analysis leaves only two (clustering and pattern search), along with two options for representing sensory data: a raw pixel/voxel array or a set of features built from it.
At the same time, clustering, if it can be used at all, would have to rely on new ideas that remain to be found. The method based on pattern search is workable but computationally demanding and fundamentally does not allow for an immediate response to the appearance of unknown objects.
This result can hardly be considered a surprise: the area is being intensively developed, but there are no ready-made solutions yet, and our analysis explains why this is so.
Meanwhile, it is evident that people and animals can detect unknown objects, and they do this so quickly that an adequate response to the appearance of these objects is ensured. Analyzing how visual information is processed in nature is helpful for finding new approaches; the next chapter is dedicated to just that.