Before continuing the analysis of ways to overcome the perception-concept gap, begun in the previous chapter, AGI: PERCEPTION-CONCEPT GAP - WHAT IS POSSIBLE AND WHAT IS NOT, let us clarify the terminology we use and the essence of the problem under discussion.
In numerous recent publications devoted to the practical use of object detection and tracking technology, several assumptions are implicit:
A neural network is the primary tool
The neural network is "trained" before the operation stage (in fact, it is formed) on a training data set prepared by people and covering all types of objects that can, in one way or another, affect the system's operation
System capabilities do not change during operation
Only the movements of objects the neural network has been trained to recognize are tracked
The fact that these features usually go unstated, firstly, voluntarily or involuntarily creates the impression that the problem of detecting and tracking objects has been solved in principle and that all that remains is to configure one of the available tools correctly, and secondly, implies that fundamentally different approaches, even if conceivable, are not needed.
The abovementioned aspects make the approach practically unsuitable in many cases: if an object does not belong to one of the types that make up the training set, it cannot be recognized and, accordingly, the ability to track its movement is lost. The same often happens when an object is represented in the training set but is only partially visible or appears in an unusual position.
Further consequences are the inability to detect unknown objects/processes and, accordingly, the impossibility of autonomous and continual learning, which implies the generation of new concepts by the AI system - the essence of the discussed perception-concept gap. Note that the above remains true for approaches based on unlabeled training data sets.
Acceptable results for this type of system occur in restricted environments where the appearance of unknown objects is not allowed (a production environment) or where the presence of unknown objects is not significant (for example, assessing the intensity of road traffic). In cases where the appearance of unknown objects can be consequential (automated driving, military systems, robotic surgeons), other solutions are required.
As we can see, the ability to track the movement of arbitrary objects, detect unknown objects, and generate new concepts based on sensory data are all aspects of the same ability, the absence of which we call the perception-concept gap. Finding options for eliminating the perception-concept gap is one of the most essential tasks on the path towards AGI (Artificial General Intelligence).
The first discussed way to solve this problem, proposed by AI/AGI developers, is cluster analysis.
In a brief formulation, this seems reasonable: cluster analysis divides the set of objects being classified into groups of similar ones. Each cluster may correspond to a specific concept.
The first problem we encounter with this approach is that, over time, the set of sensor data being classified changes - it either expands (if new data supplements existing data) or changes in composition (if new data replaces the oldest data). Clustering offers no guarantee that, when a new object is observed, the new set of clusters will contain the same clusters as before (perhaps with slightly changed parameters) plus a new cluster corresponding to the new object. For many clustering methods, the number of clusters is specified by the user (directly or indirectly) - such options are unsuitable, because whether a previously unknown object has appeared should be a result of the analysis, not an input parameter. This means that some new clustering algorithm is needed rather than one of the known ones.
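To illustrate this instability, consider a minimal sketch in Python (using NumPy and scikit-learn's KMeans on invented one-dimensional data; the data and parameters here are purely illustrative). When new observations arrive and we re-cluster with one extra cluster, the boundaries between the old clusters move, and points that used to share a cluster end up separated:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # "Known" observations: 150 points spread along a line, grouped into 3 clusters.
    known = rng.uniform(0.0, 10.0, size=(150, 1))
    before = KMeans(n_clusters=3, n_init=10, random_state=0).fit(known)

    # A "new object" appears beyond the old data; re-cluster with k=4,
    # hoping the extra cluster simply absorbs the newcomers.
    new = rng.uniform(10.0, 13.0, size=(50, 1))
    after = KMeans(n_clusters=4, n_init=10, random_state=0).fit(np.vstack([known, new]))

    # Cluster labels are arbitrary, so compare co-membership instead:
    # did two points that used to share a cluster stay together?
    def together(labels):
        return labels[:, None] == labels[None, :]

    changed = (together(before.labels_) != together(after.labels_[:len(known)])).mean()
    print(f"fraction of old point pairs whose grouping changed: {changed:.2%}")

The old partition is not preserved: the new data shifts all the cluster boundaries, not just the one around the new object.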
The second problem is that an element of the set of clustered objects is a frame (an array of pixels or voxels) corresponding to the observed scene at a specific time point. Clustering implies the presence of a quantitative criterion for the similarity of two elements of the clustered set - in this case, two frames. This criterion should have the same value whether the frame contains one object of a particular type or several, whether the object is located in one place in the frame or another, in one orientation or another, at one size or another; otherwise, for a single type of object, we get many distinct clusters - for different numbers of objects, different positions, and so on. The construction of such a practically applicable criterion is a task that is at least non-trivial (and possibly unsolvable).
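The difficulty is visible even in a toy case. Below is a sketch (NumPy only; the 16x16 binary frames and the 3x3 "object" are invented for illustration) showing that a naive per-pixel distance rates the same object in a different position as highly dissimilar, while a slightly different object in the same position comes out similar:

    import numpy as np

    def frame_with_square(top, left, size=16):
        f = np.zeros((size, size))
        f[top:top + 3, left:left + 3] = 1.0  # a 3x3 "object"
        return f

    a = frame_with_square(2, 2)    # object in the top-left corner
    b = frame_with_square(10, 10)  # same object, bottom-right corner
    c = frame_with_square(2, 2)
    c[4, 4] = 0.0                  # slightly different object, same place

    def dist(x, y):
        return np.linalg.norm(x - y)

    print("same object, different position:", dist(a, b))   # large
    print("different object, same position:", dist(a, c))   # small

A usable criterion would have to invert this ordering, which is exactly the invariance problem described above.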
The two described difficulties of using clustering as a method of perception-concept gap elimination are not the only ones (readers, after reflecting on the problem, will probably find others). This, however, is enough to make it clear: clustering as a way to overcome the perception-concept gap is a possible direction for search but not a tool ready for practical use.
Ok, so I think clustering is the tool to use and is highly practical when done correctly.
So your first problem, "number of clusters is specified by the user," is not an issue. Instead of thinking of each cluster as a high-level "object," think of the clusters as micro-features that make up the high-level objects. What we specify when we define the number of clusters is the resolution of our understanding. Just as we can take an image with a 64-pixel camera where each pixel is only white or black (one bit per pixel), or with a 100-megapixel color camera with 24 bits of data per pixel, both images are able to represent the concept "A" (the shape of the letter A drawn on paper).
If I have a clustering algorithm that takes raw data and produces only a 64-bit output space (64 clusters), it doesn't mean the system won't be able to recognize and "understand" more than 64 patterns. Say I trained it on black-and-white handwritten letters to recognize symbols, but asked it to cluster samples that make up 1,000 different symbols; the clustering algorithm should still be able to identify common micro-features that, when combined, would easily identify more than 64 symbols. It could create 2^64 different combinations of "features" and could identify up to 2^64 different symbols (best case). But since there were only 1,000 we were trying to identify, most of the symbols would cause only a few micro-features to be active to identify them.
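To put rough numbers on this (the feature names below are invented, purely to illustrate the idea of symbols as feature combinations):

    # 64 binary micro-features (one per cluster) give 2**64 activation patterns,
    # so the number of distinguishable symbols is bounded by combinations of
    # features, not by the number of clusters.
    k = 64
    print(f"{k} clusters -> {2**k:,} possible activation patterns")

    # Hypothetical symbols sharing a small pool of micro-features:
    codes = {
        "A": {"diag_left", "diag_right", "horizontal_bar"},
        "X": {"diag_left", "diag_right"},
        "H": {"vertical_left", "vertical_right", "horizontal_bar"},
    }
    # Each symbol is a distinct combination of active features, so far more
    # symbols than clusters can be told apart.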
So if we use clustering to define micro-features, and use downstream learning to translate these micro-features into behavior (such as a behavior of saying "that's an A" or "that's an X," etc.), the behavior space can be much larger than the tiny 64-element independent cluster space.
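Here's a rough sketch of that whole pipeline (scikit-learn's digits dataset standing in for handwritten symbols; the patch size, k=64, and the choice of logistic regression as the downstream learner are all just illustrative choices, not claims about the right architecture):

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    digits = load_digits()                  # 8x8 grayscale images, 10 classes
    images, labels = digits.images, digits.target

    def patches(img, size=4, stride=2):
        """All size x size patches of an image, flattened."""
        return np.array([img[r:r + size, c:c + size].ravel()
                         for r in range(0, img.shape[0] - size + 1, stride)
                         for c in range(0, img.shape[1] - size + 1, stride)])

    all_patches = np.vstack([patches(im) for im in images])

    # Unsupervised step: learn 64 micro-features (cluster centroids).
    km = KMeans(n_clusters=64, n_init=10, random_state=0).fit(all_patches)

    # Encode each image as a 64-bit vector: which micro-features are active?
    def encode(img):
        vec = np.zeros(64)
        vec[km.predict(patches(img))] = 1.0
        return vec

    X = np.array([encode(im) for im in images])
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)

    # Downstream "behavior": a supervised map from activation patterns
    # to symbol names ("that's an A", "that's an X", ...).
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("test accuracy:", clf.score(X_te, y_te))

The 64 clusters never see a label; the downstream learner is what turns combinations of them into ten distinct "behaviors," and it could just as well turn them into far more.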
So don't think of the number of clusters we specify as limiting the macro concepts; think of it as how much resolution we have for defining concepts.
"The second problem is that an element of the set of clustered objects is a frame"
This too is not the showstopper you think it is. At least, I don't believe it is. Using a hierarchical clustering system (a feature-extraction system), each layer in the network will represent different forms of the encoded concept. Just as we see visual pixels being transformed into "edge" concepts, which translate higher up into shape concepts, and many layers higher into, say, "cat" concepts, the network will be able to encode smaller, more detailed information at the lower levels and larger concepts at the higher levels.
So, at a lower level, we could end up with a concept of "unknown object moving to the right, located in the top-right corner of our visual field" vs. "unknown object moving to the left, in the top-right corner of our visual field," but at a higher level it ends up with "cat in the room over in the corner" and "moving right" as different concepts. The feature system needs to translate specific features, like where something is in our visual field, into more abstract ones, like "cat in the room over in the corner," where it maps from the 2D space of our visual field to a more complete map of invariant features of our current total environment.
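As a structural sketch of what such a hierarchy could look like (random stand-in frames and invented layer sizes; real sensory data would be needed for the learned codes to mean anything):

    import numpy as np
    from sklearn.cluster import KMeans

    def layer(X, k):
        """One clustering layer: k micro-features fit on the rows of X."""
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    rng = np.random.default_rng(0)
    frames = rng.random((200, 16, 16))          # stand-in sensory frames

    # Layer 1: cluster 4x4 patches into 32 low-level ("edge-like") codes.
    def patches(img, size=4):
        return np.array([img[r:r + size, c:c + size].ravel()
                         for r in range(0, 16, size)
                         for c in range(0, 16, size)])

    l1 = layer(np.vstack([patches(f) for f in frames]), k=32)

    # Layer 2: summarize each frame's layer-1 codes position-tolerantly
    # (a histogram), then cluster those summaries into 8 higher concepts.
    def encode(f):
        codes = l1.predict(patches(f))          # 16 patch codes per frame
        hist = np.bincount(codes, minlength=32)
        return hist / hist.sum()

    l2 = layer(np.array([encode(f) for f in frames]), k=8)
    print("high-level concept of frame 0:", l2.predict([encode(frames[0])])[0])

The lower layer keeps the detailed, position-specific codes; the summary step between layers is what trades "where in the visual field" for the more abstract, invariant description.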
I believe all this is possible with a generic/simple hierarchical clustering algorithm. Selecting the number of "clusters" is really just the same as deciding how many neurons you want to use in your neural network, which just ends up defining the resolution of your understanding system, just as happens when we decide how many pixels to use in our camera.
The AGI problem is one of how to build/learn a small set of useful behaviors in an environment that is many orders of magnitude too complex to understand with a machine as small as the brain. So what the brain does, I believe, is extract the most predominant features it can find and drive our behavior with those. We only "see" and "understand" the very tip of the iceberg of the true complexity of our environment. We are blind to 99.99% of what is around us. But that 0.01% we have access to is enough to create all the complexity of behavior we see in humans.