Ok, so I think clustering is the tool to use and is highly practical when done correctly.
So your first problem, "number of clusters is specified by the user", is not an issue. Instead of thinking of each cluster as a high-level "object", think of the clusters as micro-features that make up the high-level objects. What we specify when we define the number of clusters is the resolution of our understanding. We can take an image with a 64-pixel camera where each pixel is only white or black (one bit per pixel), or with a 100-megapixel color camera with 24 bits of data per pixel, and both images are able to represent the concept "A" (the shape of the letter A drawn on paper).
If I have a clustering algorithm that takes raw data and produces only a 64-bit output space (64 clusters), it doesn't mean the system won't be able to recognize and "understand" more than 64 patterns. Say I trained it on black-and-white handwritten letters to recognize symbols, but asked it to cluster samples that make up 1000 different symbols. The clustering algorithm should still be able to identify common micro-features that, when combined, would easily identify more than 64 symbols. 64 on/off features allow 2^64 different combinations, so it could identify up to 2^64 different symbols (best case). But since there were only 1000 we were trying to identify, most of the symbols would only cause a few micro-features to be active.
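To put rough numbers on that, here is a small counting sketch. It assumes each micro-feature is simply on or off for a given symbol, which is my simplification; the function names are made up for illustration.

```python
from math import comb

def total_codes(num_features: int) -> int:
    """Number of distinct on/off patterns over num_features binary features."""
    return 2 ** num_features

def sparse_codes(num_features: int, max_active: int) -> int:
    """Number of patterns with between 1 and max_active features switched on."""
    return sum(comb(num_features, k) for k in range(1, max_active + 1))

print(total_codes(64))      # every possible combination of 64 features
print(sparse_codes(64, 1))  # one feature per symbol: the naive "64 objects" reading
print(sparse_codes(64, 2))  # at most two active features per symbol
print(sparse_codes(64, 3))  # at most three active features per symbol
```

Under these assumptions, allowing just two active features at a time already gives 2080 distinct codes, which is more than the 1000 symbols in the example, and three at a time gives 43744.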
So if we use clustering to define micro-features, and use downstream learning to translate these micro-features into behavior (such as saying "that's an A" or "that's an X"), the behavior space can be much larger than the tiny space of 64 independent clusters.
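A toy sketch of that split between micro-features and downstream behavior. Everything here is an assumption for illustration: the 3x3 grid, the six hand-picked stroke features (stand-ins for what a clustering step might learn), and a lookup table standing in for the downstream learner.

```python
# Hypothetical micro-features: each is a set of (row, col) pixels on a 3x3 grid.
FEATURES = {
    "top": {(0, 0), (0, 1), (0, 2)},
    "bottom": {(2, 0), (2, 1), (2, 2)},
    "left": {(0, 0), (1, 0), (2, 0)},
    "right": {(0, 2), (1, 2), (2, 2)},
    "mid_col": {(0, 1), (1, 1), (2, 1)},
    "mid_row": {(1, 0), (1, 1), (1, 2)},
}
FEATURE_ORDER = list(FEATURES)

def draw(*names):
    """Build an image (a set of lit pixels) by overlaying the named strokes."""
    pixels = set()
    for name in names:
        pixels |= FEATURES[name]
    return pixels

def encode(pixels):
    """One bit per micro-feature: 1 if every pixel of that feature is lit."""
    return tuple(int(FEATURES[name] <= pixels) for name in FEATURE_ORDER)

# Seven toy symbols built from only six micro-features.
SYMBOLS = {
    "L": draw("left", "bottom"),
    "T": draw("top", "mid_col"),
    "I": draw("mid_col"),
    "O": draw("top", "bottom", "left", "right"),
    "C": draw("top", "bottom", "left"),
    "H": draw("left", "right", "mid_row"),
    "U": draw("left", "right", "bottom"),
}

# Downstream "learning", reduced here to memorising which code means which label.
code_to_label = {encode(pixels): label for label, pixels in SYMBOLS.items()}

def recognize(pixels):
    return code_to_label.get(encode(pixels), "unknown")

print(len(FEATURES), "features,", len(code_to_label), "distinct symbols")
print(recognize(draw("left", "right", "mid_row")))
```

The point is only that the labels live in the combinations, so the number of recognizable symbols is not capped by the number of features.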
So don't think of the number of clusters we specify as limiting the macro concepts; think of it as how much resolution we have in defining concepts.
"The second problem is that an element of the set of clustered objects is a frame"
This too is not the show stopper you are thinking it is. At least I don't believe it is. In a hierarchical clustering system (a feature extraction system), each layer in the network represents a different form of the encoded concept. Just as we see visual pixels being transformed into "edge" concepts, which translate higher up into shape concepts, and many layers higher into, say, "cat" concepts, the network will be able to encode smaller, more detailed information at the lower levels and large concepts at the higher levels.
So, at a lower level, we could end up with a concept of "unknown object moving to the right, located in the top right corner of our visual field" vs. "unknown object moving to the left, in the top right corner of our visual field". But at a higher level it ends up with "cat in the room over in the corner" and "moving right" as different concepts. The feature system needs to translate specific features, like where the object is in our visual field, into more abstract ones like "cat in the room over in the corner", mapping from the 2D space of our visual field to a more complete map of invariant features of our current total environment.
I believe all this is possible with a generic/simple hierarchical clustering algorithm. Selecting the number of "clusters" is really just the same as deciding how many neurons you want to use in your neural network, which just ends up defining the resolution of your understanding system, just as happens when we decide how many pixels to use in our camera.
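As a very rough sketch of what I mean by layers of clustering, here is a two-layer toy that stacks plain k-means. The choice of k-means, the 1D binary signals, the patch width and the cluster counts are all assumptions picked to keep the example small; it is not a claim about how a real system would be built.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def kmeans(points, k, iters=10):
    """Plain Lloyd's k-means, seeded with the first k distinct points."""
    centers = []
    for p in points:
        if list(p) not in centers:
            centers.append(list(p))
        if len(centers) == k:
            break
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers

def patches(signal, width):
    """Cut a signal into non-overlapping patches of the given width."""
    return [signal[i:i + width] for i in range(0, len(signal) - width + 1, width)]

def encode(signal, centers, width):
    """Layer-1 output: how often each patch prototype occurs in the signal."""
    hist = [0] * len(centers)
    for patch in patches(signal, width):
        hist[nearest(patch, centers)] += 1
    return hist

signals = [
    [0, 1, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 1],
]
WIDTH = 2

# Layer 1: cluster raw patches into small local prototypes (the micro-features).
all_patches = [p for s in signals for p in patches(s, WIDTH)]
layer1 = kmeans(all_patches, k=4)

# Layer 2: cluster the layer-1 activation patterns into coarser concepts.
codes = [encode(s, layer1, WIDTH) for s in signals]
layer2 = kmeans(codes, k=2)
top_level = [nearest(c, layer2) for c in codes]

print(codes)
print(top_level)
```

In this toy the first and third signals end up in the same top-level cluster even though their raw patches differ slightly, which is the small-scale version of detail at the bottom and broader concepts higher up. The two k values are the "resolution" knobs.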
The AGI problem is one of how to build/learn a small set of useful behaviors based on an environment that is many orders of magnitude too complex to understand with a machine as small as the brain. So what the brain does, I believe, is extract the most predominant features it can find and drive our behavior with them. We only "see" and "understand" the very tip of the iceberg of the true complexity of our environment. We are blind to 99.99% of what is around us. But the 0.01% we have access to is enough to create all the complexity of behavior we see in humans.