AGI: PERCEPTION-CONCEPT GAP
THE PRINCIPLE UNDERLYING THE SEPARATION OF THE OBSERVED SCENE INTO OBJECTS WITHOUT IDENTIFYING THEM
As a reminder, the current series of chapters discusses ways to bridge the PERCEPTION-CONCEPT GAP, the essence of which is the lack of workable approaches that give an AI system the ability to detect hitherto unknown objects in the environment in real time and to respond promptly to their appearance (including creating a new concept, if that is judged beneficial).
The previous chapter, WHAT ACTUALLY MEANS "APPEARANCE OF AN UNKNOWN OBJECT", discussed translating the expression "appearance of an unknown object" into engineering language, that is, ultimately into a description of what it means in technical terms. Now it is time to describe the principle that makes it possible to achieve the goals formulated in that engineering language.
The natural environment is three-dimensional, and the lion's share of environmental information comes from vision. This visual information arrives as a two-dimensional projection of the observed three-dimensional scene. Naturally, such a 2D projection depends on the position of the video sensor (eye or video camera) relative to the observed objects. This allows us to formulate the desired principle:
The two-dimensional projections of different objects change their relative positions in the frame when the spatial positions of the objects and/or the camera change.
The effect of changing the relative position is easy to observe by looking at the landscape outside the side window of a moving car:
Closer objects (trees) appear to move relative to more distant ones (buildings) in the direction opposite to the car's movement. Naturally, we observe the relative movement of projections, not of the actual objects - but this is precisely what allows us to separate objects from each other, regardless of whether we can recognize them (that is, whether the type of object is familiar to us and we can classify it).
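To make the principle concrete, here is a minimal sketch of what such an analysis could look like in code. It assumes two consecutive grayscale frames, tracks feature points between them, and groups points whose projections move similarly; each group is a candidate object, with no recognition involved. The use of OpenCV optical flow and DBSCAN clustering, the function name, and all parameter values are illustrative assumptions of mine, not a description of the implementation promised in a later chapter.

```python
# Minimal sketch: separate a scene into candidate objects by relative motion
# of their 2D projections between two frames. Illustrative only; parameters
# and the clustering choice are assumptions, not the author's implementation.
import cv2
import numpy as np
from sklearn.cluster import DBSCAN

def separate_by_relative_motion(frame_a, frame_b):
    """Group tracked points into candidate objects by similarity of motion."""
    # 1. Pick trackable points in the first (8-bit grayscale) frame.
    pts_a = cv2.goodFeaturesToTrack(frame_a, maxCorners=500,
                                    qualityLevel=0.01, minDistance=7)
    # 2. Track them into the second frame with sparse optical flow.
    pts_b, status, _err = cv2.calcOpticalFlowPyrLK(frame_a, frame_b, pts_a, None)
    ok = status.ravel() == 1
    pts_a = pts_a.reshape(-1, 2)[ok]
    pts_b = pts_b.reshape(-1, 2)[ok]
    # 3. Displacement of each projection between the two frames.
    motion = pts_b - pts_a
    # 4. Points whose projections move together are grouped together;
    #    the eps/min_samples values are illustrative, and label -1 marks
    #    points that do not join any coherent group.
    labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(motion)
    return pts_b, labels
```

Clustering on the displacement vectors alone is the simplest choice; a more robust sketch would also include the image coordinates of the points, so that spatially distant points that happen to move alike are not merged into one object.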
In particular, the presence or absence of such relative movement makes it easy to distinguish a landscape painted on a wall from a real one by moving relative to it: in the painted one, there is no relative movement of the projections of the drawn objects, and the whole picture is perceived as a single object - the painting itself.
Less noticeable, but quite easily detectable, is the visual effect of the vertical displacement of the visual sensor when walking: closer objects "move" slightly up and down relative to more distant ones (and to the horizon) with each step. The amplitude of these oscillations allows our innate subconscious system for analyzing the visual scene to rank objects by distance: what is closer (the amplitude is larger) and what is further away (the amplitude is smaller).
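A back-of-the-envelope relation shows why the oscillation amplitude ranks objects by distance. Assuming a simple pinhole-camera model (my assumption, not stated in the chapter) with focal length f, a vertical sensor displacement of Δh shifts the projection of a point at depth Z by roughly:

```latex
% Pinhole-camera approximation (assumed model): the projected shift is
% inversely proportional to depth, so a larger oscillation means a closer object.
\[
  \delta y \approx \frac{f\,\Delta h}{Z}
  \qquad\Longrightarrow\qquad
  \frac{Z_1}{Z_2} \approx \frac{\delta y_2}{\delta y_1}
\]
```

Only the ranking matters here: the ratio of amplitudes gives the ratio of depths, so neither f nor Δh needs to be known.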
If both the camera and the objects of the observed scene are moving, the ability to separate objects by analyzing changes in relative position in the 2D frame is retained.
The ability to separate is lost only in special (singular) cases: the movement of a projection in the frame caused by an object's actual movement may coincide with the movement of another object's projection caused by the camera's movement. This is a relatively rare situation, but when it occurs, it can cause optical illusions in humans (for example, an airplane that appears to "hang in the air" when observed from a moving car).
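As a rough illustration of when the singular case occurs (again under the assumed pinhole model, for purely lateral motion): a static object at depth Z_s and an independently moving object at depth Z_m produce the same image motion when

```latex
% Image velocities with camera lateral speed v_c and object lateral speed v_o:
%   static object:  \dot{x}_s = -f v_c / Z_s
%   moving object:  \dot{x}_m =  f (v_o - v_c) / Z_m
% Separation by relative motion fails when the two coincide:
\[
  \frac{-v_c}{Z_s} \;=\; \frac{v_o - v_c}{Z_m}
\]
```

One reading of the airplane example is the case v_o ≈ v_c: the airplane's projection is then nearly motionless in the frame, just like the most distant static background, so it appears to hang in place rather than stand out as a moving object.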
If objects are in close contact and do not move relative to each other, they are perceived as one composite object. Since, in practice, a quick reaction is needed precisely when objects move, such an interpretation is not a hindrance.
An advantage is that this principle allows the parts of an object to be perceived as a whole when the object is observed behind a lattice obstacle or illuminated by "striped" light: the visible parts move together, and so they are grouped into a single object.
Of course, in situations where the logical object is not a separate physical object, such as inscriptions or symbols on road signs, it is treated as one object together with its carrier; further analysis is facilitated by the fact that separating out the carrier object makes the zone of subsequent analysis known.
Critically minded readers have probably noted that when the relative position of objects in space changes, not only do the relative positions of the projections change, but so do the projections themselves (orientation, shape, size). This is indeed true, but practice shows that, first, small changes in position produce easily measurable changes in the relative positions of the projections while altering the projections themselves only subtly; and second, the changes in the projections that do occur follow the known mathematical (geometric) dependencies between objects and their projections, which must be taken into account in the analysis (for example, a change in distance means a scaling of the projection).
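One example of such a geometric dependency, again under the assumed pinhole model: an object of physical size S at distance Z produces a projection of size s, so a change of distance rescales the projection in a predictable way.

```latex
\[
  s = \frac{f\,S}{Z}
  \qquad\Longrightarrow\qquad
  \frac{s'}{s} = \frac{Z}{Z'}
\]
% Moving the same object from distance Z to Z' scales its projection by Z/Z',
% which the analysis can compensate for before comparing frames.
```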
Thus, using the discussed principle requires observing a process (at least two video frames are involved in the analysis) rather than examining a single image. In return, the ability to separate the visual scene into objects is acquired, regardless of whether those objects can be identified (recognized).
The described property of the projections of a three-dimensional scene has, naturally, been known for a long time. Moreover, it has long been used to measure distance from parallax. A natural question therefore arises: why is this approach not used to separate objects in the observed scene? One reason is the relatively high complexity of implementation, which requires analyzing a changing situation rather than a single image. Another reason is that for many applications it is sufficient to operate with a predefined set of object types, for which ready-to-use tools are available. Now that this turns out to be insufficient in practice, a corresponding demand arises for a method of overcoming the perception-concept gap.
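For reference, the classical parallax (stereo) ranging relation mentioned here, with f the focal length, B the baseline between the two viewpoints, and d the disparity (the shift of the projection between the two views):

```latex
\[
  Z = \frac{f\,B}{d}
\]
% The closer the object, the larger the disparity: the same inverse-depth
% dependence that the object-separation principle above exploits.
```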
Details of the software implementation of the principle will be given in one of the following chapters.
Right. So from the time dimension we get a classification of objects by how they change over time. That is why my network (the one we have been chatting about) uses time correlation as the key foundation of classification: things that correlate in time get grouped together to form features (what you are calling objects).
How does your system make use of stereo vision, which also allows us to separate objects without motion? In theory, though not yet proven, my network based on statistical correlations should decode both stereo-vision information and your relative-motion information to identify unique features.
And here's a real tricky one for you. Even though we learn about objects in the real world through the relative motion you talk about, our perception network can look at a static picture and still parse what the objects are. There is no motion at all, because it's just a photo or painting or sketch, yet our subconscious perception system still parses it into objects and gets it mostly right most of the time. If motion defines objects, how is our perception system able to parse a static image into objects correctly?
This would include, for example, a painting of a road scene with an alien animal on it that no one has ever seen before. Yet everyone could look at the painting (or a real photo) and see that there was a big animal standing in the middle of the road. Our perception system would parse the static image very accurately even though it had never seen any object of this type, and even though there was zero relative motion in the image to indicate where the object boundaries are.
It's because we don't just use motion to parse new objects. We use many different clues. We use stereo vision when we can see things in real life with two eyes. In a picture, we use shadows. We recognize the edges formed by the boundaries of color shapes. We use our understanding of gravity to assume the animal must be touching the ground, and not floating in the air, which tells us where on the ground it must be "sitting"; given its location on the ground and our innate understanding of perspective, we get a good idea of its size. We use depth of field and focus to measure distance, so if a painting captures the focus variation, that tells us more about the location of objects in 3D space. We use our understanding of the other objects in the scene to establish context (how much room there is on a road for an animal to occupy that space). All of this information is parsed in milliseconds by our perception system, and we just "see" the object in the painting without having to wait for the brain to do a bunch of number crunching and without having to logically reason about what we are seeing with our language skills.
And if I see 100 photographs of some new object, my brain learns about the object, even though I've never seen one in real life, and even if I've never seen a video of it so I have no motion data to create the concept of this new object.
In order to use all these clues, and more that we don't think about, you need a generic statistical approach to the identification of "objects" in raw sensory data. And when you get the statistics right, you don't just get visual objects being defined, you get auditory objects defined (like words, or a cat's meow, or a dog's bark), and these things end up with locations in 3D space even though they are just sounds. How do sounds get separated into "objects" and assigned a location in our 3D space map if we can't use the relative motion data you talked about in this article?
Either evolution created a million different "tricks" with different brain circuits to do all of this and more, or the real trick evolution came up with was one generic algorithm for doing all of it. I happen to believe the latter is the truth. I think my perception algorithm might be that trick (or very close to it, if not the full thing), but I need to spend a lot of time studying what it really can and cannot do to know the full extent of its power (or total lack of power, if that's the case).
It might be better to think of the visual environment as a four-dimensional entity (adding time), with vision providing two 3D projections of it. This is a better concept because it allows the movement of the observer and of the objects to be incorporated directly.