AGI: CONNECTION BETWEEN VISION AND INTELLIGENCE
TO THE ATTENTION OF AI INVESTORS AND DEVELOPERS
Humans and the most intelligent animals in nature (mammals, birds, octopuses) all have acute eyesight. This is no accident: vision evolved to gather detailed information about the surrounding situation, and the brain evolved to analyze that information and make decisions valuable to the individual. The co-evolution of vision and the brain as two components of a single intelligent system has resulted in the close integration of these components.
Two aspects are essential for understanding the role of vision in intelligent systems.
The first aspect is the sheer volume of raw information received through vision compared to the second most important information channel, hearing. The difference is not just large; it is enormous: the flow of visual information from the retina is roughly 100,000 times greater than the flow of acoustic information. Accordingly, a significant share of the natural brain's resources is devoted specifically to analyzing visual input.
The second aspect is the focus of vision on detecting and analyzing the dynamics of the observed environment. Changes in the observed scene carry far more information than its static components. Everything that moves is a potential danger; for predators, moving objects are also a food source. Detecting moving objects is therefore the first and most important task of natural intelligent systems. Classifying objects in the surrounding world is based primarily on assessing their behavior: moving quickly or slowly, approaching or receding, and so on; how similar a moving object is to already known ones is secondary and only sometimes necessary. When an object flies straight at us, we dodge it (or try to intercept it) before we can determine what it is, a stone or a crumpled piece of paper. Here the assessment of the situation depends on detecting the moving object and predicting its trajectory.
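The dodge-before-identification behavior can be sketched as a constant-velocity trajectory check. This is a minimal illustration under simplifying assumptions: `will_hit` and its parameters are hypothetical names, and a real system would use proper state estimation (e.g. a Kalman filter) rather than two-point extrapolation.

```python
# Minimal sketch: decide to dodge BEFORE identification, assuming
# constant-velocity motion in 2D observer-centered coordinates.

def will_hit(p0, p1, dt, radius, horizon):
    """p0, p1: object positions at two consecutive observations, dt apart.
    Returns True if the extrapolated straight-line trajectory passes
    within `radius` of the observer (the origin) within `horizon` seconds."""
    vx = (p1[0] - p0[0]) / dt
    vy = (p1[1] - p0[1]) / dt
    # sample the linear trajectory forward from the latest observation p1
    steps = 100
    for i in range(steps + 1):
        t = horizon * i / steps
        x = p1[0] + vx * t
        y = p1[1] + vy * t
        if x * x + y * y <= radius * radius:
            return True
    return False

# an object flying straight at the observer triggers a dodge,
# with no identification involved at all
print(will_hit((10.0, 0.0), (9.0, 0.0), dt=0.1, radius=0.5, horizon=2.0))  # True
```

Note that the decision uses only position samples; whether the object is a stone or a crumpled piece of paper never enters the computation.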
When the visual scene is static, evolution's solution is to create "artificial dynamism" by moving the visual sensors, either with the whole organism or just the head. Analyzing changes in the visible scene makes it possible to segment the scene into objects without identifying (recognizing) them, which is especially important when an object is partially occluded by others, and to rank the observed objects by distance to the observer (which is closer, which is farther) without resorting to binocular vision and parallax analysis.
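The depth-ranking idea can be illustrated with motion parallax: while the observer translates, the apparent shift of an object is inversely proportional to its distance, so nearer objects shift more per frame. A minimal sketch; the `rank_by_parallax` helper and the sample shift values are assumptions for illustration only.

```python
# Minimal sketch: rank objects by distance using motion parallax alone.
# Assumption: the observer translates sideways, so an object's apparent
# shift (pixels per frame) is inversely proportional to its distance;
# a larger shift therefore means a closer object.

def rank_by_parallax(shifts):
    """shifts: dict mapping object id -> apparent shift magnitude (px/frame).
    Returns object ids ordered nearest-first (largest shift first)."""
    return sorted(shifts, key=lambda k: shifts[k], reverse=True)

observed = {"tree": 1.2, "fence": 6.5, "hill": 0.3}
print(rank_by_parallax(observed))  # ['fence', 'tree', 'hill']
```

No recognition of "tree", "fence", or "hill" is required; the ordering falls out of the dynamics of the scene.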
As a result, we see that the natural priority order for analyzing a visual scene is as follows:
detect moving objects
predict the trajectory of movement
identify an object using information about movement and size, shape, color, texture, etc.
This sequence makes rational decisions possible even when an object cannot be identified, avoids wasting resources on identifying objects of no interest, and simplifies identification (when it is required) by indicating the location in the frame and the size of the object to be identified.
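The priority order above can be sketched as a motion-first pipeline: detect changed pixels first, box the moving region, and hand only that crop to an identifier if identification is needed at all. A minimal illustration under stated assumptions: `moving_box` is a hypothetical helper, frames are toy grayscale grids, and a real system would use optical flow or background subtraction rather than raw frame differencing.

```python
# Minimal sketch of the motion-first priority: (1) detect moving pixels by
# frame differencing, (2) box the moving region, so that a later identifier
# (e.g. a neural network) only runs on that crop, and only when required.

def moving_box(prev, curr, thresh=10):
    """prev, curr: grayscale frames as lists of lists of ints.
    Returns the bounding box (top, left, bottom, right) of changed pixels,
    or None if nothing moved."""
    rows = [r for r in range(len(curr))
            if any(abs(curr[r][c] - prev[r][c]) > thresh
                   for c in range(len(curr[0])))]
    cols = [c for c in range(len(curr[0]))
            if any(abs(curr[r][c] - prev[r][c]) > thresh
                   for r in range(len(curr)))]
    if not rows:
        return None  # static scene: nothing to attend to
    return (rows[0], cols[0], rows[-1], cols[-1])

prev = [[0] * 6 for _ in range(6)]
curr = [[0] * 6 for _ in range(6)]
curr[2][3] = curr[3][3] = 200          # a small object appears
print(moving_box(prev, curr))          # (2, 3, 3, 3)
```

The bounding box supplies exactly what the text describes: the location in the frame and the size of the object, before and independently of any identification step.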
Both aspects, high resolution and analysis based primarily on dynamics, were developed by natural evolution as the most rational for the natural environment. Both are present in mammals, birds, and octopuses, even though their brains and nervous systems differ greatly in size and overall architecture. This argues that intelligent systems designed to function in a natural environment should likewise be endowed with vision built on these principles.
And how does it work in practice?
Publicly available information suggests that two approaches dominate in practice, forming two non-overlapping groups.
One group consists of advanced automatic control systems that follow the natural order (starting with the detection of moving objects and ending with identification) but are not positioned as AI systems, use other sensing techniques alongside visual sensors (or instead of them), and operate at far lower resolution than the human or animal eye.
The second group consists of systems that are positioned as AI, use high-resolution video cameras, rely on neural networks for processing visual information, and implement the reverse sequence of steps: first identification, and on its basis, detection and evaluation of object movement. Because a neural network can only recognize what it was trained on, some objects will inevitably go unrecognized, and the ability to respond intelligently to unknown objects in the environment is lost. One consequence of using such systems for autonomous driving is accidents caused by a failure to react to unrecognized objects.
It is reasonable to expect the next generation of AI systems to combine the strengths of the two groups: using high-resolution visual information while not making object identification the first step of analysis. This does not mean abandoning neural networks for identification (when required); rather, it restores the ability to respond intelligently to unknown objects and extends identification itself by adding dynamic characteristics to the set of identification parameters.
For investors, this is a hint about what is promising at the seed stage; for developers, an indication that knowledge of AI vision beyond purely neural-network approaches is useful.
An interesting question remains: what structure in a computer can preserve an object's shape, color, texture, and so on? One answer is the spline structure used in digital cartoons and games. The open question, then, is what algorithms would give splines the same kind of processing that we and other sighted animals apply to visual patterns.
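As a concrete illustration of such a spline structure, here is a cubic Bezier segment, the basic building block of shape outlines in digital animation, evaluated with de Casteljau's algorithm. This is a minimal sketch of the general technique, not any specific production format.

```python
# Minimal sketch: a cubic Bezier spline segment, the kind of compact
# structure that preserves an object's shape in animation and games,
# evaluated with de Casteljau's algorithm (repeated linear interpolation).

def bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier with 2D control points at t in [0, 1]."""
    def lerp(a, b, u):
        return (a[0] + (b[0] - a[0]) * u, a[1] + (b[1] - a[1]) * u)
    # first level of interpolation between consecutive control points
    a, b, c = lerp(p0, p1, t), lerp(p1, p2, t), lerp(p2, p3, t)
    # second and third levels collapse to a single point on the curve
    d, e = lerp(a, b, t), lerp(b, c, t)
    return lerp(d, e, t)

# four control points encode a whole smooth arc
print(bezier((0, 0), (1, 2), (3, 2), (4, 0), 0.5))  # (2.0, 1.5)
```

Four control points encode an entire smooth contour segment, which is why spline representations are so much more compact than pixel grids for storing shape.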