Any engineering project can be viewed as a sequence of decisions, each a choice of one option from a number of possible ones. The result, of course, depends on how good the choice was at each step.
In practice, the depth of analysis behind a choice varies widely. In some cases, experiments or research are required, and the choice becomes a separate subtask of the project. In others, the choice is quite obvious. For example, before the invention of the steam engine, the sail was the obvious choice for ocean-going ships, and pneumatic tires are the obvious choice for bicycles today.
What is specific to an obvious choice is the almost complete absence of analysis of its consequences: if the decision is obvious, why spend resources on research? The decision not to spend resources on impact analysis is itself an example of an obvious decision!
Yet the history of technological progress shows a recurring pattern: innovations appear because someone starts to analyze in detail the consequences of obvious decisions and to look for alternatives against which the hitherto obvious solutions can be compared. Since our subject is AI, we will take an example of an obvious solution from this field, specifically from object tracking. This task is highly relevant to developers of robots, security systems, car autopilots, and military systems. The obvious solution here is to use a neural network as the fundamental element of the visual tracking system, and its obviousness is quite natural: ready-made implementations exist, neural networks recognize objects successfully, competitors use this approach, and there are no apparent alternatives.
But if we nevertheless analyze the consequences of such a decision, fascinating effects are revealed. They stem from what a neural network actually does: it recognizes visible objects belonging to the set of object types on which it was "trained." The algorithm looks quite obvious: a neural network analyzes each frame and finds the locations of known objects in it; the position of each object in the current frame is compared with its position in the previous frame, and the movement of objects in the system's field of view is thus tracked. Simple and logical, isn't it?
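A minimal sketch of this obvious algorithm follows, assuming a hypothetical `detect` function that stands in for a trained neural detector (a real system would run a network here):

```python
# A minimal sketch of the detection-first tracker described above.
# `detect` is a hypothetical placeholder for a trained neural network;
# it returns a (label, x, y) triple for each recognized known object.

def detect(frame):
    return [("car", 120.0, 80.0), ("person", 40.0, 200.0)]  # placeholder output

def step(prev_detections, frame):
    """Compare each object's position on the current frame with the previous one."""
    current = detect(frame)
    displacements = []
    for label, x, y in current:
        same_type = [(px, py) for plabel, px, py in prev_detections if plabel == label]
        if same_type:
            # the nearest previous object of the same type is assumed to be "the same" object
            px, py = min(same_type, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)
            displacements.append((label, x - px, y - py))
    return current, displacements
```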
Practice with this approach raises a first doubt: when an object that is not in the training dataset appears, there is no way to track its movement. But this seems obvious, natural, and understandable: what the network has not been taught cannot be demanded of it; you just need to retrain it.
The next point is the presence of stationary objects in the field of view. "Tracking the movement of stationary objects" sounds like a joke, but our solution allows nothing else, because with this approach movement can be detected only at the last step of frame analysis.
Another consequence is that recognizing objects in the current frame only determines each object's type and its position in the frame. Of course, there can be several objects of the same type in a frame, so to track movement we must establish a correspondence between individual objects in adjacent frames. Position alone is not enough: the bounding boxes that determine objects' locations can overlap, and the objects can "swap places." Moreover, the time spent matching objects across adjacent frames grows quadratically with the number of objects of the same type.
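The matching step can be framed as an assignment problem. Below is a sketch with made-up coordinates: building the pairwise cost matrix is already quadratic in the number of same-type objects, and SciPy's Hungarian-algorithm solver then finds the optimal one-to-one matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

prev_pts = np.array([[10, 10], [50, 52], [90, 91]])  # same-type objects, frame t
curr_pts = np.array([[52, 50], [12, 11], [88, 95]])  # same-type objects, frame t+1

# pairwise squared distances: every previous object against every current one
cost = ((prev_pts[:, None, :] - curr_pts[None, :, :]) ** 2).sum(axis=2)

rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
for r, c in zip(rows, cols):
    print(f"object {r} in frame t -> detection {c} in frame t+1")
```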
Everything described above is a consequence of one implicit obvious decision: that the analysis must begin with the identification of objects, as if object detection and object identification were one and the same. This leads to the following sequence of steps in the tracking process:
1. identification of all known objects;
2. search for matching objects in adjacent frames;
3. detection of moving objects among those identified;
4. tracking of the moving objects;
5. analysis of the usefulness and danger of the moving objects.
In fact, another sequence is possible (and occurs many times in nature), based on the fact that to detect something moving, it is not necessary to identify it. This alternative approach leads to the following sequence (a minimal sketch of its first step follows the list):
1. detection of moving objects;
2. tracking of the moving objects;
3. analysis of the danger/usefulness of the moving objects;
4. identification of those objects for which it makes sense.
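As an illustration of the first step, assuming grayscale frames represented as NumPy arrays, moving pixels can be found by simply differencing consecutive frames; no identification is involved, and the threshold value is an arbitrary illustrative choice.

```python
import numpy as np

def motion_mask(prev_frame, frame, threshold=25):
    """Boolean mask of pixels that changed noticeably between consecutive frames."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold  # True wherever something moved (or lighting changed)
```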
Such a sequence potentially solves several of the problems listed above at once, including the two main ones: the lack of computing power, which leads to an insufficiently fast response to a changing situation, and the inability to respond to unknown objects.
Has no one really thought of this and tried to implement this alternative?
Some thought of it and tried. But it turned out badly, and the reason is the same reliance on obvious solutions, in this case for finding moving objects. Detecting the presence of something moving is not difficult; the most primitive security systems can do it. The difficulty is dividing everything that moves into the many objects that are to be tracked separately.
Classical Computer Vision methods are based on extracting a set of features from the frame - specific points, contour fragments, and so on - followed by their analysis. This runs into two problems. First, a real object is represented by a particular set of features, which leads to the task of dividing the entire collection of detected features into groups that form separate objects - a task that is difficult, time-consuming, and poorly studied from a practical point of view. Second, many objects are too small (in pixels) for a contour to be built or specific points to be found, or have no clear contour at all, whether by their nature or because of lighting, texture, and so on.
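For concreteness, the sketch below runs OpenCV's Shi-Tomasi corner detector on a synthetic two-object image. The result is a flat, ungrouped list of points from both objects at once, which is exactly the grouping problem just described.

```python
import numpy as np
import cv2

frame = np.zeros((200, 200), dtype=np.uint8)
cv2.rectangle(frame, (40, 40), (100, 100), 255, -1)    # synthetic object 1
cv2.rectangle(frame, (130, 120), (170, 180), 255, -1)  # synthetic object 2

corners = cv2.goodFeaturesToTrack(frame, maxCorners=50,
                                  qualityLevel=0.01, minDistance=5)
# a flat list of corner coordinates; nothing says which corner belongs to which object
print(corners.reshape(-1, 2))
```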
However, suppose we find an alternative to the obvious solution of treating an object as a set of features. The task then becomes quite simple: a whole moving thing can be treated as a single feature.
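Continuing the frame-differencing sketch above, each connected patch of the motion mask can be reduced to a single centroid, one feature per moving thing. This is an illustrative sketch, not a complete tracker.

```python
import numpy as np
from scipy import ndimage

def moving_features(mask):
    """One (row, col) centroid per connected region of a boolean motion mask."""
    labels, count = ndimage.label(mask)  # split moving pixels into separate blobs
    return ndimage.center_of_mass(mask, labels, range(1, count + 1))
```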
This possibility is hard to notice because a single frame has long been considered the primary object of Computer Vision analysis.
But when it comes to motion detection, it turns out to be helpful to take as the fundamental concept not a frame (an image that is essentially isolated and static) but a process of continuous image transformation. This makes it possible to exploit the fact that the analyzed scene cannot change arbitrarily and abruptly from moment to moment. This, of course, is in line with the task of object tracking, but it goes beyond the scope of this chapter, whose purpose is to show the usefulness of analyzing the consequences of obvious solutions.
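A minimal sketch of how this continuity can be exploited: candidate matches between frames are rejected when the implied displacement exceeds a physically plausible per-frame limit (the `max_step` value is an illustrative assumption).

```python
def plausible_matches(prev_pts, curr_pts, max_step=20.0):
    """Keep only (i, j) pairs whose frame-to-frame displacement is physically plausible."""
    matches = []
    for i, (px, py) in enumerate(prev_pts):
        for j, (cx, cy) in enumerate(curr_pts):
            if ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5 <= max_step:
                matches.append((i, j))
    return matches

print(plausible_matches([(0, 0), (100, 100)], [(5, 3), (160, 160)]))
# [(0, 0)] - the second point "jumped" too far to be the same object
```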
The moral of this fable is this: if there is an obvious solution, it is worth looking at the problem from different angles - perhaps there is a better, non-obvious one.
As for the details of the alternative approach to object tracking, a separate chapter will be devoted to this.
Another obvious choice: when learning human languages, should we start with syntax before semantics, or the other way around? If you use term vectors, you are learning word-level semantics before syntax - this is why GPUs are required. Without term vectors, we could learn a million times faster.
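Purely as an illustration of the term-vector point: with word embeddings, semantic similarity is available as vector proximity before any syntax is parsed. The vectors below are made up for the demonstration; real embeddings are learned from data.

```python
import numpy as np

# made-up 4-dimensional "embeddings"; real term vectors are learned from corpora
vectors = {
    "ship": np.array([0.9, 0.1, 0.0, 0.2]),
    "boat": np.array([0.8, 0.2, 0.1, 0.3]),
    "tire": np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["ship"], vectors["boat"]))  # high: related meanings
print(cosine(vectors["ship"], vectors["tire"]))  # low: unrelated meanings
```

Whether skipping this stage really allows learning a million times faster is precisely the kind of question that the analysis of obvious decisions is meant to raise.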