AGI: COMPUTER VISION, TOPOLOGY, AND `CONGENITAL` KNOWLEDGE
The three topics mentioned in the title do not, at first glance, appear to be related to each other. However, a detailed analysis reveals close links between them.
One of the most common computer vision tasks is image segmentation. Technically, a segment is usually represented by a set of pixels in which each pixel is adjacent to at least one other pixel of the same segment. The proximity of pixels is, of course, an element of topology. The segment `painting` algorithm explicitly uses some way of checking whether two pixels are adjacent.
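As an illustration of how segment `painting` depends on an adjacency check, here is a minimal sketch of a flood fill. It is not taken from any particular vision library. The function name `paint_segment`, the rule that pixels belong to one segment when their values are equal, and the representation of adjacency as a list of (row, column) offsets are all assumptions made for this example.

```python
from collections import deque

# Assumption: adjacency is described by (row, column) offsets.
# Here only pixels sharing a side are treated as adjacent.
SIDE_OFFSETS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def paint_segment(image, start, adjacent_offsets):
    """Collect all pixels reachable from `start` through adjacent pixels
    that have the same value as `start` (hypothetical membership rule)."""
    rows, cols = len(image), len(image[0])
    target = image[start[0]][start[1]]
    segment = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in adjacent_offsets:
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and (nr, nc) not in segment and image[nr][nc] == target:
                segment.add((nr, nc))
                queue.append((nr, nc))
    return segment


image = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]
# The pixel (2, 2) touches the segment only at a corner,
# so it is not reached with side adjacency.
segment = paint_segment(image, (0, 0), SIDE_OFFSETS)
```

Changing the list of offsets changes the topology and therefore the resulting segments, without any change to the painting loop itself.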
The traditional way of representing an image as a set of pixels uses a rectangular frame stored as a two-dimensional array of pixels. In essence, this means splitting a large rectangular frame into many small rectangular pixels, each with a specific set of attributes (brightness, color, distance, etc.). A pixel is then referred to by a pair of integers, the ordinal numbers of its row and column in the two-dimensional array. Pixel proximity is easily tested by comparing row and column numbers. Here it immediately becomes apparent that there are two kinds of pixel `neighborhood`. Four neighbors share a side with a given pixel and differ by one in either the row or the column number, but not both. Four other neighbors share only a vertex, and both their row and column numbers differ by one.
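The two kinds of neighbors on a rectangular grid can be told apart with a few integer comparisons. The sketch below assumes that a pixel is given as a (row, column) pair; the function name `neighbor_kind` and the labels "side" and "vertex" are hypothetical and chosen only for this illustration.

```python
def neighbor_kind(p, q):
    """Classify the relation between two pixels given as (row, column).

    Returns "side" if the pixels share a side, "vertex" if they share
    only a vertex, and None if they are not neighbors (or are the same pixel).
    """
    d_row = abs(p[0] - q[0])
    d_col = abs(p[1] - q[1])
    if d_row + d_col == 1:
        # Exactly one of the two numbers differs by one.
        return "side"
    if d_row == 1 and d_col == 1:
        # Both numbers differ by one.
        return "vertex"
    return None
```

For example, `neighbor_kind((2, 3), (2, 4))` classifies a pair in the same row, while `neighbor_kind((2, 3), (3, 4))` classifies a diagonal pair.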
The advantage of the image representation described above is its simplicity. But there is also a drawback: the two types of pixel `neighborhood`. The properties of splitting a frame into pixel elements differ in different directions. For example, with square pixels the distance between centers is one pixel side for neighbors that share a side but sqrt(2) ≈ 1.41 pixel sides for diagonal neighbors, a difference of 41%; that is, the degree of anisotropy of such a representation is quite large. At the same time, such anisotropy is absent in the visual picture of the world, so it is an undesirable property of a representation of visual information.
Dividing the plane into squares (rectangles) is not the only way to split it into identical elements that represent an image as a set of pixels. An obvious alternative is triangles (primarily regular ones). This option is immediately unattractive for two reasons: first, the possible ways of indexing pixels are more complicated than for a rectangular grid; second, the anisotropy is even stronger, and the number of neighbors increases from 8 to 12 (three with a common side and nine with only a common vertex).
Another alternative to the rectangular pixel grid is the hexagonal grid, known as the honeycomb. If a pixel is represented by a regular hexagon, it has only six neighbors of the same type: all have a common side.
The simplicity of representing a frame as a two-dimensional array of rows and columns is lost, but another pixel indexing scheme keeps neighborhood checks simple. Each pixel is identified by three integers, a kind of `coordinates` [ U, V, W ], whose sum is always zero ( U+V+W=0 ). Because of this constraint, two neighboring pixels differ in precisely two of the values { U,V,W }: one value is greater by 1 and another is smaller by 1, while the third value is the same.
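The neighborhood check for the three-integer indexing can be sketched as follows. Since the sum U+V+W must stay zero, a step to a neighboring pixel raises one value by 1 and lowers another by 1, leaving the third unchanged, which gives six possible offsets. The names `HEX_NEIGHBOR_OFFSETS`, `are_hex_neighbors` and `hex_neighbors` are hypothetical and used only for this illustration.

```python
# The six steps that keep U + V + W == 0: one value goes up by 1,
# another goes down by 1, the third stays the same.
HEX_NEIGHBOR_OFFSETS = [
    (0, 1, -1), (0, -1, 1),
    (1, -1, 0), (-1, 1, 0),
    (1, 0, -1), (-1, 0, 1),
]


def are_hex_neighbors(p, q):
    """Return True if hexagonal pixels p and q, given as (U, V, W), share a side."""
    if sum(p) != 0 or sum(q) != 0:
        raise ValueError("hexagonal coordinates must satisfy U + V + W == 0")
    differences = sorted(b - a for a, b in zip(p, q))
    return differences == [-1, 0, 1]


def hex_neighbors(p):
    """Return the six neighbors of the hexagonal pixel p = (U, V, W)."""
    u, v, w = p
    return [(u + du, v + dv, w + dw) for du, dv, dw in HEX_NEIGHBOR_OFFSETS]
```

Unlike the rectangular case, there is only one kind of neighbor to test for, so no classification step is needed.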
Three `coordinates` allow us to calculate the Cartesian coordinates of the center of a hexagonal pixel:
x = a*(V-W)/2
y = h*U
where `a` is the distance between the centers of neighboring pixels and
h = a*sqrt(3)/2.
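The formulas above translate directly into code. The following sketch also checks numerically that the centers of all six neighbors of a pixel lie at the same distance `a` from its center, which is the isotropy property that motivates the hexagonal grid. The function name `hex_center` and the default value `a=1.0` are assumptions made for this example.

```python
import math


def hex_center(u, v, w, a=1.0):
    """Cartesian center (x, y) of the hexagonal pixel with coordinates (U, V, W)."""
    if u + v + w != 0:
        raise ValueError("hexagonal coordinates must satisfy U + V + W == 0")
    h = a * math.sqrt(3) / 2
    x = a * (v - w) / 2
    y = h * u
    return x, y


# The six neighbors of the pixel (0, 0, 0).
neighbors = [
    (0, 1, -1), (0, -1, 1),
    (1, -1, 0), (-1, 1, 0),
    (1, 0, -1), (-1, 0, 1),
]

# Distance from the center of (0, 0, 0) to the center of each neighbor.
x0, y0 = hex_center(0, 0, 0)
distances = [
    math.hypot(x - x0, y - y0)
    for x, y in (hex_center(u, v, w) for u, v, w in neighbors)
]
```

With `a=1.0`, every entry of `distances` equals 1 up to floating-point rounding, in contrast to the two different distances found on a square grid.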
Natural vision, both in insects and in more advanced animals, relies on an approximately hexagonal arrangement of light-sensitive elements, because this arrangement provides the densest packing of round physical sensing elements.
With the connection between topology and vision clarified, we turn to congenital (innate, hard-coded) knowledge. Several previous chapters mentioned congenital knowledge, meaning knowledge that is somehow present in the system from the start.
Both the code of a computer vision module and a natural nervous system inevitably use `congenital` mechanisms for detecting the proximity of pixels in visual information. This means that the formulas, in the case of program code, or the neural connections, in the case of neural networks, encode the corresponding innate knowledge.
This, in particular, explains why the inclusion of convolutional layers in artificial neural networks is necessary for efficient image processing: a convolution combines each pixel only with its neighbors, so it relies on information about how the image is represented and is thus part of the innate knowledge about the topology of image elements. Using a hexagonal grid of pixels to represent images in artificial neural networks requires a corresponding modification of the structure of the convolutional layers of the network.
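To make the required modification more concrete, here is a minimal sketch of a convolution-like operation on a hexagonal grid, written in plain Python rather than with any neural network library. It assumes that the image is stored as a dictionary from (U, V, W) to a number, that the kernel covers a pixel and its six neighbors, and that pixels outside the image count as zero. As in most neural network code, the operation is strictly a cross-correlation. The name `hex_convolve` and its parameters are hypothetical.

```python
HEX_OFFSETS = [
    (0, 1, -1), (0, -1, 1),
    (1, -1, 0), (-1, 1, 0),
    (1, 0, -1), (-1, 0, 1),
]


def hex_convolve(image, center_weight, neighbor_weights):
    """Apply a seven-element kernel to a hexagonal image.

    image: dict mapping (U, V, W) to a pixel value.
    center_weight: weight of the pixel itself.
    neighbor_weights: dict mapping each neighbor offset to its weight.
    Missing neighbors are treated as zero (an assumption of this sketch).
    """
    result = {}
    for (u, v, w), value in image.items():
        total = center_weight * value
        for (du, dv, dw), weight in neighbor_weights.items():
            total += weight * image.get((u + du, v + dv, w + dw), 0.0)
        result[(u, v, w)] = total
    return result


# A tiny image: the pixel (0, 0, 0) and its six neighbors, all equal to 1.
image = {(0, 0, 0): 1.0}
image.update({offset: 1.0 for offset in HEX_OFFSETS})

# An averaging kernel with equal weights for the pixel and its neighbors.
averaging = {offset: 1.0 / 7.0 for offset in HEX_OFFSETS}
smoothed = hex_convolve(image, 1.0 / 7.0, averaging)
```

The knowledge of which pixels are neighbors is contained entirely in `HEX_OFFSETS`; this list plays the role that the rectangular kernel window plays in an ordinary convolutional layer.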
Video cameras deliver information in the traditional rectangular pixel grid. Still, the data can be converted to a hexagonal grid by the same kind of resampling that is used to transform image data when changing image resolution. Downscaling is often a stage in visual information processing by neural networks.
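One possible way to perform such a conversion is sketched below: the center of every hexagonal pixel is computed from x = a*(V-W)/2 and y = h*U, and the rectangular image is sampled at that point with bilinear interpolation. This is only an illustration under stated assumptions: rectangular pixel centers are placed at integer (column, row) positions, `a` is measured in units of the rectangular pixel pitch, and bilinear interpolation stands in for whatever resampling filter a real pipeline would use. The names `bilinear` and `rect_to_hex` are hypothetical.

```python
import math


def bilinear(image, x, y):
    """Bilinearly interpolate image[row][col] at the point (x=col, y=row)."""
    rows, cols = len(image), len(image[0])
    c0 = max(0, min(int(math.floor(x)), cols - 1))
    r0 = max(0, min(int(math.floor(y)), rows - 1))
    c1 = min(c0 + 1, cols - 1)
    r1 = min(r0 + 1, rows - 1)
    fx, fy = x - c0, y - r0
    top = image[r0][c0] * (1 - fx) + image[r0][c1] * fx
    bottom = image[r1][c0] * (1 - fx) + image[r1][c1] * fx
    return top * (1 - fy) + bottom * fy


def rect_to_hex(image, a=1.0):
    """Resample a rectangular image onto hexagonal pixels keyed by (U, V, W)."""
    rows, cols = len(image), len(image[0])
    h = a * math.sqrt(3) / 2
    hex_image = {}
    u = 0
    while h * u <= rows - 1:
        # Since W = -U - V, x = a*(V-W)/2 = a*(V + U/2); start at the first x >= 0.
        v = math.ceil(-u / 2)
        while a * (v + u / 2) <= cols - 1:
            w = -u - v
            x = a * (v - w) / 2
            y = h * u
            hex_image[(u, v, w)] = bilinear(image, x, y)
            v += 1
        u += 1
    return hex_image


# A 4x4 test image whose brightness equals the column number.
gradient = [[float(col) for col in range(4)] for _ in range(4)]
hex_image = rect_to_hex(gradient, a=1.0)
```

Choosing `a` larger than 1 combines the change of grid with downscaling in a single pass, although in that case a proper implementation would also average over the area of each hexagonal pixel instead of sampling a single point.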