Popular science articles and news about *artificial intelligence*, *Deep Learning*, *Large Language Models*, *GPT*, and so on, voluntarily or unwittingly, contribute to the spread of an inadequate understanding of neural networks. This chapter discusses aspects that are rarely mentioned in popular sources of information and that interfere with a realistic assessment of the situation.

The tone and number of publications about software tools based on neural networks can give the impression that this is a very complex technology based, in essence, on some new ideas. In reality, the situation is precisely the opposite. The concept of neural networks has been known for about half a century and is very simple in essence, and the program code that implements a functioning neural network takes about a page. Why, then, have tools based on neural networks only come into practical use in the last decade? The reason is the other side of the coin: the simplicity of the idea is offset by the need for enormous computational resources to produce valuable results in practice.

From a *mathematical* point of view, a neural network is an ordinary **function**, similar to the **sin(x)** function or to a function that determines the shortest path between two points on the Earth's sphere.

Using a neural network, like using any function, comes down to setting the arguments and calculating the result. If a software tool is capable of more, those additional features are not implemented by the neural network itself; other components are used alongside it.
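As an illustrative sketch (the constants below are chosen by hand and do not come from any real training or from the repository mentioned later), a tiny neural network in C++ is literally just an ordinary function of its arguments:

```cpp
#include <cmath>

// Hypothetical example: a network with two inputs, two hidden neurons
// and one output. All constants (weights and biases) are fixed by hand.
double tinyNet( double x, double y ){
  // Hidden layer: two neurons with tanh activation.
  double h1 = std::tanh(  0.5*x - 0.3*y + 0.1 );
  double h2 = std::tanh( -0.2*x + 0.8*y - 0.4 );
  // Output neuron: a linear combination of the hidden outputs.
  return 1.5*h1 - 0.7*h2 + 0.2;
}
```

Calling `tinyNet( 1.0, 2.0 )` is no different in kind from calling `sin( 1.0 )`: arguments in, number out.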

From a *programming* point of view, a neural network is an enormous **expression**. Implementation complexity is determined solely by quantitative factors: the number of constants in this function/expression was first measured in thousands, then in millions, and today it reaches billions.

Valuable use of a neural network requires constructing an expression with a *structure* capable of providing the desired results, and then assigning **values to all constants** of this expression.

The structure of the expression is given by the structure of the neural network (a *directed graph*), which is converted into a sequence of elementary operations for calculation.
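This conversion can be sketched as follows (all names and the buffer layout here are illustrative, not taken from any particular library): the topologically ordered graph becomes a flat list of elementary operations executed in sequence over a buffer of node values.

```cpp
#include <cmath>
#include <vector>

// Elementary operation kinds of the flattened graph (illustrative).
enum class Op { MulAdd, Tanh };

struct Step {            // one step of the flattened graph
  Op     op;
  int    dst, a, b;      // slots in the value buffer
  double wa, wb, bias;   // constants (weights)
};

// Execute the operation sequence; slots 0 and 1 hold the inputs,
// the last slot holds the output.
double run( const std::vector<Step>& program, double x, double y ){
  std::vector<double> v( 8, 0.0 );
  v[0] = x; v[1] = y;
  for( const Step& s : program ){
    switch( s.op ){
      case Op::MulAdd: v[s.dst] = s.wa*v[s.a] + s.wb*v[s.b] + s.bias; break;
      case Op::Tanh:   v[s.dst] = std::tanh( v[s.a] );                break;
    }
  }
  return v.back();
}
```

A three-step program of this kind, for example, encodes a single tanh neuron: one `MulAdd` forming the weighted sum of the inputs, one `Tanh`, and one `MulAdd` copying the result to the output slot.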

It has been proven (see the Universal Approximation Theorem) that the neural network as a category of objects is an *ideal approximator*: any function can be represented by some specific neural network that computes it with the required accuracy. Wonderful, isn't it? But there is a **nuance**: the **theorem gives no hints as to what the structure of the neural network should be for a particular task, or what values of the constants will produce the desired result of the calculation**. Therefore, the degree of success in using a neural network depends radically on which network structure is chosen and which method of selecting the constants is used.

Choosing the structure of a neural network is an informal task; it comes down to trial and error and to reusing structures that have proven helpful in similar tasks. In its combination of methods, this is a modern analog of *alchemy*.

The term "*selection* of constants" is not used by chance. In contrast to "classical" methods of approximation, where the constants of the approximating function are **calculated by solving a system of equations**, in the case of neural networks it is **selection** that takes place; the selection process received the inherently slangy name of neural network **training**. Choosing the values of a gigantic set of constants naturally requires equally tremendous computational resources. Reducing the time of this selection ("training") process dictates the use of supercomputers and GPUs/TPUs, which in turn increases the complexity and size of the code that implements the selection.

The process of selecting the values of the constants is based on an estimate of the *integral value of the errors*. The integral estimate analyzes the difference between the expected values and the values produced by the tested set of constants. The error is calculated for **each sample from the "training" data set**, and the quality of the set of constants is then evaluated. Thus, selecting constants ("training" a neural network) is a **task of minimizing the approximation error**. The minimization problem is usually reduced to a **step-by-step correction of the values of the constants** that reduces the integral error (using, for example, **gradient descent**). This process is computationally very costly, which, together with the high labor intensity of preparing "training" data, makes training expensive.

A problem not emphasized in publications devoted to neural network technology (*Deep Learning, LLM, etc.*) is that for complex applications the **task of minimizing the integral error by choosing constants is multi-extremal**. Stepwise improvement leads to **one of many extremums**; their number is unknown, and which one is reached **depends on the starting set of constants** (and partly on the method of stepwise improvement: a large step can lead from the neighborhood of one minimum to the neighborhood of another). In most cases, the recommendation comes down to choosing the initial set at random and repeating the selection/training process for several starting combinations. However, it seems more reasonable to organize the selection in two stages: in the first stage, many random combinations of constants are tested and the combination with the smallest integral error is selected; in the second stage, that combination is improved step by step.

Another problem is that the integral estimate of the error (a *scalar* value) may not be adequate to the essence of the problem: two different sets of constants with the same integral error will produce their most sensitive errors in different situations. Which option is preferable must be clarified using additional considerations.
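The two-stage procedure can be sketched on a deliberately toy one-parameter-pair model (everything below, data included, is illustrative; a real network would have vastly more constants, but the shape of the procedure is the same):

```cpp
#include <cmath>
#include <random>
#include <vector>

struct Sample { double x, target; };

// Toy "network": a single tanh neuron with constants w and b.
double model( double w, double b, double x ){ return std::tanh( w*x + b ); }

// Integral error: sum of squared differences over the "training" set.
double integralError( double w, double b, const std::vector<Sample>& data ){
  double e = 0.0;
  for( const Sample& s : data ){
    double d = model( w, b, s.x ) - s.target;
    e += d*d;
  }
  return e;
}

// Stage 1: test many random combinations of constants, keep the best.
// Stage 2: improve it step by step (gradient descent; here the gradient
// is approximated numerically for simplicity).
void train( const std::vector<Sample>& data, double& w, double& b ){
  std::mt19937 rng( 1 );
  std::uniform_real_distribution<double> U( -3.0, 3.0 );
  double best = 1e300;
  for( int i = 0; i < 200; i++ ){                       // stage 1
    double wi = U( rng ), bi = U( rng );
    double e  = integralError( wi, bi, data );
    if( e < best ){ best = e; w = wi; b = bi; }
  }
  const double h = 1e-6, rate = 0.05;
  for( int i = 0; i < 1000; i++ ){                      // stage 2
    double e  = integralError( w, b, data );
    double gw = ( integralError( w + h, b, data ) - e )/h;
    double gb = ( integralError( w, b + h, data ) - e )/h;
    w -= rate*gw;
    b -= rate*gb;
  }
}
```

Stage 1 makes it likely that stage 2 starts in the basin of a good minimum instead of an arbitrary one; stage 2 then does the fine-grained descent.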

Those who are interested in "playing around" with neural networks can use our straightforward implementation, which is convenient for experimentation (a single C++ header file): https://github.com/mrabchevskiy/neuro

The pictures below show the results of an unsuccessful and a successful selection of the starting constant combination for the problem of neural network approximation of a Boolean function of two variables:

**f(x,y) = ( x*x + y*y ) > R*R**

which is equal to **1** outside the circle of radius **R** and **0** inside it.
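A "training" data set for this function can be generated, for instance, as follows (an illustrative sketch, not code from the repository):

```cpp
#include <random>
#include <vector>

struct Sample { double x, y, f; };

// Build a "training" set for f(x,y) = (x*x + y*y) > R*R
// with points sampled uniformly from the square [-1,1] x [-1,1].
std::vector<Sample> makeTrainingSet( double R, int n ){
  std::mt19937 rng( 7 );
  std::uniform_real_distribution<double> U( -1.0, 1.0 );
  std::vector<Sample> data;
  data.reserve( n );
  for( int i = 0; i < n; i++ ){
    double x = U( rng ), y = U( rng );
    double f = ( x*x + y*y > R*R ) ? 1.0 : 0.0;  // 1 outside the circle, 0 inside
    data.push_back( { x, y, f } );
  }
  return data;
}
```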

The neural network configuration, including the values of the constants, is saved in the file (and loaded from the file); in the case of a two-dimensional task, images are rendered using the cross-platform **SFML** package. Details are in the comments inside the code.

In conclusion, the approach described here, *choosing constants* for an **unchanged function structure** (which makes neural networks similar to classical approximation methods), is not the only possible one. Alternative approaches based on **selection of the function structure** can be found in the chapter AGI: STRUCTURES DISCOVERING of this series and in the talks and publications of **Arthur Franz**, e.g., *Experiments on the generalization of machine learning*, AGI-21 Conference.