This chapter is devoted to questions at the intersection of mathematics, physics, epistemology, and philosophy raised by Large Language Models, whose technical basis is neural networks.
The first neural network models were used to predict quantitative values that depend on many quantitative factors, such as temperature or pressure in weather forecasting.
Neural network model technology, as described in AGI: ARTIFICIAL NEURAL NETS, comes down to finding the values of the network's numerical parameters at which the input data from the "training set" gives the best practically achievable result; for this, each training sample is accompanied by its expected result. The differences between the network's responses on the test data and the expected ones form an overall assessment of the model's quality. In the process of "training", the numerical parameters of the neural network are modified step by step, improving this integral quality estimate. The idea of the process is straightforward, even though its practical implementation is challenging when the network has a huge number of parameters.
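A minimal sketch of this training loop, assuming a toy linear "network" fitted by gradient descent (the data, learning rate, and step count here are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 2))        # training inputs
y_true = 3.0 * X[:, 0] - 2.0 * X[:, 1]      # expected result for each sample

w = np.zeros(2)                              # the network's numerical parameters
for step in range(1000):
    y_pred = X @ w                           # the network's current answers
    err = y_pred - y_true                    # differences from expected results
    grad = 2 * X.T @ err / len(X)            # gradient of the mean squared error
    w -= 0.1 * grad                          # step-by-step parameter modification

print(w)                                     # converges to [3, -2]
```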
It has been proven that a neural network is a universal approximator: no matter how complex the dependence of the correct results on the input data, it is possible to build a neural network that models it with any given accuracy. And the proof, as it should be in mathematics, is based on a specific set of assumptions that must be fulfilled for the resulting statement (in this case, about the universality of the approximation) to be correct. Such assumptions are stipulated in the texts of the proofs; outside those texts, they are more often implied than explicitly stated.
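For reference, one standard formulation of this result (Cybenko's 1989 theorem for sigmoidal activations; stated here as background, not quoted from the text above) reads:

```latex
% Universal approximation theorem (Cybenko, 1989), sigmoidal version:
% every continuous target on a compact domain can be matched to accuracy
% \varepsilon by some finite one-hidden-layer network.
\forall f \in C(K),\ K \subset \mathbb{R}^n \ \text{compact},\ \forall \varepsilon > 0:\quad
\exists N,\ v_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^n \ \text{such that}
\;\Bigl|\, f(x) - \sum_{i=1}^{N} v_i\, \sigma\!\bigl(w_i^{\top} x + b_i\bigr) \Bigr| < \varepsilon
\quad \text{for all } x \in K .
```

Note that f is required to be continuous - exactly the assumption the next paragraphs examine.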
In our case, such assumptions concern the quantitative nature of the numerical values at the input and output of the neural network. Moreover, continuity is required - that is, the values can differ from each other by an arbitrarily small amount. This is the "standard" assumption for both classical approximation methods and neural network models, and it is quite natural: if the goal is to make the integral error of the model as small as possible, the error of the output for specific input data must be able to become arbitrarily small. Simply put, all inputs and results are assumed to be fractional numbers.
Less obvious and, therefore, rarely voiced is the requirement to have a way to calculate the error in each specific training case. Suppose, for example, we are calculating the volume of an egg from two measurements (the largest and smallest diameters). In that case, the error is the difference between the actual volume and the neural network's estimate. Both diameters are specified as fractional numbers, and the error is a fractional number as well. As the integral error over the set of training tasks, the root mean square of the errors across all training tasks is typically used, but this detail is not essential.
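In code, the per-sample and integral errors for the egg example might look as follows (the ellipsoid formula V = (π/6)·L·B² stands in for the "actual" volume, and `model` is a deliberately imperfect stand-in for the network; both are assumptions for illustration):

```python
import numpy as np

def actual_volume(L, B):
    """Ellipsoid approximation of egg volume: cm in, cm^3 (= mL) out."""
    return np.pi / 6 * L * B**2

def model(L, B):
    """Hypothetical imperfect neural-network estimate."""
    return 0.5 * L * B**2

L = np.array([5.5, 6.0, 5.8])                # largest diameters, cm
B = np.array([4.2, 4.4, 4.3])                # smallest diameters, cm

per_sample_error = model(L, B) - actual_volume(L, B)   # fractional numbers, mL
rmse = np.sqrt(np.mean(per_sample_error**2))           # integral estimate, mL
print(per_sample_error, rmse)
```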
From the point of view of mathematics, numerical values differ from non-numerical ones in that arithmetic operations make sense for numerical ones. In everyday life, we rarely distinguish between a number and a sequence of digits. For example, a house number or telephone number is often perceived as a numerical value. But it makes no sense to calculate the product of a house number and a telephone number, so these are not numerical values but words/labels/tags made up of digits. Likewise, it makes no sense to calculate an error in a telephone number as a difference of telephone numbers; a phone number can only be correct or incorrect, TRUE or FALSE. And even if these logical values are encoded with the digits 1 and 0, whatever the two phone numbers are, the result of comparing them cannot be anything other than 0 or 1 - and it cannot be calculated using arithmetic operations on the phone numbers themselves: they are pseudo-numbers.
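The contrast fits in a few lines (the phone numbers are invented for the example):

```python
# A genuine numerical value: the difference is a meaningful, graded error.
measured, predicted = 98.6, 97.2
print(measured - predicted)          # -1.4 degrees: small error, small number

# A pseudo-number: comparison can only yield TRUE or FALSE.
actual, dialed = "5550142", "5550143"
print(actual == dialed)              # False; "off by one" still reaches a stranger
```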
Now it's the turn of physics. In practice, the error is not "just a number": like the input values, it has a particular "physical dimension" - for example, an error in a calculated volume has the dimension of volume (milliliters, cubic inches). In addition to the physical dimension, such quantities have a "semantic dimension" that makes it possible to interpret the result reasonably: adding length to voltage, or frequency to temperature, makes no sense. This matters for the calculation of the integral error over the set of training samples: the errors for all samples must have the same physical dimension and the same meaning. For example, the "dimensionless" error in the probability of rain cannot be combined in the integral error calculation with the error in the probability of winning the lottery; this is meaningless due to the different semantic dimensions, since no semantics can be formulated for the resulting value.
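What such dimension checking amounts to can be sketched as follows (the Quantity class and its unit strings are illustrative, not a real units library):

```python
from dataclasses import dataclass

@dataclass
class Quantity:
    value: float
    unit: str

    def __add__(self, other):
        # Quantities may be combined only within one physical dimension.
        if self.unit != other.unit:
            raise TypeError(f"cannot add {self.unit} to {other.unit}")
        return Quantity(self.value + other.value, self.unit)

print(Quantity(3.0, "mL") + Quantity(1.5, "mL"))   # fine: same dimension
try:
    Quantity(3.0, "mL") + Quantity(1.5, "V")       # volume plus voltage
except TypeError as e:
    print(e)                                       # cannot add mL to V
```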
The continuity of the resulting values and their semantic compatibility mean that if the input set X1 gives the result Y1, and the input set X2 gives the result Y2, then intermediate values of X between X1 and X2 will correspond to results between Y1 and Y2, and those results can be interpreted reasonably. The essence of approximation rests precisely on this: if the input data does not match any of the training tasks exactly, the neural network must calculate the result based on proximity to the training tasks. No less important, the results obtained between Y1 and Y2 can be clearly interpreted thanks to continuity; the absence of continuity immediately removes any possibility of a meaningful interpretation (2.73 pianos, 3.14 publications, etc.).
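The same point in code (`f` stands in for a trained continuous model; its formula and the inputs are arbitrary):

```python
import numpy as np

def f(x):
    """Stand-in for a trained continuous model."""
    return 0.5236 * x[0] * x[1] ** 2

x1 = np.array([5.0, 3.5])                # input set X1
x2 = np.array([6.0, 4.5])                # input set X2

# Because f is continuous, inputs moving from X1 to X2 produce outputs that
# trace a continuous path from Y1 to Y2, every point of which is interpretable.
for t in np.linspace(0.0, 1.0, 5):
    x = (1 - t) * x1 + t * x2
    print(round(float(t), 2), round(float(f(x)), 2))
```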
It is crucial that the elements of the input dataset, in contrast to the magnitude of the test error, may have different dimensions: their transformation is handled by processes inside the neural network, whereas the integral estimate is calculated outside the network, by the system that forms the neural model.
All the requirements listed above are practically never articulated in publications relating to neural networks - which is not a problem as long as they are met.
With the growth of the capabilities of computers and of the number of parameters in neural networks, the areas of application were successfully expanded with image recognition tasks. There, all of the above assumptions are fulfilled: an image is encoded by a set of numerical brightness values, the results reflect the degree of similarity of the analyzed image to the training images, and for any pair of images there is an interpretable chain of intermediate options.
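A toy illustration of that interpretable chain, with two 8×8 "images" as arrays of brightness values:

```python
import numpy as np

img_a = np.zeros((8, 8))                 # an all-dark image
img_b = np.ones((8, 8))                  # an all-bright image

# Every pixel-wise blend of two images is itself a valid image: the
# intermediate options can be both constructed and interpreted.
for t in (0.25, 0.5, 0.75):
    blend = (1 - t) * img_a + t * img_b
    print(t, float(blend.mean()))        # mean brightness moves smoothly
```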
Finally, the developers of neural networks took up texts in natural language. It is difficult to say how many of them clearly understood that, in this case, the implicit assumptions listed above cease to be true. In any case, this circumstance was wholly dismissed; nevertheless, neural networks showed results that inspired developers to "thicken" the models to combat the visible consequences of violating the implicit assumptions - consequences that are by now well known to the public.
Texts, by their nature, are not continuous objects; intermediate options between two texts, unlike images, cannot be interpreted even if they can be constructed. This is a natural consequence of the fact that the elements of language - words and phrases - encode CONCEPTS, and concepts are discrete in essence. There is no rational way to interpret a range of "intermediate" concepts between "philosophy" and "scrambled eggs," or between "width" and "tomorrow"; it is pointless to ask which difference is greater - between artiodactyls and programming, or between a cloud and a string. Numbers are naturally used to encode lexemes - but that use is akin to arithmetic operations on telephone numbers. Correct manipulation of concepts uses a different set of operations than numerical values do, primarily "is" ("a sparrow is a bird").
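The telephone-number analogy applied to lexemes (the token IDs below are invented for the example):

```python
# Numeric codes for words are labels, not quantities.
vocab = {4032: "philosophy", 5875: "scrambled eggs", 7718: "tomorrow"}

midpoint = (4032 + 7718) // 2            # the "average" of two concepts
print(vocab.get(midpoint, "<nothing>"))  # 5875 -> "scrambled eggs"
# Arithmetically exact, semantically meaningless: there is no concept
# "halfway between" philosophy and tomorrow.
```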
In addition, the numerical value of the error here has neither physical nor semantic dimension: a semantic measure of the difference between two concepts, if it can be defined at all, comes from operations on concepts, not on the numbers used as their identifiers. This can be illustrated by the numbering of houses: the proximity of house numbers may correspond to proximity in space, but not always - on a ring street, the houses with the largest and smallest numbers may be next-door neighbors (a code sketch after the next paragraph makes this concrete). As a result, the universal approximator theorem ceases to be valid. What does this mean in practice? Situations are possible in which the neural network evaluates an error as insignificant because the difference in numbers is small, while the semantic error is significant - as is demonstrated by even the advanced versions of Large Language Models, regardless of the size of the neural network. Below are a couple of such errors, in which ChatGPT "substantiates" apparently false claims:
[ChatGPT screenshots omitted: the claims were that a sphere can have a larger volume than a cube of the same size, and that horns and tree branches branch at their base.]
In reality, everything is precisely the opposite: in a space of any dimension, a sphere has a smaller volume than the cube of the same size, since all points of the sphere lie inside the cube while some points of the cube (the vertices and their neighborhoods) lie outside the sphere - and the formula ChatGPT itself included confirms this; and the horns and branches of trees can branch only because their growing points are at the ends of the branches, not at the base.
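Returning to the house-numbering illustration above, a few lines make the mismatch between numeric and semantic proximity concrete (the street layout is hypothetical):

```python
N = 100                                    # houses 1..N arranged on a ring

def spatial_distance(a, b):
    """Distance in houses along the ring, not in house-number arithmetic."""
    return min(abs(a - b), N - abs(a - b))

print(abs(1 - 100))                        # 99: a huge "numeric error"
print(spatial_distance(1, 100))            # 1: the houses are next door
```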
The logical path from violating the implicit assumptions to "strange" results (like those above, or explanations of how to achieve a negative friction coefficient) is probably not apparent to developers familiar with the technique of using neural networks but not with classical applied mathematics. Therefore, attempts to achieve the impossible - manipulating numbers arithmetically instead of manipulating concepts with concept-specific operations - will continue for some time.
Those whose goal is AGI with a full range of relevant abilities have the opportunity to spend their effort not on the Sisyphean work of teaching neural networks logic, but on realistic ways of achieving the goal - in particular, on representing concepts and the relations between them with semantic graphs: SEMANTIC STORAGE
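A minimal sketch of such a representation (the graph structure and queries are illustrative, in the spirit of, but not taken from, the SEMANTIC STORAGE material referenced above):

```python
# Concepts and "is" relations as a tiny semantic graph.
isa = {"sparrow": "bird", "bird": "animal", "piano": "instrument"}

def is_a(concept, category):
    """Follow "is" edges upward: concept manipulation, not arithmetic."""
    while concept in isa:
        concept = isa[concept]
        if concept == category:
            return True
    return False

print(is_a("sparrow", "animal"))   # True ("a sparrow is a bird is an animal")
print(is_a("piano", "animal"))     # False
```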