This chapter is devoted to aspects at the intersection of mathematics, physics, epistemology, and philosophy regarding **Large Language Models**, whose technical basis is neural networks.

The first neural network models were used as a means of predicting quantitative values that depend on many quantitative factors - such as temperature or pressure in weather forecasts.

Neural network model technology, as described in AGI: ARTIFICIAL NEURAL NETS, comes down to finding the values of the numerical parameters of the neural network at which the input data from the "*training set*" give the best practically achievable result; for this, each sample of training data is accompanied by the expected result. The differences from the expected responses on the test data form an overall assessment of the quality of the model. In the process of "**training**", step by step, the numerical parameters of the neural network are modified, improving the integral estimate of the quality of the model. The idea of the process is transparent, even though its practical implementation is challenging with a huge number of neural network parameters.

It has been proven that the neural network is a *universal approximator*: no matter how complex the dependence of the correct results on the input data is, **it is possible to build a neural network that will provide the given accuracy of the model**. And the proof, as it should be in mathematics, is based on a **specific set of assumptions that must be fulfilled for the resulting statement** (in this case, about the universality of the approximation) to **be correct**. Such assumptions are stipulated in the texts of the proofs; outside these texts, however, they are more often **implied without being explicitly stated**.

In our case, such assumptions concern the quantitative nature of the numerical values at the input and output of the neural network. Moreover, *continuity* is required - that is, the values can differ from each other by an arbitrarily small amount. This is the "standard" assumption for both classical approximation methods and neural network models. The requirement is quite natural: if the goal is to **make the integral error of the model as small as possible**, then **the error of the output for specific input data should be arbitrarily small**. Simply put, all inputs and results are assumed to be **fractional numbers**.

Less obvious and, therefore, rarely voiced is the requirement *to have a way to calculate the error in each specific training case*. Suppose, for example, we are calculating the volume of an egg from two sizes (its largest and smallest diameter). In that case, the error is the difference between the actual volume and the neural network's calculation: both diameters are specified as fractional numbers, and the error is a fractional number as well. As the integral error on a set of training tasks, the **root mean square** of the errors for all training tasks is used as a rule, but this is not essential.

From the point of view of mathematics, numerical values differ from non-numerical ones in that *arithmetic operations make sense for numerical ones*. In everyday life, we rarely distinguish between numbers and sequences of digits. For example, a house number or a telephone number is often perceived as a numerical value. Yet it **makes no sense to calculate the product of a house's street number and a telephone number**, so these are **not numerical values** but **words/labels/tags made up of digits**. Therefore, it also **makes no sense to calculate an error in a telephone number as a difference of telephone numbers**; a phone number can only be **correct** or **incorrect**, **TRUE** or **FALSE**. And even if these logical values are encoded with the digits **1** and **0**, then whatever the two phone numbers are, the result of their comparison cannot be anything other than 0 or 1 - and it cannot be calculated using arithmetic operations on the phone numbers, which are mere **pseudo-numbers**.

Now it is the turn of physics. In practice, the error, like the input values, is not "just a number": these quantities have a particular "*physical dimension*"; for example, the error in calculating a volume has the dimension of units of **volume** (milliliters, cubic inches). In addition to the physical dimension, these quantities have a "**semantic dimension**" that makes it possible to interpret the result reasonably. For example, adding **length** to **voltage** or **frequency** to **temperature** makes no sense. This concerns the calculation of the integral error on the set of training samples: **the errors for all samples must have the same physical dimension and the same meaning**. For example, the "dimensionless" **rain probability error** cannot be combined in the integral error calculation with the **probability of winning the lottery**: this is meaningless due to the different semantic dimensions, since it is impossible to formulate the semantics of the final value.

The continuity of the resulting values and their semantic compatibility mean that if the input set **X1** gives the result **Y1**, and the input set **X2** gives the result **Y2**, then intermediate values of **X** between **X1** and **X2** will correspond to some *results between Y1 and Y2 that can be interpreted reasonably*. The essence of approximation rests precisely on this: if the input data does not match any of the training tasks exactly, the neural network must calculate the result based on proximity to the training tasks. No less important is that the results obtained between **Y1** and **Y2** can be clearly interpreted thanks to the continuity; the **absence of continuity immediately removes the possibility of a meaningful interpretation** (a 2.73 piano, 3.14 publications, etc.).

It is crucial that the elements of the input dataset, in contrast to the magnitude of the test error, may have different dimensions - their transformation is handled by processes inside the neural network, whereas the calculation of the integral estimate is performed outside the network, by the system that forms the neural model.
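The training process and the root-mean-square integral error described above can be condensed into a toy sketch: a "network" with a single numerical parameter `w`, fitted by gradient steps to an invented three-sample training set (all names and numbers here are illustrative, not taken from the text):

```python
# Minimal sketch of "training": adjust a numerical parameter step by step
# so that the integral (root-mean-square) error over the training set shrinks.
# The one-parameter linear model and the data are invented for illustration.
import math

training_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, expected result)

def rms_error(w):
    # integral estimate of model quality: root mean square of per-sample errors
    return math.sqrt(sum((w * x - y) ** 2 for x, y in training_set) / len(training_set))

w = 0.0  # the single numerical parameter of our "network"
for _ in range(200):
    # gradient of the mean-square error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in training_set) / len(training_set)
    w -= 0.05 * grad  # one training step

print(round(w, 3))             # ≈ 2.0
print(round(rms_error(w), 6))  # ≈ 0.0
# Continuity at work: an input between the training samples yields an
# interpretable intermediate result.
print(round(w * 1.5, 2))       # ≈ 3.0
```

The continuity assumption is what makes the last line meaningful: an input between two training samples lands on a result between their expected outputs.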

All the requirements listed above are practically never articulated in publications on neural networks - which is not a problem *as long as they are met*.
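For illustration, here is one way such an assumption could be made explicit in code - a sketch with an invented `Quantity` class that refuses to subtract, or to aggregate into one integral error, values of different physical dimensions:

```python
# Sketch of making the "physical dimension" assumption explicit:
# errors may only be compared and aggregated when their dimensions match.
# The Quantity class and the unit names are invented for illustration.
import math

class Quantity:
    def __init__(self, value, dimension):
        self.value = value          # numeric magnitude
        self.dimension = dimension  # e.g. "volume", "length", "voltage"

    def __sub__(self, other):
        if self.dimension != other.dimension:
            raise TypeError(f"cannot subtract {other.dimension} from {self.dimension}")
        return Quantity(self.value - other.value, self.dimension)

def rms(errors):
    # an integral error only makes sense over errors of one dimension
    dims = {e.dimension for e in errors}
    if len(dims) != 1:
        raise TypeError("errors of mixed dimension cannot form one integral estimate")
    return math.sqrt(sum(e.value ** 2 for e in errors) / len(errors))

vol_err = Quantity(55.0, "volume") - Quantity(52.0, "volume")  # fine: 3.0 ml
print(rms([vol_err]))  # a legal integral error: 3.0
try:
    Quantity(1.0, "length") - Quantity(50.0, "voltage")  # meaningless
except TypeError as e:
    print(e)
```

The semantic dimension discussed above (rain probability vs. lottery probability) would need the same kind of guard, only on meaning labels rather than unit labels.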

With the growth of the capabilities of computers and of the number of parameters of neural networks, the area of application was successfully expanded to the tasks of *image recognition*. Here, all of the above assumptions are still fulfilled: an image is encoded by a set of numerical brightness values, the results reflect the degree of similarity of the analyzed image to the training images, and **for any pair of images, there is an interpretable chain of intermediate options**.

Finally, the developers of neural networks took up *texts in natural language*. It is difficult to say how many of them clearly understood that, in this case, the implicit assumptions listed above **cease to be true**. In any case, this circumstance was wholly dismissed; nevertheless, neural networks showed results that inspired developers to "thicken" the models in order to combat the visible consequences of violating the implicit assumptions - consequences that are by now well known to the public.
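The "interpretable chain of intermediate options" between any two images can be sketched directly: every convex combination of two brightness arrays is itself a valid image. The toy four-pixel images below are invented for illustration:

```python
# Any blend of two brightness arrays is still an image - this is the
# continuity that image recognition enjoys and text does not.
img_a = [0.0, 0.2, 0.9, 1.0]   # toy 4-pixel image, brightness in [0, 1]
img_b = [1.0, 0.8, 0.1, 0.0]

def blend(a, b, t):
    # t = 0 gives a, t = 1 gives b; every t in between is still an image
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

print(blend(img_a, img_b, 0.5))  # each pixel ≈ 0.5: a valid gray image
```

No analogous `blend` exists for the concepts "philosophy" and "scrambled eggs" - there is nothing the halfway point could mean.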

Texts, by their nature, are *not continuous objects*: intermediate options between two texts, unlike those between two images, cannot be interpreted even when they can be constructed. This is a natural consequence of the fact that the **elements of a language - words and phrases - encode CONCEPTS**, and **concepts are discrete in essence**. There is no rational way to interpret a range of "intermediate" concepts between "**philosophy**" and "**scrambled eggs**", or between "**width**" and "**tomorrow**"; it is pointless to compare which difference is greater - between **artiodactyls** and **programming**, or between a **cloud** and a **string**. Numbers are naturally used to **encode** lexemes - but their use is **similar to arithmetic operations on telephone numbers**. Correct manipulation of concepts uses a different set of **operations** than numerical values do, primarily "**is**" ("*sparrow* **is** a bird").

In addition, the numerical value of the error here has neither a physical nor a semantic dimension: the semantic dimension, when two concepts are compared, can be taken into account only by applying *operations on concepts* and **not on numbers**, which are used merely as **identifiers** of concepts. This can be illustrated by the numbering of houses: the proximity of house numbers may correspond to proximity in space, but not always - on a ring street, for example, the houses with the largest and smallest numbers may be neighbors. As a result, the **universal approximator theorem ceases to be valid**. What does this mean in practice? Situations are possible in which the **neural network evaluates an error as insignificant because of a small difference in numbers, while the semantic error is significant** - which is demonstrated by even the advanced versions of **Large Language Models**, regardless of the size of the neural network. Below are a couple of such errors, in which **ChatGPT** "substantiates" apparently false claims:

In reality, everything is precisely the opposite: *for any dimension of space, a sphere has a smaller volume than a cube of the same size, since all points of the sphere are inside the cube, while some points of the cube (the vertices and their neighborhoods) are outside the sphere (and the included formula confirms this)*; *horns and tree branches can branch only because the growing points are at the ends of the branches, not at the base.*

The logical path from violating implicit assumptions to "peculiar" results (like those above, or explanations of how to achieve a *negative friction coefficient*) is probably not apparent to developers who are familiar with the technique of using neural networks but not with classical applied mathematics. Therefore, attempts to achieve the impossible by manipulating numbers arithmetically instead of manipulating concepts with the appropriate operations will continue for some time.

Those whose goal is AGI with the full range of relevant abilities have the opportunity to spend their effort on something other than the Sisyphean work of teaching neural networks logic, and to use realistic ways to achieve the goal - in particular, the representation of concepts and of the relations between them based on semantic graphs: SEMANTIC STORAGE
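As a hedged sketch of this alternative, here is a minimal "is" relation over a toy semantic graph: concept identifiers are traversed and compared, never added or subtracted. The facts in the graph are invented for illustration:

```python
# Manipulating concepts with the "is" operation on a semantic graph,
# instead of arithmetic on the numbers that merely encode lexemes.
# The toy facts below are invented for illustration.
IS_A = {
    "sparrow": {"bird"},
    "bird": {"animal"},
    "piano": {"instrument"},
}

def is_a(concept, category):
    # follow "is" edges transitively: sparrow -> bird -> animal
    stack, seen = [concept], set()
    while stack:
        c = stack.pop()
        if c == category:
            return True
        if c in seen:
            continue
        seen.add(c)
        stack.extend(IS_A.get(c, ()))
    return False

print(is_a("sparrow", "animal"))  # True: sparrow is a bird, a bird is an animal
print(is_a("sparrow", "piano"))   # False: no "is" path exists
```

Note that the result of comparing two concepts here is only TRUE or FALSE - exactly the kind of answer that, as argued above, cannot be produced by arithmetic on encodings.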

I totally agree with the thrust of this article but wonder a bit about some of the details and the LLMs. I will look into measures of distance for vectors of binary or categorical (discrete) values.
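On that closing question about distance measures for binary or categorical vectors: a standard starting point is the Hamming distance, which counts mismatched positions without doing any arithmetic on the values themselves. A minimal sketch:

```python
# Hamming distance: the number of positions at which two equal-length
# vectors of binary or categorical values disagree - values are only
# tested for equality, never added or subtracted.
def hamming(u, v):
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    return sum(a != b for a, b in zip(u, v))

print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))      # 2
print(hamming(["red", "cat"], ["red", "dog"]))  # 1
```

This is consistent with the article's thesis: for discrete values, only equality comparisons are meaningful, and the distance aggregates those comparisons rather than numeric differences.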