Comments
Dec 5, 2023

Right. So from the time dimension we get a classification of objects by how they change over time. That is why my network (the one we have been chatting about) uses time correlation as the foundation of classification: things that correlate in time get grouped together to form features (what you are calling objects).
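To make that idea concrete, here is a minimal sketch of what "group things that correlate in time" could look like. This is only my illustration under simple assumptions (Pearson correlation over fixed windows and a hard threshold), not a description of the actual network we've been discussing.

```python
# Minimal sketch: group raw sensory channels whose activity correlates over time
# into "features". Illustration only; thresholds and correlation measure are assumptions.
import numpy as np

def group_by_time_correlation(signals: np.ndarray, threshold: float = 0.8):
    """signals: array of shape (channels, timesteps). Returns groups of channel
    indices built by linking channels whose pairwise Pearson correlation over
    time exceeds the threshold."""
    corr = np.corrcoef(signals)                 # channels x channels correlation matrix
    n = signals.shape[0]
    visited, groups = set(), []
    for start in range(n):
        if start in visited:
            continue
        stack, group = [start], []
        while stack:                            # flood-fill over the correlation graph
            ch = stack.pop()
            if ch in visited:
                continue
            visited.add(ch)
            group.append(ch)
            stack.extend(j for j in range(n)
                         if j not in visited and corr[ch, j] >= threshold)
        groups.append(sorted(group))
    return groups

# Example: two channels driven by the same source end up in one group.
t = np.linspace(0, 10, 500)
signals = np.stack([np.sin(t), np.sin(t) + 0.1 * np.random.randn(500), np.cos(3 * t)])
print(group_by_time_correlation(signals))       # e.g. [[0, 1], [2]]
```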

How does your system make use of stereo vision, which also allows us to classify objects without motion? In theory (though not yet proven), my network based on statistical correlations will decode both stereo vision information and your relative-motion information to identify unique features.
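For reference, one textbook way stereo alone (no motion) yields a depth cue is block matching: for a patch in the left image, slide along the same row of the right image and take the shift (disparity) with the best normalized correlation. The sketch below is just that standard approach, not a claim about either of our systems; all names and parameters are illustrative.

```python
# Hedged sketch of stereo disparity by block matching / normalized cross-correlation.
import numpy as np

def patch_disparity(left: np.ndarray, right: np.ndarray,
                    row: int, col: int, size: int = 7, max_shift: int = 32) -> int:
    """Assumes rectified grayscale images and a (row, col) far enough from the borders."""
    half = size // 2
    ref = left[row - half:row + half + 1, col - half:col + half + 1].astype(float)
    ref = (ref - ref.mean()) / (ref.std() + 1e-9)
    best_shift, best_score = 0, -np.inf
    for d in range(max_shift):                       # candidate horizontal shifts
        c = col - d
        if c - half < 0:
            break
        cand = right[row - half:row + half + 1, c - half:c + half + 1].astype(float)
        cand = (cand - cand.mean()) / (cand.std() + 1e-9)
        score = float((ref * cand).mean())           # normalized cross-correlation
        if score > best_score:
            best_score, best_shift = score, d
    return best_shift                                # larger disparity => closer surface
```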

And here's a really tricky one for you. Even though we learn about objects in the real world through the relative motion you talk about, our perception network can look at a static picture and still parse what the objects are. There is no motion at all, because it's just a photo or painting or sketch, yet our subconscious perception system still parses it into objects and gets it mostly correct most of the time. If motion defines objects, how is our perception system able to parse a static image into objects correctly?

This would include, for example, a painting of a road with an alien animal on it that no one has ever seen before. Yet everyone could look at the painting (or a real photo) and see that there was a big animal standing in the middle of the road. Our perception system would parse the static image very accurately even though it had never seen any object of this type, and even though there was zero relative motion in the image to indicate where the object boundaries are.

It's because we don't just use motion to parse new objects. We use many different clues. We use stereo vision when we can see things in real life with two eyes. In a picture, we use shadows. We recognize the edges formed by the boundaries of color shapes. We use our understanding of gravity to assume the animal must be touching the ground and not floating in the air, which tells us where on the ground it must be "sitting"; given its location on the ground and our innate understanding of perspective, we get a good idea of its size. We use depth of field and focus to measure distance, so if a painting captures that focus variation, it tells us more about the location of objects in 3D space. We use our understanding of the other objects in the scene to establish context (how much room is there on a road for an animal to occupy that space). All of this information is parsed in milliseconds by our perception system, and we just "see" the object in the painting without having to wait for the brain to do a bunch of number crunching and without having to reason logically about what we are seeing with our language skills.

And if I see 100 photographs of some new object, my brain learns about it, even though I've never seen one in real life, and even if I've never seen a video of it, so I have no motion data from which to form the concept of this new object.

In order to use all these clues, and more that we don't even think about, you need a generic statistical approach to identifying "objects" in raw sensory data. And when you get the statistics right, you don't just get visual objects defined; you get auditory objects defined too (like words, a cat's meow, or a dog's bark), and these things end up with locations in 3D space even though they are just sounds. How do sounds get separated into "objects" and assigned a location in our 3D space map if we can't use the relative motion data you talked about in this article? (One standard statistical cue for the location part is sketched just below.)
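As one example of such a statistical cue, the brain is known to use the interaural time difference: the tiny lag between the two ears' signals constrains the direction of a sound source, and that lag can be found with a plain cross-correlation. The sketch below is only that textbook cue, offered as an illustration of a correlation-based approach, not the algorithm under discussion; the parameter names are my own.

```python
# Hedged sketch: estimate the lag (in samples) that best aligns the two ears' signals.
import numpy as np

def interaural_lag(left: np.ndarray, right: np.ndarray, max_lag: int = 40) -> int:
    """Return the lag maximizing the cross-correlation between left and right."""
    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = left[lag:], right[:len(right) - lag]
        else:
            a, b = left[:len(left) + lag], right[-lag:]
        score = float(np.dot(a, b))
        if score > best_score:
            best_score, best_lag = score, lag
    # A positive lag means the waveform appears in the right ear earlier,
    # i.e. the source is off to the right; negative means off to the left.
    return best_lag
```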

Either evolution created a million different "tricks" with different brain circuits to do all of this and more, or the real trick evolution came up with was one generic algorithm for doing all of it. I happen to believe the latter. I think my perception algorithm might be that trick (or very close to it, if not the whole thing), but I need to spend a lot of time studying what it really can and can't do to know the full extent of its power (or its total lack of power, if that's the case).

Dec 4, 2023

It might be better to think of the visual environment as a four-dimensional entity (adding time), with vision providing two 3D projections. This is a better concept because it allows the movement of the observer and of objects to be incorporated directly.
