Right. So from the time dimension we get classification of objects by how they change over time. That's why my network (the one we have been chatting about) uses time correlation as the key foundation of its classification: things that correlate in time get grouped together to form features (what you are calling objects).
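To make that concrete, here is a toy sketch of the grouping idea as I understand it (my illustration, not the actual network): pixels whose intensity time-series correlate strongly get assigned to the same "feature".

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200                      # number of time steps
motion = rng.normal(size=T)  # shared signal driven by a moving "object"

# Four pixel time-series: two follow the object, two are background noise.
pixels = np.stack([
    motion + 0.1 * rng.normal(size=T),   # pixel 0: on the object
    motion + 0.1 * rng.normal(size=T),   # pixel 1: on the object
    rng.normal(size=T),                  # pixel 2: background
    rng.normal(size=T),                  # pixel 3: background
])

corr = np.corrcoef(pixels)               # pairwise temporal correlation

# Greedy grouping: a pixel joins a group if it correlates with every member.
THRESH = 0.5
groups = []
for i in range(len(pixels)):
    for g in groups:
        if all(corr[i, j] > THRESH for j in g):
            g.append(i)
            break
    else:
        groups.append([i])

print(groups)   # pixels 0 and 1 group together; 2 and 3 stay separate
```

The threshold and the greedy grouping rule are arbitrary choices for the sketch; the real network presumably does something far more refined.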
How does your system make use of stereo vision, which also lets us classify objects without motion? In theory, though not yet proven, my network based on statistical correlations will decode both stereo vision information and your relative-motion information to identify unique features.
And here's a really tricky one for you. Even though we learn about objects in the real world through the relative motion you talk about, our perception network can look at a static picture and still parse what the objects are. There is no motion at all, because it's just a photo or painting or sketch, yet our subconscious perception system still parses it into objects and gets it mostly correct most of the time. If motion defines objects, how is our perception system able to parse a static image into objects correctly?
This would include, for example, a painting of a scene of a road with an alien animal on it that no one has ever seen before. Yet everyone could look at the painting (or a real photo) and see that there was a big animal standing in the middle of the road. Our perception system would parse the static image very accurately even though it had never seen any object of this type, and even though there was zero relative motion in the image to indicate where the object boundaries are.
It's because we don't just use motion to parse new objects; we use many different clues. We use stereo vision when we see things in real life with two eyes. In a picture, we use shadows. We recognize the edges formed by the boundaries of color shapes. We use our understanding of gravity to assume the animal must be touching the ground, not floating in the air, which tells us where on the ground it must be "sitting"; given its location on the ground and our innate understanding of perspective, we get a good idea of its size. We use depth of field and focus to measure distance, so if a painting captures the focus variation, that tells us more about the location of objects in 3D space. We use our understanding of the other objects in the scene to establish context (how much room is there on a road for an animal to occupy that space?). All of this information is parsed in milliseconds by our perception system, and we just "see" the object in the painting without having to wait for the brain to do a bunch of number crunching, and without having to logically reason about what we are seeing with our language skills.
And if I see 100 photographs of some new object, my brain learns about the object, even though I've never seen one in real life, and even if I've never seen a video of it, so I have no motion data from which to form the concept of this new object.
In order to use all these clues, and more that we don't think about, you need a generic statistical approach to the identification of "objects" in raw sensory data. And when you get the statistics correct, you don't just get visual objects being defined; you get auditory objects defined too (like words, or a cat's meow, or a dog's bark), and these things end up with locations in 3D space even though they are just sounds. How do sounds get separated into "objects" and assigned a location in our 3D space map if we can't use the relative motion data you talked about in this article?
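As an aside, one purely statistical cue for a sound's location that needs no motion data at all is the tiny arrival-time lag between the two ears. A toy cross-correlation sketch (my illustration, not anything from the article or the network):

```python
import numpy as np

rng = np.random.default_rng(1)
fs = 44100                         # sample rate (Hz)
sound = rng.normal(size=fs // 10)  # 100 ms of a noisy sound source

true_lag = 13                      # samples: the right ear hears it later
left = sound
right = np.concatenate([np.zeros(true_lag), sound])[: len(sound)]

# Cross-correlate the two ear signals over a range of candidate lags
# and pick the lag with the strongest match.
lags = np.arange(-30, 31)
scores = [np.dot(left[max(0, -k): len(left) - max(0, k)],
                 right[max(0, k): len(right) - max(0, -k)]) for k in lags]
best = int(lags[int(np.argmax(scores))])
print(best)  # prints 13: the 13-sample delay is recovered
```

The recovered lag maps to a direction; biological systems are thought to use exactly this kind of interaural time difference, and it is the sort of correlation a generic statistical algorithm could discover.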
Either we have a million different evolved "tricks", created by evolution with different brain circuits, to do all of this and more, or the real trick evolution came up with was one generic algorithm for doing all of it. I happen to believe the latter is the truth. I think my perception algorithm might be that trick (or very close to it, if not the full thing), but I need to spend a lot of time doing the work to study what it really can and can't do, to know the full extent of its power (or total lack of power, if that's the case).
Explanations:
[1] The described approach can be used regardless of whether it is a human that uses it.
[2] Statistical (or pseudo-statistical) methods based on discovering sensory input features do not make it possible to separate an unknown object from the rest of the observed scene. The detected features are actually features of the SCENE, not features of the particular object.
Detecting a particular OBJECT's features is possible only after that object has been separated from the rest of the visible scene, so the statistical approach leads to a logical loop: to separate a particular object we need to learn the object's features, but to do that we need to separate the object from the rest.
[3] Once an object's separation is done, the object's features can be collected, remembered, and used to detect such objects as ALREADY KNOWN ones, with "classic" methods working on a single image.
Your comments gave me a good idea for my network. One hurdle that has slowed me down was the fear that if I expanded the networks and threw lots of complex real-world data at them (like a video stream), I would have a very hard time telling what my networks were doing with the data. But I just realized that, for video, there's a simple test I can do. For each output feature "pixel" my network produces, I can back-calculate where the data came from and show a heat map of which input pixels contributed to each output feature, and what percentage each contributed. So for any one output feature signal from my network, I can make a modified version of the output video that masks out the input pixel data assigned to that feature, and I can repeat this for all the features.
In my network, every input pixel will be mapped to one or more output feature pixels, so it is doing the separation you mentioned. I don't just pick out "objects"; it classifies every pixel of every input as being part of some "feature" in the output set.
But the above testing will produce a simple visual representation of which features the network has extracted and what each feature represents on the screen. This will make it very easy for me to see whether it's doing something useful, like identifying motion, or lines, or "cats", etc., or whether it's just gibberish!
And if it works as well as I hope, it will be a great demonstration of the powers of the network. Well, now I just have to find the time to do the coding!
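The back-calculation test described above could be sketched roughly like this, assuming (purely for illustration, since the real network is unspecified here) that the input-to-feature mapping can be treated as a weight matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_feat = 16, 4               # a flattened 4x4 input "image", 4 features
W = rng.normal(size=(n_feat, n_in))  # stand-in for the learned mapping
frame = rng.uniform(size=n_in)       # one input video frame, flattened

for f in range(n_feat):
    contrib = np.abs(W[f] * frame)           # per-pixel contribution to feature f
    percent = 100 * contrib / contrib.sum()  # percentage heat map (sums to 100)
    heat_map = percent.reshape(4, 4)         # displayable 4x4 heat map
    mask = percent > percent.mean()          # pixels "assigned" to feature f
    masked_frame = frame * mask              # masked version of the input frame
```

In practice each `heat_map` would be rendered per video frame, one modified video per feature, which is exactly the "is it gibberish or not" check described above.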
It might be better to think of the visual environment as a 4-dimensional entity (adding time), of which vision provides two 3D projections. This is a better concept because it allows the movement of the observer and of objects to be directly incorporated.
This may be useful on a "philosophical" level, but on a technological level it becomes difficult. For example, distance in 2D or 3D can be calculated as the length of the corresponding vector, but in 4D this is not possible, since meters cannot be added to seconds.
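The unit problem can be shown in two lines: a 4D "length" only becomes computable after choosing some conversion speed to turn seconds into meters, and that choice (here an arbitrary, assumed 10 m/s) changes the answer.

```python
import math

dx, dy, dz = 3.0, 4.0, 0.0   # spatial displacement in meters
dt = 0.5                     # temporal displacement in seconds

d3 = math.sqrt(dx**2 + dy**2 + dz**2)   # 5.0 m, well defined

SCALE = 10.0                 # m/s: an arbitrary choice, not given by the data
d4 = math.sqrt(dx**2 + dy**2 + dz**2 + (SCALE * dt)**2)
print(d3, d4)                # prints 5.0 and sqrt(50) ≈ 7.071
```

With SCALE = 1.0 the same points would be ~5.025 apart, so the 4D metric is only as meaningful as the chosen scale factor.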