
line of thought was a fundamental change that led researchers to redirect their work, bringing attention to new domains and topics such as robotics, locomotion, artificial life, and bioinspired systems. The classical approach had not concerned itself with tasks involving interaction with the real world, and consequently this journey started with locomotion and grasping.

      Nowadays, a large part of these issues has been solved, and we can see extremely fast, smooth, and naturally moving robots capable of performing many types of maneuvers [28]. It is nevertheless foreseen that advances in artificial muscles, joints, and tendons will push this progress even further.

      In this section, we try to categorize the broad range of research that has been done in the field of embodied AI. Given the field's huge diversity, each subsection is necessarily abstract and selective and reflects the authors' personal views.

      3.3.1 Language Grounding

      Communication between humans and machines has always been a topic of interest. As more and more aspects of our lives come under the control of AIs, it is crucial to have ways to talk with them, whether to give them new instructions or to receive answers from them. Since we are talking about general day‐to‐day machines, we want this interface to be higher level than a programming language and closer to spoken language. To achieve this, machines must be capable of relating language to actions and to the world. Language grounding is the field that tackles this problem by mapping natural language instructions to robot behavior.

      Hermann et al. show that this can be achieved by rewarding an agent upon successful execution of written instructions in a 3D environment, using a combination of unsupervised learning and reinforcement learning [29]. They also argue that their agent generalizes well after training: it can interpret new, unseen instructions and operate in unfamiliar situations.
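
      To make this concrete, here is a minimal sketch of an instruction‐conditioned policy in this spirit. It is not Hermann et al.'s actual architecture; the PyTorch framing and all module names and sizes are illustrative assumptions. The grounding signal is simply that reward arrives only when the written instruction has been carried out.

```python
# Minimal sketch (not Hermann et al.'s architecture): a policy that fuses
# a language embedding with a visual observation. During RL training, the
# environment would grant reward only when the instruction is satisfied.
import torch
import torch.nn as nn

class GroundedPolicy(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, vis_dim=128, n_actions=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lang_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.vision = nn.Sequential(          # stand-in for a conv encoder
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(), nn.Flatten(),
            nn.LazyLinear(vis_dim), nn.ReLU(),
        )
        self.policy = nn.Linear(vis_dim + embed_dim, n_actions)

    def forward(self, frame, instruction_tokens):
        _, h = self.lang_rnn(self.embed(instruction_tokens))  # h: (1, B, E)
        v = self.vision(frame)
        return self.policy(torch.cat([v, h.squeeze(0)], dim=-1))  # logits

policy = GroundedPolicy()
logits = policy(torch.randn(1, 3, 84, 84), torch.randint(0, 1000, (1, 5)))
action = torch.distributions.Categorical(logits=logits).sample()
```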

      3.3.2 Language Plus Vision

      Now that we know machines can understand language, and sophisticated models exist for exactly this purpose [30], it is time to bring another sense into play. One of the most popular ways to demonstrate the potential of jointly training vision and language is image and video captioning [31, 35].
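
      To sketch what such joint training looks like, the toy captioner below conditions a recurrent language decoder on image features. It is an illustrative assumption rather than any specific model from the cited works, and all names and sizes are made up.

```python
# Minimal captioning sketch: the image embedding initializes the decoder's
# hidden state, and the decoder is trained to emit the caption tokens.
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.encode = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, image, caption_tokens):
        h0 = self.encode(image).unsqueeze(0)   # vision conditions language
        x, _ = self.rnn(self.embed(caption_tokens), h0)
        return self.out(x)                      # next-token logits

model = TinyCaptioner()
logits = model(torch.randn(2, 3, 32, 32), torch.randint(0, 1000, (2, 7)))
print(logits.shape)  # (2, 7, 1000): one token distribution per position
```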

      Following this research, Singh et al. [36] cleverly added an optical character recognition (OCR) module to a visual question answering (VQA) model, enabling the agent to read the text present in an image as well, and to answer questions about that text or use it as additional context to answer other questions better.
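
      The key idea can be sketched as follows: the answer space becomes the union of a fixed answer vocabulary and the OCR tokens detected in the image, so the model can "copy" a word it has read as its answer. This is a minimal sketch inspired by, not reproducing, the model of Singh et al.; all names and dimensions are assumptions.

```python
# Minimal OCR-aware VQA sketch: score fixed-vocabulary answers and detected
# OCR tokens jointly, so a word read from the image can be the answer.
import torch
import torch.nn as nn

class OCRAwareVQA(nn.Module):
    def __init__(self, q_dim=64, img_dim=128, ocr_dim=64, vocab_answers=100):
        super().__init__()
        self.fuse = nn.Linear(q_dim + img_dim, ocr_dim)
        self.vocab_head = nn.Linear(ocr_dim, vocab_answers)

    def forward(self, q_feat, img_feat, ocr_feats):
        joint = torch.relu(self.fuse(torch.cat([q_feat, img_feat], -1)))
        vocab_scores = self.vocab_head(joint)                # fixed answers
        copy_scores = torch.bmm(ocr_feats, joint.unsqueeze(-1)).squeeze(-1)
        # Final answer = argmax over [fixed vocabulary | OCR tokens].
        return torch.cat([vocab_scores, copy_scores], dim=-1)

model = OCRAwareVQA()
scores = model(torch.randn(2, 64), torch.randn(2, 128), torch.randn(2, 10, 64))
print(scores.shape)  # (2, 110): 100 vocabulary answers + 10 read tokens
```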

      One may ask where this new task stands relative to the previous one: are agents that answer questions more intelligent than those that produce captions? The answer is yes. In [17], the authors show that VQA agents need a deeper, more detailed understanding of the image, and more reasoning, than captioning models do.

      3.3.3 Embodied Visual Recognition

      Passive or fixed agents may fail to recognize objects in scenes when those objects are partially or heavily occluded. Embodiment comes to the rescue here: it grants the agent the ability to move through the environment and actively control its viewing position and angle, removing ambiguity in object shape and semantics.

      Jayaraman and Grauman [37] set out to learn representations that exploit the link between how an agent moves and how its visual surroundings change. To do this, they used raw unlabeled video together with an external GPS sensor providing the agent's coordinates, and trained their model to learn a representation linking the two. With such a representation, the agent can predict the outcome of its future actions and guess how the scene would look after it moves forward or turns to one side.
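
      A minimal sketch of this idea, in the spirit of (but not reproducing) their formulation, is to demand that a motion‐conditioned transform carry the features of one frame onto the features of the next; the sizes, names, and loss below are illustrative assumptions.

```python
# Egomotion-equivariant feature learning sketch: moving in the world should
# move the learned features in a predictable way.
import torch
import torch.nn as nn

feat_dim, motion_dim = 64, 3   # motion: e.g. (dx, dy, dtheta) from GPS/odometry

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, feat_dim))
motion_to_map = nn.Linear(motion_dim, feat_dim * feat_dim)  # motion -> matrix

frame_t, frame_t1 = torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)
motion = torch.randn(8, motion_dim)

z_t, z_t1 = encoder(frame_t), encoder(frame_t1)
M = motion_to_map(motion).view(-1, feat_dim, feat_dim)
z_pred = torch.bmm(M, z_t.unsqueeze(-1)).squeeze(-1)  # "imagined" next feature

loss = nn.functional.mse_loss(z_pred, z_t1)  # equivariance objective
loss.backward()
```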

      This was powerful; in a sense, the agent developed imagination. There was still an issue, however: the agent is fed prerecorded video as input, so it learns like the passive observer kitten in the kitten carousel experiment described above. The authors therefore tackled this problem next and proposed training an agent that takes any given object viewed from an arbitrary angle and predicts, or better, imagines, the other views by learning the representation in a self‐supervised manner [38].

      Up to this point, the agent makes no use of the sound of its surroundings, whereas humans experience the world in a multisensory manner: we see, hear, smell, and touch all at the same time, extracting and using whatever information is relevant to the task at hand. That said, understanding and learning the sounds of the objects present in a scene is not easy, since all the sounds overlap and arrive through a single‐channel sensor. This is usually treated as an audio source separation problem, on which a large body of work exists in the literature [39, 43].
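
      A common minimal formulation, sketched below under illustrative assumptions (it follows the standard mask‐based framing rather than any specific cited model), predicts a per‐source mask over the mixture spectrogram and recovers each source by masking.

```python
# Mask-based audio source separation sketch: a network predicts one mask
# per source over the mixture spectrogram; masking recovers the sources.
import torch
import torch.nn as nn

n_freq, n_frames, n_sources = 257, 100, 2

mask_net = nn.Sequential(
    nn.Linear(n_freq, 256), nn.ReLU(),
    nn.Linear(256, n_freq * n_sources),
)

mixture = torch.rand(1, n_frames, n_freq)   # |STFT| of the single-channel mix
masks = torch.softmax(
    mask_net(mixture).view(1, n_frames, n_freq, n_sources), dim=-1
)  # masks sum to 1 over sources at every time-frequency bin

separated = mixture.unsqueeze(-1) * masks   # (1, T, F, n_sources)
print(separated.shape)
```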

      Results show that such policies indeed help the agent achieve better visual recognition performance, and that agents can strategize their future moves and paths for better results; these paths are mostly different from the shortest paths [51].

      3.3.4 Embodied Question Answering

      Embodied Question Answering brings QA into the embodied world. The task starts with an agent being spawned at a random location in a 3D environment and asked a question whose answer can be found somewhere in that environment. To answer it, the agent must first navigate strategically to explore the environment, gather the necessary information through its vision, and then answer the question once it has found the evidence [52, 53].
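
      The overall loop can be sketched as below; `DummyEnv`, `nav_policy`, and `answerer` are hypothetical stand‐ins, not the architecture of the cited works.

```python
# Embodied QA loop sketch: a question-conditioned navigation policy moves
# the agent until it stops, then a VQA head answers from the gathered views.
def embodied_qa_episode(env, nav_policy, answerer, question, max_steps=100):
    frame = env.reset()                  # agent spawned at a random location
    frames = [frame]
    for _ in range(max_steps):
        action = nav_policy(frame, question)  # e.g. forward / turn / STOP
        if action == "STOP":
            break
        frame = env.step(action)         # move and observe a new viewpoint
        frames.append(frame)
    return answerer(frames, question)    # answer from the gathered views

class DummyEnv:                          # hypothetical stand-in environment
    def reset(self): return "frame0"
    def step(self, action): return "frame"

print(embodied_qa_episode(DummyEnv(), lambda f, q: "STOP",
                          lambda fs, q: "kitchen", "Where is the piano?"))
```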

      Following this, Das et al. [54] presented a modular approach that further enhances this process by teaching the agent to break the master policy into subgoals, which are also interpretable by humans, and to execute them in order to answer the question. This was shown to increase the success rate.
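
      In sketch form, the decomposition might look like the following; the subgoal names and the hand‐coded plan are illustrative assumptions standing in for learned components.

```python
# Hierarchical policy sketch: a master policy emits human-readable subgoals,
# and dedicated sub-policies execute each one in turn.
SUBGOAL_POLICIES = {
    "find_room":   lambda arg: f"navigate to {arg}",
    "find_object": lambda arg: f"look for {arg}",
    "answer":      lambda arg: f"answer: {arg}",
}

def master_policy(question):
    # A learned controller would produce this; one hand-written plan here.
    return [("find_room", "kitchen"), ("find_object", "piano"),
            ("answer", "yes")]

for subgoal, arg in master_policy("Is there a piano in the kitchen?"):
    print(SUBGOAL_POLICIES[subgoal](arg))   # interpretable, inspectable steps
```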

      3.3.5 Interactive Question Answering

      Interactive Question Answering (IQA) is closely related to its embodied counterpart. The main difference is that the question is designed so that the agent must interact with the environment to find the answer; for example, it has to open the refrigerator or pick something up from a cabinet, and must therefore plan a series of actions conditioned on the question [55].
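
      The difference from the embodied QA loop above can be sketched by enlarging the action space with interactions; the action names and the hand‐written plan below are illustrative stand‐ins for a learned planner, not the method of [55].

```python
# IQA sketch: the plan now mixes navigation with manipulation actions,
# all conditioned on the question.
NAV_ACTIONS = ["forward", "turn_left", "turn_right"]
INTERACT_ACTIONS = ["open", "close", "pick_up", "put_down"]

def plan(question):
    # A learned planner would produce this; a hand-written stub for clarity.
    if "refrigerator" in question:
        return ["forward", "forward", "open", "answer"]
    return ["forward", "answer"]

print(plan("Is there milk in the refrigerator?"))
```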

      3.3.6 Multi‐agent Systems