May 25, 2022

Depending on the theory of intelligence you subscribe to, reaching human-level AI requires a system that can use multiple modalities, such as sound, image, and text, to reason about the world. For example, when shown an image of an overturned truck and a police car on an icy highway, a human-level AI might deduce that dangerous road conditions caused the accident. Or, running on a robot, when asked to fetch a can of soda from the refrigerator, it would navigate around people, furniture and pets to retrieve the can and place it within reach of the person who asked.

Today’s AI falls short. But new research points to encouraging progress, from robots that can work out the steps needed to follow basic commands (like “bring a bottle of water”) to text-producing systems that learn from explanations. In this updated edition of Deep Science, our weekly series on the latest advances in artificial intelligence and the wider scientific field, we cover work from DeepMind, Google, and OpenAI on systems that, if not able to fully understand the world, can solve narrow problems like generating images with impressive robustness.

DALL-E 2, an enhancement of OpenAI’s DALL-E, is arguably the most influential project to emerge from the AI research lab. As my colleague Devin Coldewey writes, while the original DALL-E showed a remarkable ability to create images that fit almost any prompt (for example, “a dog in a beret”), DALL-E 2 goes even further. The images it creates are much more detailed, and DALL-E 2 can intelligently replace a specific area of an image, such as inserting a table into a photo of a marble floor, complete with appropriate reflections.


An example of the types of images that DALL-E 2 can generate.

DALL-E 2 has attracted the most attention this week. But on Thursday, Google researchers described an equally impressive visual understanding system called Visually-Driven Prosody for Text-to-Speech (VDTTS) in a post published on the Google AI blog. VDTTS can generate realistic-sounding, lip-synced speech using only text and video frames of the person speaking.

The speech generated by VDTTS, while not a perfect stand-in for recorded dialogue, is still quite good, with convincingly human-like expressiveness and timing. Google envisions it one day being used in a studio to replace original audio that may have been recorded in noisy conditions.

Visual understanding is, of course, only one step towards a more capable AI. Another component is language understanding, which lags behind in many ways, even aside from the well-documented issues of toxicity and bias in AI. A stark example: a sophisticated system from Google, the Pathways Language Model (PaLM), memorized 40% of the data used to “train” it, according to a paper, resulting in PaLM plagiarizing text down to the copyright notices in code snippets.

Luckily, DeepMind, the AI lab backed by Alphabet, is among those exploring ways to combat this. In a new study, DeepMind researchers investigate whether AI language systems, which learn to generate text from many examples of existing text (think books and social networks), could benefit from being given explanations of those texts. After annotating dozens of language tasks (e.g., “Answer these questions by determining whether the second sentence is an appropriate paraphrase of the first, metaphorical sentence”) with explanations (e.g., “David’s eyes were not literally daggers; it is a metaphor meaning that David was glaring at Paul.”) and evaluating the performance of various systems on them, the DeepMind team found that the explanations did improve the systems’ performance.

If the DeepMind approach holds up in the academic community, it could one day be applied to robotics, forming the building blocks of a robot that can understand vague requests (like “take out the trash”) without step-by-step instructions. Google’s new project, “Do As I Can, Not As I Say,” offers a glimpse into this future, albeit with significant limitations.

A collaboration between Robotics at Google and the Everyday Robotics team at Alphabet’s X lab, the project seeks to condition an AI language system to propose actions that are “feasible” and “contextually appropriate” for a robot, given an arbitrary task. The robot serves as the “hands and eyes” of the language system, while the system provides high-level semantic knowledge about the task. The theory is that the language system encodes a wealth of knowledge useful to the robot.

Image credit: Robotics at Google

A system called SayCan chooses which skill the robot should perform in response to a command, taking into account (1) the likelihood that a particular skill will be useful, and (2) the likelihood that the robot can successfully perform that skill. For example, if someone says, “I spilled a Coke, could you bring me something to clean it up?” SayCan can instruct the robot to find a sponge, pick up the sponge and bring it to the person who asked for it.
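To make the two-factor selection concrete, here is a minimal sketch in Python. It assumes the two estimates are combined by simple multiplication and that the highest product wins. The class and function names, the candidate skills and every number are hypothetical illustrations, not values or code from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkillEstimate:
    """One candidate skill with the two estimates described above (all values hypothetical)."""
    name: str
    p_useful: float   # assumed: language system's estimate that the skill helps with the command
    p_success: float  # assumed: estimate that the robot can carry out the skill right now


def combined_score(skill: SkillEstimate) -> float:
    # Assumption for this sketch: the two factors are combined by multiplication,
    # so a skill must be both relevant and feasible to score well.
    return skill.p_useful * skill.p_success


def choose_skill(skills: list[SkillEstimate]) -> SkillEstimate:
    """Return the candidate with the highest combined score."""
    if not skills:
        raise ValueError("no candidate skills to choose from")
    return max(skills, key=combined_score)


# Made-up estimates for the spilled-Coke request.
candidates = [
    SkillEstimate("find a sponge", p_useful=0.60, p_success=0.90),
    SkillEstimate("find a vacuum", p_useful=0.30, p_success=0.10),
    SkillEstimate("go to the table", p_useful=0.10, p_success=0.95),
]

best = choose_skill(candidates)
print(best.name)
```

With these made-up numbers the sponge wins because it is both likely to help and likely to be within the robot's abilities. A skill that is easy to perform but irrelevant, or relevant but infeasible, is pulled down by whichever factor is low.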

SayCan is limited by robotics hardware: on more than one occasion, the research team observed the robot they chose for their experiments accidentally dropping objects. Still, SayCan, along with DALL-E 2 and DeepMind’s work on contextual understanding, shows how AI systems, when combined, can bring us closer to a Jetsons-type future.
