JAKARTA - Google DeepMind has introduced a new AI tool for generating video soundtracks. The tool does not rely on a text prompt alone to produce audio; it also takes the content of the video into account.

According to DeepMind, by combining these two inputs, users can create scenes with "a dramatic score, realistic sound effects, or dialogue that matches the characters and tone of a video." Several examples on DeepMind's website show fairly convincing audio results.

For example, for a video of a car driving through a cyberpunk-style city, Google used the prompt "cars skidding, car engine throttling, angelic electronic music" to produce the audio. The sound of the skidding tires is synchronized with the motion of the car. Another example creates an underwater soundscape using the prompt "jellyfish pulsating under water, marine life, ocean."

Users can include a text prompt, but DeepMind says it is optional. Users also do not need to manually align the generated audio with the corresponding scenes. According to DeepMind, the tool can generate an "unlimited number" of soundtracks for a video, giving users an endless stream of audio options.

This could set it apart from other AI tools, such as ElevenLabs' sound effects generator, which relies on text prompts alone to produce audio. The tool could also make it easier to pair audio with AI-generated video from tools such as DeepMind's Veo and OpenAI's Sora (the latter is expected to incorporate audio in the future).

DeepMind says it trained the tool on video, audio, and annotations that contain "detailed descriptions of sound and transcripts of spoken dialogue." This allows the video-to-audio generator to match audio events with visual scenes.

The tool still has some limitations. For example, DeepMind is working to improve its ability to synchronize lip movements with dialogue, a weakness visible in a video of a claymation family. DeepMind also notes that the video-to-audio system depends on the quality of the input video, so blurry or distorted footage "can lead to a noticeable drop in audio quality."

The DeepMind tool is not yet generally available because it must first undergo "rigorous safety assessments and testing." When it is released, its audio output will include Google's SynthID watermark to signal that it was generated by AI.

