Google Unveils ‘Vlogger’, an AI That Can Bring Still Photos to Life

Google researchers have developed a new artificial intelligence system that can generate lifelike videos of people speaking, gesturing and moving — from just a single still photo. The technology, called VLOGGER, relies on advanced machine learning models to synthesize startlingly realistic footage, opening up a range of potential applications while also raising concerns around deepfakes and misinformation.

The technology could change the way people interact with avatars and multimedia content. Google researchers recently introduced the VLOGGER AI model on a project page hosted on GitHub.

Described in a research paper titled “VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis,” the AI model can take a photo of a person and an audio clip as input, and then output a video that matches the audio, showing the person speaking the words and making corresponding facial expressions, head movements and hand gestures. The videos are not perfect, with some artifacts, but represent a significant leap in the ability to animate still images.

GENESIS OF VLOGGER AI

Google’s VLOGGER AI lets users turn a still image into a lifelike, controllable avatar. The model is built on the diffusion architecture, which is widely used for text-to-image, video, and 3D generation, and it adds further control mechanisms on top of that foundation.

The first of the model’s two networks takes an audio waveform and uses it to create “body motion controls” responsible for gaze, facial expression, and pose.

The second network is a “temporal image-to-image translation model that extends large image diffusion models, taking the predicted body controls to generate the corresponding frames.”
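For readers who think in code, the two-stage flow described above can be sketched roughly as follows. This is a toy illustration only, not Google’s implementation: every name in it (MotionControls, audio_to_motion, render_frames, generate_video) is hypothetical, the first stage is replaced by a simple loudness measure, and the second stage returns text labels where the real system would run a diffusion model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MotionControls:
    """Hypothetical per-frame controls: gaze, facial expression, and pose."""
    gaze: float
    expression: float
    pose: float


def audio_to_motion(waveform: List[float], samples_per_frame: int) -> List[MotionControls]:
    """Stage 1 stand-in: map each slice of audio to one set of body controls.

    The real first network is learned; here loudness drives the controls.
    """
    controls = []
    for start in range(0, len(waveform), samples_per_frame):
        chunk = waveform[start:start + samples_per_frame]
        energy = sum(abs(sample) for sample in chunk) / len(chunk)
        controls.append(MotionControls(gaze=0.0, expression=energy, pose=0.5 * energy))
    return controls


def render_frames(reference_image: str, controls: List[MotionControls]) -> List[str]:
    """Stage 2 stand-in: produce one frame per set of controls.

    The real second network generates images; here each frame is a label
    combining the reference photo with the controls that shaped it.
    """
    return [
        f"{reference_image}|expr={c.expression:.2f}|pose={c.pose:.2f}"
        for c in controls
    ]


def generate_video(reference_image: str, waveform: List[float],
                   samples_per_frame: int = 4) -> List[str]:
    """Chain the two stages: audio -> motion controls -> frames."""
    controls = audio_to_motion(waveform, samples_per_frame)
    return render_frames(reference_image, controls)


if __name__ == "__main__":
    # Four silent samples followed by four loud ones yield two frames.
    for frame in generate_video("photo.png", [0.0] * 4 + [1.0] * 4):
        print(frame)
```

The point of the sketch is the hand-off: the audio never reaches the image stage directly, only the predicted motion controls do.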

The AI model aims to function as an “embodied conversational agent” with audio and animated visuals that include realistic and complex facial expressions while demonstrating a high level of body motion.

VLOGGER is designed to “support natural conversations with a human user,” and it could be used for presentations, education, narration, and more.

The model can act as an AI agent that users can talk to, and it can also be used to edit existing videos.

UNVEILING THE LIMITATIONS OF VLOGGER

VLOGGER is a notable advance in AI technology, but it has limitations. As a research preview, it may not always replicate a person’s natural movements accurately, and it can struggle with large motions, diverse environments, and longer videos. These limitations show how much refinement the field still requires.

Google’s researchers envision a myriad of applications for VLOGGER AI. One of the primary use cases identified is its potential to revolutionize communication platforms like Teams or Slack. By enabling users to create animated avatars from still images, VLOGGER opens up new avenues for personalized and engaging interactions in virtual spaces.

Google views VLOGGER as a step towards a “universal chatbot” that lets AI engage with humans naturally through voice, gestures, and eye contact. Its possible applications extend to reporting, education, and narration. It can also edit existing videos, allowing users to modify a subject’s expressions if desired.

Mr. Ogonji is a highly professional and talented journalist with a solid experience in covering compelling stories, reporting facts, and engaging audiences. He is driven to uncover the truth behind today's most pressing issues and share stories that make a genuine impact.
