Microsoft VASA

Microsoft Shows Off Generative AI Model That Makes Deepfake Videos From Still Photos

A team of AI researchers at Microsoft Research Asia has demonstrated a new generative AI application capable of producing a deepfake video of a person from just a still image. The new VASA-1 model creates an animation synced to an audio track that accurately portrays the individual speaking or singing, complete with appropriate facial expressions.


VASA-1 is named for its ability to create videos with visual affective skills (VAS). The researchers set out to animate still images with realistic facial expressions that synchronize seamlessly with a provided audio track. VASA-1 was trained on thousands of images depicting a wide range of facial expressions, and after extensive experimentation the team produced animations synced well enough to fool casual viewers. The researchers did acknowledge, however, that closer inspection reveals imperfections that betray the videos as artificially generated.

Still, the effectiveness of VASA-1 is evident in the video samples shared by the research team, which show it animating diverse subjects, including photographs of real people, cartoon characters, and even hand-drawn images. In each case, the facial expressions change dynamically in step with the spoken or sung words, enhancing the overall believability of the animation.

“Our premiere model, VASA-1, is capable of not only producing lip movements that are exquisitely synchronized with the audio, but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness. The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos,” the researchers explained in their paper. “It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.”
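The pipeline the researchers describe, a fixed identity latent taken from the still image, plus a sequence of facial-dynamics and head-motion latents generated from the audio in a disentangled face latent space, can be sketched conceptually. This is not Microsoft's code; every function name, weight, and dimension below is a hypothetical stand-in (random matrices in place of the trained neural networks), purely to illustrate the data flow.

```python
import numpy as np

LATENT = 64          # hypothetical size of the face latent space
FRAMES_PER_SEC = 25  # hypothetical output frame rate

rng = np.random.default_rng(0)
# Placeholder "weights"; in the real system these are trained neural networks.
W_id = rng.standard_normal((3, LATENT))        # image -> identity latent
W_audio = rng.standard_normal((80, LATENT))    # audio features -> motion latent
W_dec = rng.standard_normal((2 * LATENT, 16))  # latents -> rendered frame

def encode_identity(image):
    """Collapse the single still photo (H, W, 3) into one identity latent."""
    return image.reshape(-1, 3).mean(axis=0) @ W_id

def audio_to_motion(mel_frames):
    """Map per-frame audio features (T, 80) to facial-dynamics/head-motion
    latents (T, LATENT) -- the disentangled 'motion' half of the latent space."""
    return np.tanh(mel_frames @ W_audio)

def decode_frames(identity, motions):
    """Combine the fixed identity with each motion latent to render frames."""
    z = np.concatenate(
        [np.repeat(identity[None], len(motions), axis=0), motions], axis=1
    )
    return z @ W_dec  # (T, 16): a tiny stand-in for rendered pixels

photo = rng.random((64, 64, 3))             # the single input photo
mel = rng.random((2 * FRAMES_PER_SEC, 80))  # two seconds of audio features
video = decode_frames(encode_identity(photo), audio_to_motion(mel))
print(video.shape)  # one output "frame" per audio frame
```

The key design point the quote highlights is the disentanglement: because identity and motion live in separate parts of the latent space, the same audio-driven motion sequence can animate any face, which is what lets one model handle photos, cartoons, and drawings alike.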

VASA-1 holds potential applications in generating lifelike avatars for gaming and simulations, especially if combined with other deepfake technology such as the VALL-E synthetic voice-cloning model Microsoft introduced last year. The research team cautioned against deploying VASA-1 prematurely, however, citing concerns about potential misuse and ethical implications. As a result, the system is not currently available for general use, reflecting the researchers’ commitment to responsible AI development.
