Microsoft VASA-1: Windows maker Microsoft, in a research blog post, detailed an AI model called Visual Affective Skills Audio (VASA-1), which can generate lifelike talking faces from a single image and an audio clip. Apart from lip movement cleanly synced to the audio, it can also capture a wide spectrum of facial cues and head movements to convey liveliness and authenticity. It supports generating videos of up to one minute at 512×512 resolution and up to 45 frames per second.

Although generation has an initial latency of 170ms, the model opens up a range of possibilities for emulating conversational behaviour through lifelike AI avatars. Microsoft shared a range of sample outputs, animated from portrait images generated using StyleGAN2 or DALL·E-3, and said they are only a demonstration for research purposes.

“Our method is capable of not only producing precise lip-audio synchronisation, but also generating a large spectrum of expressive facial nuances and natural head motions,” Microsoft said. It uses audio …
