Google has unveiled Lumiere, the state-of-the-art in realistic text-to-video generative AI. The software greatly improves motion realism by using a novel approach to video frame generation that creates all the frames in a single pass, mitigating motion errors.
Generative image AI creates images from text. One key enabler is the huge number of images and videos available online for training. Another is the development of methods that associate all the words in a language with each other through vectors. With these, an AI can learn that, as a pair of words or within a sentence, “I am” is more likely than “I unilaterally”. Image-creation AI such as Stable Diffusion then associates words with images of objects, so it understands that the words “royal residence” are more closely associated with a “castle” image than a “house” image.
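To see the word-vector idea in action, here is a minimal Python sketch using made-up 4-dimensional embeddings; real models learn vectors with hundreds of dimensions from data rather than hand-picked values.

```python
import numpy as np

# Hypothetical embeddings for illustration only; real models learn
# these vectors from massive text corpora.
embeddings = {
    "royal residence": np.array([0.90, 0.80, 0.10, 0.00]),
    "castle":          np.array([0.85, 0.75, 0.20, 0.10]),
    "house":           np.array([0.20, 0.10, 0.90, 0.30]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: closer to 1.0 means more closely related.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embeddings["royal residence"]
for word in ("castle", "house"):
    print(word, round(cosine_similarity(query, embeddings[word]), 3))
```

Running this prints a much higher score for “castle” than for “house”, mirroring how such a model decides that a castle image is the better match for “royal residence”.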
Generative video AI extends image AI to create videos from text. Lumiere’s competitors first create keyframes, then the frames in between. This is like a master animator drawing the beginning and end images of a basketball shot, then having an assistant draw the images in between. The issue is that motion errors often occur because the in-between images aren’t drawn consistently, so Lumiere bypasses this by creating all video frames at once, without keyframing. Lumiere is also trained on what moving objects look like at various image scales, which makes its videos look superior.
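The difference can be shown with a toy, runnable example that treats a “video” as nothing more than a ball’s position over time; both functions below are illustrative stand-ins, not Lumiere’s actual pipeline.

```python
import numpy as np

def keyframe_pipeline(num_frames: int, key_every: int = 8) -> np.ndarray:
    # Competitor approach: generate sparse keyframes, then fill each
    # gap independently (here with linear interpolation), so the
    # in-between motion can deviate from the true trajectory.
    t_keys = np.arange(0, num_frames, key_every)
    keyframes = np.sin(t_keys / 5.0)      # "true" positions at keyframes
    return np.interp(np.arange(num_frames), t_keys, keyframes)

def single_pass_pipeline(num_frames: int) -> np.ndarray:
    # Lumiere-style approach: produce every frame in one pass, so the
    # whole trajectory is modeled jointly instead of patched together.
    return np.sin(np.arange(num_frames) / 5.0)

frames = 32
error = np.abs(keyframe_pipeline(frames) - single_pass_pipeline(frames))
print(f"worst in-between motion error from keyframing: {error.max():.3f}")
```

The interpolated in-between frames drift noticeably from the true motion, which is exactly the kind of artifact single-pass generation avoids.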
Technically, Lumiere uses diffusion probabilistic models to generate images, coupled with a Space-Time U-Net: a U-Net architecture that adds temporal up- and down-scaling plus attention blocks to the usual image resolution scaling. Down-scaling temporally at the same time as spatially significantly reduces the computational workload, while up-scaling coupled with a temporally-aware spatial super-resolution model generates the high-resolution output. Still, the video must be processed in frame segments due to memory limitations, so MultiDiffusion is applied across overlapping frame-segment boundaries to help mitigate temporal motion artifacts.
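For a concrete picture of both ideas, here is a minimal PyTorch sketch assuming 5-D video tensors of shape (batch, channels, time, height, width); the layer sizes and the blending helper are illustrative assumptions, not Lumiere’s published implementation.

```python
import torch
import torch.nn as nn

class SpaceTimeDownsample(nn.Module):
    """Downsample a video jointly in time and space, as in a Space-Time
    U-Net, so deeper layers operate on a much smaller representation."""
    def __init__(self, channels: int):
        super().__init__()
        # stride (2, 2, 2) halves the temporal and both spatial axes at once
        self.conv = nn.Conv3d(channels, channels, kernel_size=3,
                              stride=(2, 2, 2), padding=1)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        return self.conv(video)

def blend_overlapping_segments(segments: list, overlap: int) -> torch.Tensor:
    """MultiDiffusion-style merging: average the frames where consecutive
    temporal segments overlap, smoothing the seams between them."""
    merged = segments[0]
    for seg in segments[1:]:
        seam = (merged[..., -overlap:, :, :] + seg[..., :overlap, :, :]) / 2
        merged = torch.cat([merged[..., :-overlap, :, :],
                            seam,
                            seg[..., overlap:, :, :]], dim=2)  # dim 2 = time
    return merged

video = torch.randn(1, 8, 16, 64, 64)        # (batch, C, T, H, W)
print(SpaceTimeDownsample(8)(video).shape)   # -> [1, 8, 8, 32, 32]

segments = [torch.randn(1, 3, 16, 64, 64) for _ in range(3)]
print(blend_overlapping_segments(segments, overlap=4).shape)  # -> [1, 3, 40, 64, 64]
```

Halving time and both spatial axes together shrinks the tensor roughly eightfold per block, which is where the computational savings come from; the overlap averaging is what keeps motion from visibly jumping at segment seams.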
Lumiere can be coupled with other AI to create a broader range of output. This includes:
- Cinemagraphs – a single region of a still image is animated
- Inpainting – one object in a video is replaced by another
- Stylized generation – the appearance is re-created in another art style
- Image-to-video – a desired image is animated
- Video-to-video – videos are re-created in another art style
The video length is limited to 5 seconds, while the ability to create video transitions and multiple camera angles is non-existent. Readers interested in experimenting with generative AI on their desktop computers should upgrade to a powerful video card (like this at Amazon) for the best performance during training.
Having worked at Activision, UCLA, Anime Expo and more, I’ve seen technology being used to save lives, create games, and create fantastic 3D VR/AR worlds. There’s always something fun in emerging technology that I want to get my hands on and all my friends turn to me to find the best for their needs, so I’m glad to bring my experience to Notebookcheck.