Google has unveiled Gemini Omni at its I/O developer conference — a new multimodal model family that can combine text, images, audio, and video to generate high-quality video output. CEO Sundar Pichai said the model represents the next step toward AI that can create anything from any input. The first model, Omni Flash, is available today in the Gemini app, YouTube Shorts, and Google's Flow video editor.
What Gemini Omni Does
Gemini Omni is not a simple video generator. It reasons across multiple input types simultaneously. Users can feed it a combination of images, audio clips, text prompts, and existing video. Rather than stitching those inputs together mechanically, the model understands relationships between them and produces coherent output that reflects physics, culture, history, and science.
Google DeepMind's chief technologist demonstrated the capability with a prompt for "a claymation explainer of protein folding." Omni Flash rendered a stop-motion video complete with a voice-over accurately explaining amino acid chains, alpha helices, and beta sheets. The model understood both the visual style requested and the scientific content needed.
The model also handles photo editing through plain text commands — similar to Google's Nano Banana image model. Users can describe what they want changed. The AI makes the edit. No Photoshop skills required.
Digital Avatars With Deepfake Protection
Omni introduces personal digital avatars. Users record themselves speaking a series of numbers during onboarding. The system creates a digital version that can be placed in AI-generated videos. Want to make a video of yourself winning an award or standing on the moon? Omni generates it with your avatar.
To prevent misuse, all Omni-generated videos include Google's SynthID watermark. The watermark lets anyone verify whether a video was created with AI tools. The protection matters given the growing concern about deepfakes and the recent Oscar ban on AI-generated performances.
More Than a Veo Upgrade
Google already has Veo, a dedicated video generation model. Omni is architecturally different. DeepMind's director of product management said this is not a Veo update but the next step toward combining Gemini's intelligence with the rendering capabilities of Google's media models.
The distinction matters. Veo generates video from text and images. Omni reasons across all input types. The long-term vision includes generating images from audio, audio from video, and any combination of outputs from any combination of inputs. Video is just the starting point.
Consumer First, Enterprise Later
Google is positioning Omni Flash as a consumer tool initially. The examples center on personal use — vacation videos, personalized memes, social media content. Omni Flash renders up to 10 seconds of video. Longer durations are coming.
The enterprise and creative implications are obvious. An end-to-end multimodal workflow could transform advertising, filmmaking, and content production. Startup Luma AI is building something similar with agentic tools for ad campaigns. ComfyUI targets professional creators who want granular control. Google's Omni targets everyone else.
An API will be available to developers in the coming weeks. The Omni Pro model, expected to deliver a step change above Flash, will follow once Google is satisfied with its quality gap over Flash.
Part of the Gemini Everywhere Strategy
Omni is the latest addition to Google's comprehensive AI rollout. Gemini now powers Chrome, Maps, Workspace, YouTube, Google TV, millions of cars, Photos, Gboard, Android widgets, and Search. Adding video generation to that stack gives Google a multimodal creative capability that no other platform matches at scale.
Pichai framed the vision in terms of world models. AI is moving from predicting text to simulating reality. Gemini Omni is the next step in that direction.
What It Means
Gemini Omni represents Google's most ambitious AI model launch since the original Gemini release three years ago. If the model delivers on its promise — coherent video from any combination of inputs, with digital avatars and deepfake protection built in — it could define the next generation of AI creative tools.
Data on AI image tools already suggests that visual AI drives consumer adoption 6.5x more than text model upgrades. Omni extends that advantage from images to video. For the AI industry, the message from Google I/O is clear: the future of AI is not just about understanding language. It is about creating reality.