As a natural multimodal content, audible video delivers an immersive sensory experience. Consequently, audio-video generation systems have substantial potential. However, existing diffusion-based studies mainly employ relatively independent modules for generating each modality, which lack exploration of shared-weight generative modules. This approach may under-use the intrinsic correlations between audio and visual modalities, potentially resulting in sub-optimal generation quality. To address this, we propose UniForm, a unified diffusion transformer designed to enhance cross-modal consistency. By concatenating auditory and visual information, UniForm learns to generate audio and video simultaneously within a unified latent space, facilitating the creation of high-quality and well-aligned audio-visual pairs. Extensive experiments demonstrate the superior performance of our method in joint audio-video generation, audio-guided video generation, and video-guided audio generation tasks.
We present generation results from our proposed UniForm across multiple audio-video synthesis conditions. Each set demonstrates three key capabilities under identical labels or captions: text-to-audible video (T2AV), audio-to-video (A2V), and video-to-audio (V2A).
The video captures a dramatic scene of a volcanic eruption. The volcano, located in the center of the frame, is spewing a large amount of lava and ash into the sky. The lava is bright orange and appears to be in the process of solidifying, creating a fiery spectacle. The surrounding area is shrouded in a layer of smoke, adding to the intensity of the scene. The sky above is a dark gray, contrasting with the bright colors of the lava and ash. The overall atmosphere is one of raw power and natural beauty.
The image captures a serene scene of a backyard setting. Dominating the scene is a blue plastic pool, filled with water, where three ducks are enjoying a swim. The pool is surrounded by a lush green lawn, providing a stark contrast to the blue of the pool. In the background, a blue plastic container can be seen, possibly used for storing pool accessories. A wooden birdhouse is also present, adding to the charm of the backyard. The ducks, with their white feathers, are the main focus of the image. They are swimming in the pool, their heads submerged in the water, seemingly enjoying the coolness. The image exudes a sense of tranquility and harmony with nature.
The video captures a vibrant scene of a yellow and black frog in its natural habitat. The frog, with its distinctive yellow and black markings, is perched on a rock, its body facing the camera. It appears to be in motion, possibly walking or jumping, as suggested by the blurred background. The rock on which the frog sits is surrounded by a lush green moss, adding a touch of color to the scene. The frog’s position on the rock and its interaction with the moss suggest it is in a natural environment, possibly a forest or a jungle. The video is a snapshot of a moment in the life of this beautiful creature, providing a glimpse into its behavior and habitat.
The video shows a young child sitting in a high chair, smiling and laughing. The child appears to be a toddler, with light-colored hair and is wearing a brown shirt. The high chair is white and has a tray in front of the child. The background is blurred but suggests an indoor setting with kitchen appliances and cabinets visible. The child’s joyful expression and the candid nature of the shot convey a sense of warmth and happiness.
The video captures a serene tropical beach scene. The beach is lined with tall, lush green palm trees that sway gently in the breeze. The water is a vibrant blue, reflecting the clear sky above. The sandy beach is dotted with a few scattered trees and rocks, adding to the natural beauty of the scene. The perspective of the video is from the water, looking towards the shore, giving a sense of being in the middle of the ocean. The overall atmosphere is one of tranquility and natural beauty.
The video captures a vibrant scene of a red passenger train journeying along a track. The train, adorned with a black roof and white lettering, is composed of multiple cars, each one a testament to the journey it embarks on. The train is moving from the left to the right of the frame, its journey marked by the lush green grass that lines the track. The sky above is a clear blue, dotted with fluffy white clouds, adding to the serene atmosphere of the scene. The image is a beautiful blend of man-made marvel and natural beauty.
The video features a single gray parrot perched on a white object, which appears to be a piece of furniture with a wooden frame. The parrot is facing the camera, and its feathers are a mix of gray and white, with a darker gray head and a lighter gray body. The bird’s beak is black, and it has a small red patch on its tail. The background is a plain, light-colored wall, and there is a wooden door frame visible to the right of the parrot. The lighting in the room is soft and diffused, suggesting an indoor setting. There are no visible texts or other objects in the image. The style of the video is a straightforward, unembellished photograph with no additional effects or filters applied.
The image captures a serene moment in nature, featuring a bird feeder filled with a variety of birdseed. The feeder is situated in a tranquil setting, surrounded by trees and a small pond. The birds, a mix of species, are seen perched on the feeder, pecking at the seeds. The colors of the birds vary, adding to the diversity of the scene. The sky above is overcast, casting a soft light over the entire scene. The image is taken from a distance, allowing for a comprehensive view of the feeder and its surroundings. The overall atmosphere is peaceful and natural, showcasing the beauty of wildlife in its natural habitat.
The image captures a serene moment in nature, featuring a small bird perched on a branch of a pine tree. The bird, with its brown and white plumage, is facing the camera, its beak open as if it’s singing or communicating with its surroundings. The pine tree, adorned with needles, stands tall against the backdrop of a clear blue sky. The bird’s position on the branch suggests it’s in the middle of the tree, possibly enjoying the warmth of the sun. The overall scene is peaceful and harmonious, a snapshot of life in the wild.
The video shows a man sitting in a room, deeply engrossed in playing a guitar. He is wearing a black jacket and has a serious expression on his face. The guitar he is playing has a wooden body and a black pickguard. The room has a wooden wall and a door, and the man is seated on a chair. The lighting in the room is dim, creating a cozy atmosphere. The man’s fingers are moving over the strings of the guitar, indicating that he is playing a song. The overall scene suggests a quiet, intimate moment of music-making.
The image is a screenshot from a video game, depicting a female character riding a horse in a virtual environment. The character is wearing a pink shirt and a purple helmet, and she is holding a riding crop in her right hand. The horse is white with a black mane and tail. The rider is in motion, as indicated by the blurred background and the motion-blur effect on the horse’s legs. The environment is a fenced-in area with trees and a clear sky. There are several poles and flags visible, suggesting a course or obstacle for the rider to navigate. The game interface is visible at the bottom of the screen, showing the character’s health and stamina bars, as well as a timer that reads “0:00” and “0:20,” indicating the current time and the elapsed time in the game. The style of the image is reminiscent of early 3D graphics, with a limited color palette and a pixelated quality. The overall impression is that of a retro or early 2000s video game.
The video captures a moment of a large military helicopter in flight. The helicopter, painted in a dark color, is equipped with a rotor system at the top, which is in motion, indicating that the helicopter is either taking off or landing. The helicopter is positioned in the center of the frame, with the background featuring a runway and a line of trees. The sky above is overcast, suggesting an overcast day. The helicopter appears to be in motion, as suggested by the blurred background and the motion of the rotor system. The overall scene suggests a military operation or training exercise.
The video captures a serene scene of a man enjoying a boat ride on a calm lake. The man, dressed in a black shirt and blue jeans, is seated in the back of a small white boat. The boat is moving away from the camera, creating a sense of motion. The lake is surrounded by lush green trees and hills, adding to the tranquility of the scene. The sky above is clear, suggesting a sunny day. The man appears to be alone in the boat, adding to the peacefulness of the moment. The overall scene is a beautiful representation of a leisurely day out on the water.
The video is a close-up shot of a firework display. The firework is in the process of exploding, with bright, fiery colors and sparks visible against a dark background. The firework is located in the center of the frame, and the camera is positioned at a low angle, looking up towards the firework. The surrounding area is dark, emphasizing the brightness of the firework. There are no visible texts or distinctive marks that provide additional context or information about the location or event. The style of the video is realistic, capturing the dynamic motion and vibrant colors of the firework.
The video shows a single chicken with a distinctive red comb and wattle, sitting in a wooden coop. The chicken appears to be resting or sleeping, with its head tucked under its wing. The coop has a wooden floor and walls, and there is a small amount of straw on the floor. The lighting in the coop is warm, suggesting it might be indoors or in a sheltered area. The chicken’s feathers are a mix of dark and light colors, typical of many chicken breeds. There are no visible texts or other objects in the image. The style of the video is a straightforward, unembellished capture of a single moment in the life of the chicken.
The image captures a dramatic moment of a rocket launch. The rocket, with its long, slender body and pointed nose, is seen ascending into the sky, leaving behind a trail of smoke. The sky is a deep, inky black, providing a stark contrast to the rocket’s white trail. The rocket is positioned in the center of the image, drawing the viewer’s attention immediately. The smoke trail extends upwards and to the right, indicating the rocket’s trajectory. The background is a dark, starry sky, adding to the sense of depth and vastness. The image is a still from a video, as indicated by the play button overlay in the bottom right corner. The overall scene is a testament to human ingenuity and the power of rocket technology.