URL: https://www.progressiverobot.com/ovis-u1-multimodal-alibaba/

Introduction

Much of the recent progress toward Artificial General Intelligence (AGI) and human-level task performance is driven by Multimodal Large Language Models (MLLMs). Combining multiple modalities allows for greater information density in inputs and enhanced capabilities during inference. We’ve covered several recent multimodal image generation models, including OmniGen2, BAGEL, Ming-lite-omni, and ICEdit. In this article, we cover Ovis-U1, an open-source 3-billion-parameter model released by the Alibaba Ovis team, with capabilities that span understanding multimodal inputs, generating images from text, and editing uploaded images.

Key Takeaways

  • Ovis-U1 is an open-source 3-billion-parameter multimodal LLM from Alibaba.
  • Capabilities include multimodal understanding, text-to-image generation, and image editing.
  • The model was trained on a diverse mix of datasets for various tasks (linked below).
  • The model can be run on a GPU cloud server or tested on HuggingFace Spaces.

Training Process

| Stage | Trained Parameters | Task | Steps / Batch Size / Learning Rate | Description |
|---|---|---|---|---|
| 0 | Refiner + Visual Decoder | Text-to-Image Generation | 500 / 1024 / 1e-4 | Visual decoder pretraining, starting with random initialization to develop basic image generation capabilities. The visual decoder and refiner generate images from LLM embeddings using text-to-image data. |
| 1 | Adapter | Understanding, Text-to-Image Generation, Image Editing | 1.5k / 8192 / 5e-4 | Adapter pretraining, aligning visual and textual embeddings. The adapter is randomly initialized and trained in this stage across understanding, text-to-image, and image editing tasks. |
| 2 | Visual Encoder + Adapter | Understanding, Text-to-Image Generation, Image Editing | 2.6k / 8192 / 1e-4 | Visual encoder alignment, where both the visual encoder and adapter are fine-tuned to further align visual and textual representations. All three task types are used for training, with the generation task assisting in embedding alignment. |
| 3 | Visual Encoder + Adapter + LLM | Understanding | 23 / 2240 / 5e-5 | Understanding learning, where parameters of the visual encoder, adapter, and LLM are trained on understanding tasks. These parameters are fixed after this stage to preserve understanding capability. |
| 4 | Refiner + Visual Decoder | Text-to-Image Generation | 275 / 256 / 5e-5 | Generation learning, training the refiner and visual decoder to align with optimized text and image embeddings after LLM parameters are tuned in Stage 3. This stage shows improved text-to-image performance. |
| 5 | Refiner + Visual Decoder | Text-to-Image Generation, Image Editing | 325 / 256 / 5e-5 | Generation fine-tuning, building on text-to-image capabilities by fine-tuning the decoder for both text-to-image and image editing tasks. |
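
The stages can be read as a freezing schedule: each stage updates only the listed modules and leaves the rest fixed. The sketch below encodes that schedule as plain Python data. The module keys (`visual_encoder`, `adapter`, `llm`, `refiner`, `visual_decoder`), the task labels, and the helper names are illustrative labels derived from the table, not identifiers from the Ovis-U1 codebase.

```python
from dataclasses import dataclass

# Module names as they appear in the training table (illustrative labels only).
ALL_MODULES = ("visual_encoder", "adapter", "llm", "refiner", "visual_decoder")


@dataclass(frozen=True)
class Stage:
    index: int
    trainable: tuple  # modules updated in this stage
    tasks: tuple      # task types used for training in this stage


STAGES = (
    Stage(0, ("refiner", "visual_decoder"), ("t2i",)),
    Stage(1, ("adapter",), ("understanding", "t2i", "editing")),
    Stage(2, ("visual_encoder", "adapter"), ("understanding", "t2i", "editing")),
    Stage(3, ("visual_encoder", "adapter", "llm"), ("understanding",)),
    Stage(4, ("refiner", "visual_decoder"), ("t2i",)),
    Stage(5, ("refiner", "visual_decoder"), ("t2i", "editing")),
)


def frozen_modules(stage):
    """Return the modules that are NOT updated in the given stage."""
    return tuple(m for m in ALL_MODULES if m not in stage.trainable)


def last_stage_training(module):
    """Return the index of the last stage that updates `module`."""
    return max(s.index for s in STAGES if module in s.trainable)


if __name__ == "__main__":
    for s in STAGES:
        print(f"stage {s.index}: train={s.trainable} frozen={frozen_modules(s)}")
```

Laid out this way, it is easy to see that the LLM is only updated in Stage 3, and that Stages 4 and 5 touch only the refiner and visual decoder, which is how understanding capability is preserved while generation is tuned.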

Data mix

Let’s take a look at the data used to train the model.

| Task | Datasets used | Additional information |
|---|---|---|
| Multimodal understanding | COYO, Wukong, Laion-5B, ShareGPT4V, CC3M | The researchers set up a data preprocessing pipeline that removes noisy data, improves caption quality, and balances data ratios to achieve the best training performance. |
| Text-to-Image Generation | Laion-5B, JourneyDB | From Laion-5B, the researchers first select samples with an aesthetic score exceeding 6. They then use the Qwen2-VL model to create detailed descriptions for each chosen image, forming the Laion-aes6 dataset. |
| Image+Text-to-Image Generation (Image Editing) | OmniEdit, UltraEdit, SeedEdit | Datasets used to improve the model's image editing capabilities. |
| Reference-Image-Driven Image Generation | Subjects200K, SynCD, StyleBooth | Subjects200K and SynCD were used to train for subject-driven image generation, and StyleBooth was used to train for style-driven image generation. |
| Pixel-Level Controlled Image Generation | MultiGen_20M | Used to facilitate canny-to-image (canny = edge detection), depth-to-image, inpainting, and outpainting. |
| In-House Data | | Additional datasets that incorporate style-driven data, content removal, style translation, de-noise/de-blur data, colourization data, text rendering data, etc. |
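
The Laion-aes6 construction described in the table (keep images with an aesthetic score exceeding 6, then recaption them) can be sketched as a simple filter-and-map step. This is a minimal illustration, not the researchers' pipeline: the record fields (`image_path`, `aesthetic_score`, `caption`), the function names, and the placeholder captioner standing in for Qwen2-VL are all assumptions.

```python
AESTHETIC_THRESHOLD = 6.0  # "exceeding 6", per the description in the table


def build_aes_subset(samples, recaption):
    """Keep samples scoring above the threshold and attach a fresh caption.

    `samples` is an iterable of dicts with hypothetical keys
    `image_path` and `aesthetic_score`; `recaption` is any callable
    that maps an image path to a caption string.
    """
    subset = []
    for sample in samples:
        if sample["aesthetic_score"] > AESTHETIC_THRESHOLD:
            subset.append({**sample, "caption": recaption(sample["image_path"])})
    return subset


def placeholder_captioner(image_path):
    # Stand-in for a vision-language captioner such as Qwen2-VL (assumption).
    return f"detailed description of {image_path}"


if __name__ == "__main__":
    toy_samples = [
        {"image_path": "a.jpg", "aesthetic_score": 6.4},
        {"image_path": "b.jpg", "aesthetic_score": 5.9},
        {"image_path": "c.jpg", "aesthetic_score": 6.0},
    ]
    for row in build_aes_subset(toy_samples, placeholder_captioner):
        print(row)
```

Note that the comparison is strict, so a sample scoring exactly 6 is dropped, matching the "exceeding 6" wording.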

What about RL?

In the paper's conclusion, the authors acknowledge "that Ovis-U1 currently lacks a reinforcement learning stage, which has proven crucial for large model optimization. Developing effective methods to align unified multimodal models with human preferences remains an important open research question in this domain." We recently covered MMaDA, which introduces UniGRPO, and are curious whether it could be applied here. Let us know what you think in the comments.

Now that we've gone over the model architecture and training process, let's run the model on a cloud GPU.

Implementation

Begin by spinning up a GPU Droplet. Once that’s completed, clone the repo and install the required packages using the following shell commands in the terminal. Alternatively, you can try out the model on HuggingFace Spaces (huggingface.co).

				
# Install git-lfs for handling large files and enable it for this user
apt update && apt install -y git-lfs
git lfs install

# Clone the Ovis-U1-3B repository from HuggingFace Spaces
# (plain `git clone` fetches LFS files once git-lfs is installed; `git lfs clone` is deprecated)
git clone https://huggingface.co/spaces/AIDC-AI/Ovis-U1-3B

# Change directory into the cloned repository
cd Ovis-U1-3B

# Install pip for Python package management
apt install -y python3-pip

# Install required Python packages from requirements.txt
pip install -r requirements.txt

# Install additional Python packages for wheel and spaces
pip install wheel spaces

# Install PyTorch with CUDA 12.8 support and upgrade existing installations
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 --upgrade

# Install xformers for optimized transformer operations
pip install -U xformers

# Install flash_attn for attention mechanism optimization
# (--no-build-isolation lets the build see the torch installed above)
pip install flash_attn==2.7.4.post1 --no-build-isolation

# Run the main application script
python app.py
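
If `app.py` fails at import time, it can help to confirm that the packages from the commands above actually landed in the active Python environment. The script below is a small optional check, not part of the Ovis-U1 repository; the `REQUIRED` list and the `missing_packages` helper are assumptions based on the install steps.

```python
import importlib.util

# Import names of the packages installed by the commands above (assumed list).
REQUIRED = ("torch", "torchvision", "torchaudio", "xformers", "flash_attn", "spaces")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        # Only import torch once we know it is installed.
        import torch
        print("All packages found. CUDA available:", torch.cuda.is_available())
```

Using `find_spec` checks for a package without importing it, so the check itself stays fast and does not trigger GPU initialization until the final step.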
				
			

Final Thoughts

We’re very excited about MLLMs. The datasets researchers choose, the architectural modifications they make, and how these translate into incremental improvements in capabilities are fascinating. We encourage you to test the model out. How are you using multimodal models, and what use cases do you care about?

Learn more about the cloud provider’s AI offerings. We have GPU Droplets you can spin up to train your models/run inference!