URL: https://www.progressiverobot.com/vision-language-action-finetuning-robotics/
Introduction
Vision-Language-Action (VLA) models represent a breakthrough in embodied AI, combining visual perception, language understanding, and robotic control into unified models that can follow natural language instructions to perform physical tasks. This tutorial covers the complete process of fine-tuning VLA models for your specific robotic applications.
What are Vision-Language-Action Models?
VLAs combine computer vision (interpreting images/videos), natural language processing (understanding and generating text), and action execution (interacting with environments or systems). This allows them to perceive, reason, and act based on both visual and textual inputs.
If you want to see VLAs in action, check out the Google DeepMind Robotics Lab Tour with Hannah Fry.
Popular VLA architectures include OpenVLA, RT-2 (Robotic Transformer 2), and PaLM-E.
Key Takeaways
- VLA models unify vision, language, and action by mapping camera images and natural language instructions directly to robot actions in an end-to-end architecture
- Use LoRA for efficient fine-tuning, training only 0.1-0.2% of parameters instead of the full model, enabling training on a single GPU with 24GB VRAM
Prerequisites
Hardware Requirements
- GPU with at least 24GB VRAM (RTX 3090/4090, A100, or H100)
- Robot hardware or simulation environment
- Camera(s) for visual input when running inference
Step 0: Set up GPU and Jupyter Notebook
Set up a GPU cloud server, then connect to it from your terminal with ssh root@[IPV4].
In Terminal:
# Core dependencies
pip install torch torchvision
pip install transformers accelerate
pip install datasets peft tqdm # peft and tqdm are used in the LoRA and training steps
pip install wandb # for experiment tracking
pip install jupyterlab
jupyter lab
Step 1: Understanding the VLA Architecture
Key players:
- Vision Encoder: Processes camera images into visual embeddings
- Language Encoder: Converts text instructions into language embeddings
- Fusion Module: Combines visual and language information
- Action Decoder: Predicts robot actions from fused representations
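To make the data flow between these four components concrete, here is a toy NumPy sketch with random weights. It only illustrates how shapes move through the pipeline; the function names, dimensions, and single-layer "encoders" are assumptions made for illustration, and real VLAs use large transformer backbones instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def vision_encoder(image, w):
    """Toy vision encoder: flatten the image and project it to an embedding."""
    return image.reshape(-1) @ w

def language_encoder(token_ids, table):
    """Toy language encoder: average the embeddings of the instruction tokens."""
    return table[token_ids].mean(axis=0)

def fusion(vis, lang, w):
    """Toy fusion module: concatenate both embeddings and mix them with one layer."""
    return np.tanh(np.concatenate([vis, lang]) @ w)

def action_decoder(fused, w):
    """Toy action decoder: map the fused vector to a continuous action."""
    return fused @ w

embed_dim, action_dim, vocab_size = 32, 7, 100
image = rng.random((16, 16, 3))       # stand-in for a camera frame
token_ids = np.array([5, 17, 42])     # stand-in for a tokenized instruction

# Random weights: this sketch is untrained and only demonstrates shapes
w_vis = rng.normal(size=(16 * 16 * 3, embed_dim)) * 0.01
table = rng.normal(size=(vocab_size, embed_dim))
w_fuse = rng.normal(size=(2 * embed_dim, embed_dim)) * 0.1
w_act = rng.normal(size=(embed_dim, action_dim)) * 0.1

vis = vision_encoder(image, w_vis)
lang = language_encoder(token_ids, table)
action = action_decoder(fusion(vis, lang, w_fuse), w_act)
print(vis.shape, lang.shape, action.shape)
```

The final vector has one entry per action dimension, for example six arm degrees of freedom plus a gripper command.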
Step 2: Preparing Your Dataset
Data Collection
Your dataset should contain episodes with:
- Images: RGB camera observations at each timestep
- Language: Natural language task descriptions
- Actions: Robot action sequences (joint positions, velocities, or end-effector poses)
- Metadata: Success labels, episode length, etc.
Data Format Example
{
    "episode_0": {
        "images": [img_0, img_1, ..., img_T], # shape: (T, H, W, 3) where T is total frames
        "language": "pick up the red block",
        "actions": [act_0, act_1, ..., act_T], # shape: (T, action_dim)
        "success": True
    }
}
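Before training, it can help to check that every episode matches this layout. The following is a minimal sketch of such a check; `validate_episode` is a hypothetical helper, not part of any library, and it assumes episodes are dicts with the keys shown above.

```python
import numpy as np

def validate_episode(episode):
    """Return a list of problems found in one episode dict (empty if it looks fine)."""
    problems = []
    images = np.asarray(episode["images"])
    actions = np.asarray(episode["actions"])
    if images.ndim != 4 or images.shape[-1] != 3:
        problems.append(f"images should be (T, H, W, 3), got {images.shape}")
    if actions.ndim != 2:
        problems.append(f"actions should be (T, action_dim), got {actions.shape}")
    if images.shape[0] != actions.shape[0]:
        problems.append("images and actions have different lengths")
    language = episode.get("language")
    if not isinstance(language, str) or not language.strip():
        problems.append("language must be a non-empty string")
    if np.issubdtype(actions.dtype, np.number) and not np.isfinite(actions).all():
        problems.append("actions contain NaN or infinite values")
    return problems

good = {
    "images": np.zeros((5, 8, 8, 3), dtype=np.uint8),
    "language": "pick up the red block",
    "actions": np.zeros((5, 7)),
    "success": True,
}
bad = dict(good, actions=np.zeros((4, 7)))  # one action is missing

print(validate_episode(good))
print(validate_episode(bad))
```

Running a check like this over the whole dataset before flattening it tends to surface dropped frames and logging errors early.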
Dataset Organization
from datasets import Dataset
import numpy as np

def create_vla_dataset(episodes):
    """Flatten robot episodes into a per-timestep HuggingFace dataset"""
    data = {
        "images": [],
        "language": [],
        "actions": [],
        "episode_id": []
    }
    for ep_id, episode in enumerate(episodes):
        for t in range(len(episode['images'])):
            data['images'].append(episode['images'][t])
            data['language'].append(episode['language'])
            data['actions'].append(episode['actions'][t])
            data['episode_id'].append(ep_id)
    return Dataset.from_dict(data)

# Split into train/val
# Note: this splits individual timesteps, so frames from one episode can end up in both splits
dataset = create_vla_dataset(your_episodes)  # your_episodes: list of episode dicts as shown above
dataset = dataset.train_test_split(test_size=0.1)
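Because the dataset above stores one row per timestep, a random row-level split can place frames from the same episode in both train and validation sets. One way to avoid that is to split by episode first. The sketch below is one possible approach; `split_episode_ids` is a hypothetical helper name.

```python
import random

def split_episode_ids(num_episodes, val_fraction=0.1, seed=0):
    """Shuffle episode indices and hold out a fraction of whole episodes."""
    ids = list(range(num_episodes))
    random.Random(seed).shuffle(ids)
    n_val = max(1, int(round(num_episodes * val_fraction)))
    return sorted(ids[n_val:]), sorted(ids[:n_val])

train_ids, val_ids = split_episode_ids(num_episodes=20, val_fraction=0.1)
print(len(train_ids), len(val_ids))
```

The resulting ID sets can then be used to select rows by their `episode_id` column, for example with the dataset's `filter` method.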
Step 3: Setting Up the Base Model
Loading a Pre-trained VLA
from transformers import AutoModelForVision2Seq, AutoProcessor
import torch
# Load pre-trained VLA (example using OpenVLA)
# OpenVLA ships custom model code on the Hub, so trust_remote_code=True is required
model_name = "openvla/openvla-7b"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
# Freeze base model parameters (optional for LoRA, which freezes them automatically)
for param in model.parameters():
    param.requires_grad = False
Action Space Adaptation
Your robot's action space likely differs from the pre-training data:
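from torch import nn

class ActionHead(nn.Module):
    """Small MLP that maps a hidden state to a continuous action vector"""
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, action_dim)
        )

    def forward(self, x):
        return self.fc(x)

# Add custom action head
# Assumption: some VLA configs store the language model's hidden size under config.text_config
hidden_dim = getattr(model.config, "hidden_size", None) or model.config.text_config.hidden_size
model.action_head = ActionHead(
    hidden_dim=hidden_dim,
    action_dim=7 # e.g., 6-DOF arm + gripper
).to(device=model.device, dtype=torch.bfloat16)  # match the base model's device and dtype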
Step 4: Efficient Fine-Tuning with LoRA
Low-Rank Adaptation (LoRA) enables efficient fine-tuning by reducing the number of trainable parameters:
from peft import LoraConfig, get_peft_model
# Configure LoRA
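lora_config = LoraConfig(
    r=16, # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    modules_to_save=["action_head"], # keep the new action head trainable; PEFT freezes everything else
    task_type="CAUSAL_LM"
)
# Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. With this config on a 7B Llama-style
# backbone, the LoRA adapters account for roughly 0.2% of all parameters; the exact
# numbers depend on the model.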
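The trainable fraction can be estimated by hand. The sketch below counts only the LoRA adapter weights and assumes a Llama-2-7B-style language backbone (32 layers, hidden size 4096, square q/k/v/o projections). Those shape values are assumptions, so adjust them for your model; any target modules matched in other parts of the network would add to the total.

```python
def lora_param_count(d_in, d_out, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) next to a frozen d_out x d_in weight."""
    return rank * d_in + d_out * rank

# Assumed backbone shape, not read from any checkpoint
num_layers, hidden, rank = 32, 4096, 16
targets_per_layer = 4  # q_proj, k_proj, v_proj, o_proj

adapter_params = num_layers * targets_per_layer * lora_param_count(hidden, hidden, rank)
total_params = 7_000_000_000  # rough size of a 7B model
print(f"{adapter_params:,} adapter params ({100 * adapter_params / total_params:.2f}% of 7B)")
```

Halving the rank or targeting only two projection types halves the adapter size, which is one way the fraction moves within the 0.1-0.2% range.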
Step 5: Data Processing and Augmentation
Image Preprocessing
from torchvision import transforms
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    # Avoid RandomHorizontalFlip here: mirroring the image without also mirroring the
    # actions (and "left"/"right" in the instruction) makes the labels inconsistent
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
Action Normalization
import numpy as np

def normalize_actions(actions, stats):
    """Standardize actions to zero mean and unit variance per dimension"""
    return (actions - stats['mean']) / (stats['std'] + 1e-8)

def denormalize_actions(normalized_actions, stats):
    """Convert back to original action space"""
    return normalized_actions * (stats['std'] + 1e-8) + stats['mean']

# Calculate statistics from your dataset
# A dataset column is not a NumPy array, so convert it before calling mean/std
train_actions = np.asarray(dataset['train']['actions'], dtype=np.float32)
action_stats = {
    'mean': train_actions.mean(axis=0),
    'std': train_actions.std(axis=0)
}
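Standardizing by mean and standard deviation does not bound the values. If you need actions strictly inside [-1, 1], a percentile-based rescaling is one alternative. The sketch below is a hedged illustration of that idea; the helper names and the 1st/99th percentile choice are assumptions, not something prescribed by this tutorial.

```python
import numpy as np

def fit_bounds(actions, low_q=1.0, high_q=99.0):
    """Per-dimension percentile bounds, less sensitive to outliers than min/max."""
    return {
        "low": np.percentile(actions, low_q, axis=0),
        "high": np.percentile(actions, high_q, axis=0),
    }

def to_unit_range(actions, bounds):
    """Map [low, high] linearly onto [-1, 1] and clip anything outside."""
    span = np.maximum(bounds["high"] - bounds["low"], 1e-8)
    scaled = 2.0 * (actions - bounds["low"]) / span - 1.0
    return np.clip(scaled, -1.0, 1.0)

def from_unit_range(scaled, bounds):
    """Inverse mapping from [-1, 1] back to the original action units."""
    span = np.maximum(bounds["high"] - bounds["low"], 1e-8)
    return (scaled + 1.0) / 2.0 * span + bounds["low"]

# Synthetic two-dimensional actions with very different scales
rng = np.random.default_rng(0)
demo_actions = rng.normal(loc=[0.0, 5.0], scale=[1.0, 0.1], size=(1000, 2))

bounds = fit_bounds(demo_actions)
scaled = to_unit_range(demo_actions, bounds)
restored = from_unit_range(scaled, bounds)
print(scaled.min(), scaled.max())
```

Values outside the percentile bounds are clipped, so the inverse mapping is exact only for actions that fell inside them.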
DataLoader Setup
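import numpy as np
import torch
from PIL import Image
from torch.utils.data import DataLoader

# Left-pad text so the last sequence position is always a real token
processor.tokenizer.padding_side = "left"

def collate_fn(batch):
    """Custom collate function for VLA data"""
    # The dataset stores frames as nested lists, so rebuild uint8 images first
    images = [Image.fromarray(np.asarray(b['images'], dtype=np.uint8)) for b in batch]
    pixel_values = processor.image_processor(images, return_tensors="pt")['pixel_values']
    text = processor.tokenizer([b['language'] for b in batch],
                               padding=True, return_tensors="pt")
    # Train on normalized targets so that denormalize_actions applies at inference time
    actions = np.asarray([b['actions'] for b in batch], dtype=np.float32)
    actions = torch.from_numpy(normalize_actions(actions, action_stats).astype(np.float32))
    return {
        'pixel_values': pixel_values,
        'input_ids': text['input_ids'],
        'attention_mask': text['attention_mask'],
        'actions': actions
    }

train_loader = DataLoader(
    dataset['train'],
    batch_size=32,
    shuffle=True,
    collate_fn=collate_fn,
    num_workers=4
)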
Step 6: Training Loop
Loss Function
import torch.nn.functional as F

def vla_loss(predicted_actions, target_actions, reduction='mean'):
    """Action prediction loss (MSE for continuous actions)"""
    mse_loss = F.mse_loss(predicted_actions, target_actions, reduction=reduction)
    return mse_loss

# Alternative: Action chunking for temporal coherence
def chunked_action_loss(pred_chunks, target_chunks, chunk_size=10):
    """Predict multiple future actions at once"""
    loss = 0
    for i in range(chunk_size):
        loss += F.mse_loss(pred_chunks[:, i], target_chunks[:, i])
    return loss / chunk_size
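The chunked loss above expects targets of shape (batch, chunk_size, action_dim), but the dataset stores one action per timestep. One possible way to build chunk targets from an episode is sketched below. `make_action_chunks` is a hypothetical helper, and repeating the final action as padding is an assumption; masking the padded steps is another common choice.

```python
import numpy as np

def make_action_chunks(actions, chunk_size):
    """For each timestep t, stack actions t..t+chunk_size-1, repeating the last action as padding."""
    actions = np.asarray(actions)
    num_steps = actions.shape[0]
    # Index matrix of shape (T, chunk_size), clamped to the final timestep
    offsets = np.arange(num_steps)[:, None] + np.arange(chunk_size)[None, :]
    idx = np.minimum(offsets, num_steps - 1)
    return actions[idx]

episode_actions = np.arange(5, dtype=np.float32).reshape(5, 1)  # 5 steps, 1-D action
chunks = make_action_chunks(episode_actions, chunk_size=3)
print(chunks.shape)      # (5, 3, 1)
print(chunks[3, :, 0])   # [3. 4. 4.]
```

Chunks must be built per episode, before flattening, so that a chunk never runs across an episode boundary.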
Training Script
from accelerate import Accelerator
from tqdm import tqdm
import wandb

# Initialize accelerator for distributed training
accelerator = Accelerator(mixed_precision='bf16')

# Setup
num_epochs = 10
# Optimize only the parameters that are actually trainable
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4, weight_decay=0.01)
# Anneal over the full run instead of a fixed step count
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs * len(train_loader)
)

# Prepare for distributed training
model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)

# Initialize wandb
wandb.init(project="vla-finetuning", config={
    "learning_rate": 1e-4,
    "batch_size": 32,
    "epochs": num_epochs
})

# Training loop
global_step = 0
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}"):
        # Forward pass (request hidden states so the action head can read them)
        outputs = model(
            pixel_values=batch['pixel_values'],
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            output_hidden_states=True
        )
        # Predict actions from the final layer's hidden state at the last token
        predicted_actions = model.action_head(outputs.hidden_states[-1][:, -1])
        # Compute loss in float32 to avoid a bfloat16/float32 dtype mismatch
        loss = vla_loss(predicted_actions.float(), batch['actions'].float())
        # Backward pass
        accelerator.backward(loss)
        accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        # Logging
        epoch_loss += loss.item()
        global_step += 1
        if global_step % 100 == 0:
            wandb.log({
                "loss": loss.item(),
                "learning_rate": scheduler.get_last_lr()[0],
                "epoch": epoch
            })
    print(f"Epoch {epoch+1} - Average Loss: {epoch_loss / len(train_loader):.4f}")
    # Save checkpoint
    if (epoch + 1) % 2 == 0:
        accelerator.save_model(model, f"checkpoints/epoch_{epoch+1}")
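The training script anneals the learning rate with CosineAnnealingLR. To see what that schedule does, the closed-form cosine curve can be evaluated directly. The helper below is a standalone sketch with a hypothetical name, assuming a base rate of 1e-4 and a minimum of 0, which mirrors the values used above.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Closed-form cosine annealing from base_lr at step 0 down to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_steps = 10_000
for step in (0, 2_500, 5_000, 10_000):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

The rate starts at the base value, reaches half of it at the midpoint, and decays to the minimum at the end, which is why the schedule length should match the number of optimizer steps you actually run.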
Step 7: Evaluation
Simulation Evaluation
from PIL import Image

def evaluate_in_simulation(model, env, num_episodes=50, max_steps=500):
    """Evaluate model in simulation environment"""
    model.eval()
    success_count = 0
    with torch.no_grad():
        for episode in range(num_episodes):
            obs = env.reset()
            instruction = env.get_task_instruction()
            done = False
            info = {}
            steps = 0
            # Cap the episode length so a stuck policy cannot loop forever
            while not done and steps < max_steps:
                # Process observation (assumes obs['image'] is an HxWx3 uint8 array)
                image = processor.image_processor(
                    Image.fromarray(obs['image']), return_tensors="pt"
                )['pixel_values']
                text_inputs = processor.tokenizer(instruction, return_tensors="pt")
                # Predict action
                outputs = model(
                    pixel_values=image.to(model.device, dtype=model.dtype),
                    input_ids=text_inputs['input_ids'].to(model.device),
                    attention_mask=text_inputs['attention_mask'].to(model.device),
                    output_hidden_states=True
                )
                action = model.action_head(outputs.hidden_states[-1][:, -1])
                # Denormalize and execute (cast first: bfloat16 cannot be converted to NumPy)
                action = denormalize_actions(action.float().cpu().numpy(), action_stats)
                obs, reward, done, info = env.step(action[0])
                steps += 1
            if info.get('success', False):
                success_count += 1
    success_rate = success_count / num_episodes
    print(f"Success Rate: {success_rate*100:.1f}%")
    return success_rate
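A success rate measured over 50 episodes carries noticeable sampling uncertainty, which matters when comparing two checkpoints. One way to quantify it is a Wilson score interval for a binomial proportion. The sketch below is an optional addition, and `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

low, high = wilson_interval(successes=40, trials=50)
print(f"40/50 successes: interval from {low:.2f} to {high:.2f}")
```

If the intervals of two models overlap heavily, running more evaluation episodes is usually more informative than further tuning.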
Behavioural Cloning Metrics
def evaluate_bc_metrics(model, val_loader):
    """Evaluate behavioural cloning performance"""
    model.eval()
    total_mse = 0
    total_cosine_sim = 0
    n_samples = 0
    with torch.no_grad():
        for batch in val_loader:
            outputs = model(
                pixel_values=batch['pixel_values'],
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                output_hidden_states=True
            )
            # Cast to float32 so predictions and targets share a dtype
            pred_actions = model.action_head(outputs.hidden_states[-1][:, -1]).float()
            target_actions = batch['actions'].float()
            # MSE
            mse = F.mse_loss(pred_actions, target_actions)
            total_mse += mse.item() * len(target_actions)
            # Cosine similarity
            cosine_sim = F.cosine_similarity(pred_actions, target_actions).mean()
            total_cosine_sim += cosine_sim.item() * len(target_actions)
            n_samples += len(target_actions)
    return {
        'mse': total_mse / n_samples,
        'cosine_similarity': total_cosine_sim / n_samples
    }
Step 8: Deployment
Model Export
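# Save fine-tuned model (LoRA adapter weights)
model.save_pretrained("./finetuned_vla")
processor.save_pretrained("./finetuned_vla")
# For deployment, merge LoRA weights into the base model
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./deployed_model")
processor.save_pretrained("./deployed_model")
# The custom action head is not part of the base architecture, so store it separately
torch.save(merged_model.action_head.state_dict(), "./deployed_model/action_head.pt")
Real-time Inference
from transformers import AutoModelForVision2Seq, AutoProcessor

class VLAController:
    def __init__(self, model_path, action_dim=7):
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
        )
        self.processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
        # Re-attach the action head (ActionHead from Step 3) and restore its trained weights
        config = self.model.config
        hidden_dim = getattr(config, "hidden_size", None) or config.text_config.hidden_size
        self.model.action_head = ActionHead(hidden_dim, action_dim).to(torch.bfloat16)
        self.model.action_head.load_state_dict(
            torch.load(f"{model_path}/action_head.pt", map_location="cpu")
        )
        self.model.eval()
        self.model.to('cuda')

    @torch.inference_mode()
    def predict_action(self, image, instruction):
        """Real-time action prediction (returns a normalized action)"""
        # Preprocess
        inputs = self.processor(
            images=image,
            text=instruction,
            return_tensors="pt"
        ).to('cuda', dtype=torch.bfloat16)
        # Predict
        outputs = self.model(**inputs, output_hidden_states=True)
        action = self.model.action_head(outputs.hidden_states[-1][:, -1])
        # Cast first: bfloat16 tensors cannot be converted to NumPy
        return action.float().cpu().numpy()[0]

# Usage
controller = VLAController("./deployed_model")
action = controller.predict_action(camera_image, "pick up the cup")
action = denormalize_actions(action, action_stats)  # map back to robot units
robot.execute_action(action)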
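Before a predicted action reaches real hardware, it is worth clamping it to the robot's allowed command range. The sketch below combines the inverse of the normalization from Step 5 with a clamp. `postprocess_action`, the statistics, and the limits are hypothetical values chosen for illustration; use your robot's actual limits.

```python
import numpy as np

def postprocess_action(normalized_action, stats, low, high):
    """Undo mean/std normalization, then clamp to the robot's allowed command range."""
    action = np.asarray(normalized_action, dtype=np.float64)
    action = action * (stats["std"] + 1e-8) + stats["mean"]
    return np.clip(action, low, high)

# Hypothetical statistics and limits for a 3-dimensional action
stats = {"mean": np.array([0.0, 0.1, 0.5]), "std": np.array([0.05, 0.05, 0.5])}
low = np.array([-0.1, -0.1, 0.0])
high = np.array([0.1, 0.1, 1.0])

safe = postprocess_action([4.0, -1.0, 0.2], stats, low, high)
print(safe)
```

Here the first dimension is an outlier prediction (four standard deviations from the mean) and gets clamped to the upper limit, while the other two pass through unchanged.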
Resources
- OpenVLA
- RT-2 Paper and RT-2 website: This is the work that coined the term VLA (Vision-Language-Action model)
- PEFT Library
- RoboMimic (for dataset handling)
Conclusion
Fine-tuning VLA models enables robots to perform specialized tasks with natural language control. We hope this tutorial shows how you can adapt pre-trained models to your specific hardware and tasks, leveraging the power of large-scale pre-training while customizing for your application. Start in simulation, iterate quickly, and transition gradually to real hardware as your model improves.