URL: https://www.progressiverobot.com/vision-language-action-finetuning-robotics/

Introduction

Vision-Language-Action (VLA) models represent a breakthrough in embodied AI, combining visual perception, language understanding, and robotic control into unified models that can follow natural language instructions to perform physical tasks. This tutorial covers the complete process of fine-tuning VLA models for your specific robotic applications.

What are Vision-Language-Action Models?

VLAs combine computer vision (interpreting images/videos), natural language processing (understanding and generating text), and action execution (interacting with environments or systems). This allows them to perceive, reason, and act based on both visual and textual inputs.

If you want to see VLAs in action, check out the Google DeepMind Robotics Lab Tour with Hannah Fry.

Popular VLA architectures include OpenVLA, RT-2 (Robotic Transformer 2), and PaLM-E.

Key Takeaways

VLA models unify vision, language, and action by mapping camera images and natural language instructions directly to robot actions in an end-to-end architecture
Use LoRA for efficient fine-tuning, training only 0.1-0.2% of parameters instead of the full model, enabling training on a single GPU with 24GB VRAM
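
To see why the LoRA figure above can fit on a 24GB card, a back-of-the-envelope estimate helps. The sketch below is a rough lower bound, not a measurement. It assumes a nominal 7B-parameter base model stored in bf16 (2 bytes per parameter), trainable parameters kept in fp32, and AdamW holding a gradient plus two moment estimates per trainable parameter. Activations are ignored and add more on top. The function name and defaults are illustrative.

```python
GIB = 1024 ** 3

def estimate_training_memory_gib(frozen_params, trainable_params,
                                 frozen_bytes=2, trainable_bytes=4):
    """Rough lower bound on training memory in GiB (activations ignored)."""
    # Frozen weights are only stored: no gradient, no optimizer state.
    frozen_mem = frozen_params * frozen_bytes
    # Each trainable parameter needs: weight + gradient + two AdamW moments.
    trainable_mem = trainable_params * trainable_bytes * 4
    return (frozen_mem + trainable_mem) / GIB

# LoRA: the 7B base stays frozen, adapters add about 0.2% on top (assumed).
lora_gib = estimate_training_memory_gib(7e9, 0.002 * 7e9)
# Full fine-tuning: every parameter is trainable.
full_gib = estimate_training_memory_gib(0, 7e9)

print(f"LoRA:             ~{lora_gib:.1f} GiB before activations")
print(f"Full fine-tuning: ~{full_gib:.1f} GiB before activations")
```

Under these assumptions the frozen weights dominate the LoRA case, which is why the batch size and image resolution (both of which drive activation memory) decide whether a run actually fits.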

Prerequisites

Hardware Requirements

GPU with at least 24GB VRAM (RTX 3090/4090, A100, or H100)
Robot hardware or simulation environment
Camera(s) for visual input when running inference

Step 0: Set up GPU and Jupyter Notebook

Set up a GPU cloud server, then connect to it by running ssh root@[IPV4] in your terminal.

In Terminal:

				
# Core dependencies
pip install torch torchvision
pip install transformers accelerate
pip install datasets peft  # peft provides LoRA (used in Step 4)
pip install wandb  # for experiment tracking

# JupyterLab is a single package named "jupyterlab"
pip install jupyterlab
jupyter lab

Step 1: Understanding the VLA Architecture

Key components:

Vision Encoder: Processes camera images into visual embeddings
Language Encoder: Converts text instructions into language embeddings
Fusion Module: Combines visual and language information
Action Decoder: Predicts robot actions from fused representations
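
To make the data flow between these four components concrete, here is a toy, shape-level sketch in NumPy. It is not a real VLA: random matrices stand in for trained weights, mean-pooling stands in for the transformer that real models use for fusion, and all sizes and function names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32           # shared embedding width (toy value)
PATCH = 8        # patch side length in pixels
VOCAB = 100      # toy vocabulary size
ACTION_DIM = 7   # e.g., 6-DOF arm + gripper

# Random weights stand in for trained parameters.
W_patch = rng.normal(size=(PATCH * PATCH * 3, D)) * 0.02
token_table = rng.normal(size=(VOCAB, D)) * 0.02
W_action = rng.normal(size=(D, ACTION_DIM)) * 0.02

def vision_encoder(image):
    """Split an (H, W, 3) image into patches and project each to D dims."""
    h, w, c = image.shape
    patches = image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, PATCH * PATCH * c)
    return patches @ W_patch                      # (num_patches, D)

def language_encoder(token_ids):
    """Look up one embedding per instruction token."""
    return token_table[token_ids]                 # (num_tokens, D)

def fusion(visual, text):
    """Simplest possible fusion: one joint sequence, then mean-pool."""
    return np.concatenate([visual, text], axis=0).mean(axis=0)   # (D,)

def action_decoder(fused):
    """Linear readout from the fused vector to a continuous action."""
    return fused @ W_action                       # (ACTION_DIM,)

image = rng.random((32, 32, 3))
token_ids = np.array([5, 17, 42, 8])
visual = vision_encoder(image)
text = language_encoder(token_ids)
action = action_decoder(fusion(visual, text))
print(visual.shape, text.shape, action.shape)
```

The point to take away is the interface: images and text both become sequences of vectors of the same width, and a single vector derived from the joint sequence is decoded into an action.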

Step 2: Preparing Your Dataset

Data Collection

Your dataset should contain episodes with:

Images: RGB camera observations at each timestep
Language: Natural language task descriptions
Actions: Robot action sequences (joint positions, velocities, or end-effector poses)
Metadata: Success labels, episode length, etc.

Data Format Example

				
{
    "episode_0": {
        "images": [img_0, img_1, ..., img_T],  # shape: (T, H, W, 3) where T is total frames
        "language": "pick up the red block",
        "actions": [act_0, act_1, ..., act_T],  # shape: (T, action_dim)
        "success": True
    }
}
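
Before training, it can be worth checking that each episode actually follows this layout. The helper below is a minimal sketch with a hypothetical name (validate_episode); it only checks lengths and the presence of an instruction, and assumes images and actions are sequences indexed by timestep.

```python
def validate_episode(episode, action_dim):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    images = episode.get("images")
    actions = episode.get("actions")
    if images is None or actions is None or len(images) == 0:
        return ["episode needs non-empty 'images' and 'actions'"]
    # One action per image, so the two sequences must have equal length.
    if len(images) != len(actions):
        problems.append(f"{len(images)} images but {len(actions)} actions")
    bad_steps = [t for t, a in enumerate(actions) if len(a) != action_dim]
    if bad_steps:
        problems.append(f"action size != {action_dim} at timesteps {bad_steps}")
    instruction = episode.get("language")
    if not isinstance(instruction, str) or not instruction.strip():
        problems.append("missing language instruction")
    return problems

good = {"images": ["img_0", "img_1"], "language": "pick up the red block",
        "actions": [[0.0] * 7, [0.1] * 7], "success": True}
bad = {"images": ["img_0", "img_1"], "language": "",
       "actions": [[0.0] * 7], "success": False}
print(validate_episode(good, action_dim=7))
print(validate_episode(bad, action_dim=7))
```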

Dataset Organization

				
from datasets import Dataset, DatasetDict
import numpy as np

def create_vla_dataset(episodes):
    """Convert robot episodes to HuggingFace dataset format"""
    
    data = {
        "images": [],
        "language": [],
        "actions": [],
        "episode_id": []
    }
    
    for ep_id, episode in enumerate(episodes):
        for t in range(len(episode['images'])):
            data['images'].append(episode['images'][t])
            data['language'].append(episode['language'])
            data['actions'].append(episode['actions'][t])
            data['episode_id'].append(ep_id)
    
    return Dataset.from_dict(data)

# Split into train/val
dataset = create_vla_dataset(your_episodes)
dataset = dataset.train_test_split(test_size=0.1)
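
One caveat with the split above: train_test_split operates on the flattened, per-timestep rows, so frames from the same episode can end up in both train and validation, which tends to make validation loss look better than it is. One option is to split at the episode level before flattening. The helper below is a sketch with a hypothetical name (split_episodes), not part of the datasets library.

```python
import random

def split_episodes(episodes, val_fraction=0.1, seed=0):
    """Shuffle whole episodes, then hold out a fraction for validation."""
    indices = list(range(len(episodes)))
    random.Random(seed).shuffle(indices)
    # Always hold out at least one episode.
    n_val = max(1, round(len(episodes) * val_fraction))
    val_ids = set(indices[:n_val])
    train = [ep for i, ep in enumerate(episodes) if i not in val_ids]
    val = [ep for i, ep in enumerate(episodes) if i in val_ids]
    return train, val

# Placeholder episodes just to show the split sizes.
episodes = [{"language": f"task {i}", "images": [], "actions": []}
            for i in range(20)]
train_eps, val_eps = split_episodes(episodes, val_fraction=0.1)
print(len(train_eps), len(val_eps))
```

Each list can then be passed to create_vla_dataset separately to build the train and validation datasets.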

Step 3: Setting Up the Base Model

Loading a Pre-trained VLA

				
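from transformers import AutoModelForVision2Seq, AutoProcessor
import torch

# Load pre-trained VLA (example using OpenVLA).
# OpenVLA ships custom modeling code on the Hub, so it is loaded through
# AutoModelForVision2Seq with trust_remote_code=True.
model_name = "openvla/openvla-7b"
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Freeze base model parameters (optional for LoRA, which freezes them anyway)
for param in model.parameters():
    param.requires_grad = False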

Action Space Adaptation

Your robot's action space likely differs from the pre-training data:

				
from torch import nn

class ActionHead(nn.Module):
    def __init__(self, hidden_dim, action_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(512, action_dim)
        )
    
    def forward(self, x):
        return self.fc(x)

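# Add custom action head on the same device as the base model.
# This assumes the config exposes hidden_size at the top level; some
# multimodal configs nest it under a text sub-config instead.
model.action_head = ActionHead(
    hidden_dim=model.config.hidden_size,
    action_dim=7  # e.g., 6-DOF arm + gripper
).to(model.device)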

Step 4: Efficient Fine-Tuning with LoRA

Low-Rank Adaptation (LoRA) enables efficient fine-tuning by reducing the number of trainable parameters:

				
from peft import LoraConfig, get_peft_model

# Configure LoRA
lora_config = LoraConfig(
    r=16,  # rank
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    # get_peft_model freezes everything except the adapters, so list the
    # new action head here to keep it trainable and saved with the adapter
    modules_to_save=["action_head"],
    task_type="CAUSAL_LM"
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; expect well under 1% trainable
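
The trainable-parameter count can be sanity-checked by hand: LoRA adds two low-rank matrices per adapted weight, so a weight of shape (out, in) contributes rank * (in + out) parameters. The sketch below does that arithmetic under assumed backbone dimensions (a Llama-2-7B-style language model with 32 layers and hidden size 4096); it counts only the four attention projections in the language model and ignores anything else that may be trainable. The helper name is illustrative.

```python
def lora_param_count(rank, layer_shapes):
    """Sum LoRA parameters: each adapted (out, in) weight adds rank * (in + out)."""
    return sum(rank * (d_in + d_out) for d_out, d_in in layer_shapes)

# Assumed backbone dimensions (Llama-2-7B-style): 32 layers, hidden size 4096,
# with q_proj, k_proj, v_proj and o_proj each a 4096 x 4096 matrix.
hidden, num_layers, modules_per_layer = 4096, 32, 4
shapes = [(hidden, hidden)] * (num_layers * modules_per_layer)

trainable = lora_param_count(rank=16, layer_shapes=shapes)
print(f"{trainable:,} LoRA parameters")
print(f"{100 * trainable / 7e9:.2f}% of a nominal 7B parameters")
```

The count is linear in both the rank and the number of adapted modules, so halving the rank or adapting only two of the four projections halves it.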

Step 5: Data Processing and Augmentation

Image Preprocessing

				
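from torchvision import transforms

# Note: horizontal flips are left out on purpose. Mirroring the image swaps
# left and right in the scene without changing the recorded actions or the
# instruction, so the flipped samples would carry wrong labels.
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])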

Action Normalization

				
import numpy as np

def normalize_actions(actions, stats):
    """Standardize actions to zero mean and unit variance per dimension"""
    return (actions - stats['mean']) / (stats['std'] + 1e-8)

def denormalize_actions(normalized_actions, stats):
    """Convert back to original action space"""
    return normalized_actions * (stats['std'] + 1e-8) + stats['mean']

# Calculate statistics from your dataset
# (dataset columns are not NumPy arrays, so convert before reducing)
train_actions = np.asarray(dataset['train']['actions'], dtype=np.float32)
action_stats = {
    'mean': train_actions.mean(axis=0),
    'std': train_actions.std(axis=0)
}
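
Note that the helpers above standardize each action dimension (subtract the mean, divide by the standard deviation), which does not bound the values. If you want actions that are guaranteed to lie in [-1, 1], min-max scaling is one alternative. The sketch below uses hypothetical helper names and synthetic data; it is a different technique from the z-score helpers, shown only for comparison.

```python
import numpy as np

def fit_minmax(actions):
    """Per-dimension min and max over an (N, action_dim) array."""
    return {"min": actions.min(axis=0), "max": actions.max(axis=0)}

def to_unit_range(actions, stats, eps=1e-8):
    """Scale each dimension linearly so the training data spans [-1, 1]."""
    span = stats["max"] - stats["min"] + eps
    return 2.0 * (actions - stats["min"]) / span - 1.0

def from_unit_range(scaled, stats, eps=1e-8):
    """Invert to_unit_range."""
    span = stats["max"] - stats["min"] + eps
    return (scaled + 1.0) / 2.0 * span + stats["min"]

rng = np.random.default_rng(0)
actions = rng.normal(loc=0.3, scale=2.0, size=(1000, 7))  # synthetic actions
stats = fit_minmax(actions)
scaled = to_unit_range(actions, stats)
restored = from_unit_range(scaled, stats)
print(scaled.min(), scaled.max())
print(np.allclose(restored, actions))
```

Min-max scaling is sensitive to outliers in the demonstrations, so clipping or percentile-based bounds may be worth considering if your data contains occasional extreme actions.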

DataLoader Setup

				
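from torch.utils.data import DataLoader
from PIL import Image
import numpy as np

def collate_fn(batch):
    """Custom collate function for VLA data"""
    # Dataset rows come back as nested lists, so rebuild uint8 images first
    images = [Image.fromarray(np.asarray(b['images'], dtype=np.uint8))
              for b in batch]
    # The image processor returns a dict-like batch, not a tensor
    pixel_values = processor.image_processor(
        images, return_tensors="pt")['pixel_values']
    text = processor.tokenizer([b['language'] for b in batch],
                               padding=True, return_tensors="pt")
    # Train on normalized actions so evaluation can denormalize predictions
    actions = np.asarray([b['actions'] for b in batch], dtype=np.float32)
    actions = torch.tensor(normalize_actions(actions, action_stats),
                           dtype=torch.float32)

    return {
        'pixel_values': pixel_values,
        'input_ids': text['input_ids'],
        'attention_mask': text['attention_mask'],
        'actions': actions
    }

train_loader = DataLoader(
    dataset['train'],
    batch_size=32,
    shuffle=True,
    collate_fn=collate_fn,
    num_workers=4
)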

Step 6: Training Loop

Loss Function

				
import torch.nn.functional as F

def vla_loss(predicted_actions, target_actions, reduction='mean'):
    """Action prediction loss (MSE for continuous actions)"""
    mse_loss = F.mse_loss(predicted_actions, target_actions, reduction=reduction)
    return mse_loss

# Alternative: Action chunking for temporal coherence
def chunked_action_loss(pred_chunks, target_chunks, chunk_size=10):
    """Predict multiple future actions at once"""
    loss = 0
    for i in range(chunk_size):
        loss += F.mse_loss(pred_chunks[:, i], target_chunks[:, i])
    return loss / chunk_size
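
The chunked loss above expects targets of shape (batch, chunk_size, action_dim). One way to build such targets from a recorded trajectory is a sliding window over timesteps. This is a minimal sketch with a hypothetical helper name (make_action_chunks); it drops the last few timesteps that do not have a full chunk of future actions.

```python
import numpy as np

def make_action_chunks(actions, chunk_size):
    """Turn a (T, action_dim) trajectory into overlapping chunks.

    Returns an array of shape (T - chunk_size + 1, chunk_size, action_dim),
    where chunk t holds the actions for timesteps t .. t + chunk_size - 1.
    """
    actions = np.asarray(actions)
    num_chunks = len(actions) - chunk_size + 1
    if num_chunks < 1:
        raise ValueError("episode is shorter than chunk_size")
    return np.stack([actions[t:t + chunk_size] for t in range(num_chunks)])

# Synthetic trajectory: T=12 timesteps, 7-dimensional actions.
trajectory = np.arange(12 * 7, dtype=np.float32).reshape(12, 7)
chunks = make_action_chunks(trajectory, chunk_size=10)
print(chunks.shape)
```

To use chunked targets, the action head would also need to output chunk_size * action_dim values per sample, reshaped to (batch, chunk_size, action_dim).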

Training Script

				
from accelerate import Accelerator
from tqdm import tqdm
import wandb

# Initialize accelerator for distributed training
accelerator = Accelerator(mixed_precision='bf16')

# Setup: optimize only the parameters left trainable (e.g., the LoRA adapters)
num_epochs = 10
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4, weight_decay=0.01)

# Prepare for distributed training
model, optimizer, train_loader = accelerator.prepare(
    model, optimizer, train_loader
)

# Anneal over the whole run: one scheduler step per optimizer step
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=num_epochs * len(train_loader)
)

# Initialize wandb
wandb.init(project="vla-finetuning", config={
    "learning_rate": 1e-4,
    "batch_size": 32,
    "epochs": num_epochs
})

# Training loop
global_step = 0
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0
    
    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}"):
        # Forward pass (request hidden states so the action head can read them)
        outputs = model(
            pixel_values=batch['pixel_values'],
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            output_hidden_states=True
        )
        
        # Predict actions from the final layer's last token.
        # With right-padded batches, index the last non-padding token instead.
        last_hidden = outputs.hidden_states[-1][:, -1]
        predicted_actions = model.action_head(last_hidden.float())
        
        # Compute loss
        loss = vla_loss(predicted_actions, batch['actions'])
        
        # Backward pass
        accelerator.backward(loss)
        accelerator.clip_grad_norm_(trainable_params, max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        
        # Logging
        epoch_loss += loss.item()
        global_step += 1
        
        if global_step % 100 == 0:
            wandb.log({
                "loss": loss.item(),
                "learning_rate": scheduler.get_last_lr()[0],
                "epoch": epoch
            })
    
    print(f"Epoch {epoch+1} - Average Loss: {epoch_loss / len(train_loader):.4f}")
    
    # Save checkpoint
    if (epoch + 1) % 2 == 0:
        accelerator.save_model(model, f"checkpoints/epoch_{epoch+1}")
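
The training script relies on CosineAnnealingLR, whose T_max argument sets the step at which the learning rate reaches its minimum, so it should match the number of optimizer steps in your run. The sketch below evaluates the closed-form cosine schedule in plain Python to show the shape of the curve; the step counts are illustrative, and the helper is a stand-in for inspection, not a replacement for the PyTorch scheduler.

```python
import math

def cosine_lr(step, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing from base_lr (step 0) to eta_min (step t_max)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))

steps_per_epoch, num_epochs = 500, 10      # illustrative numbers
total_steps = steps_per_epoch * num_epochs
for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, base_lr=1e-4, t_max=total_steps):.2e}")
```

If T_max is much larger than the real number of steps, the learning rate barely decays before training ends; if it is much smaller, the rate bottoms out early.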

Step 7: Evaluation

Simulation Evaluation

				
from PIL import Image

def evaluate_in_simulation(model, env, num_episodes=50):
    """Evaluate model in simulation environment"""
    model.eval()
    success_count = 0
    
    with torch.no_grad():
        for episode in range(num_episodes):
            obs = env.reset()
            instruction = env.get_task_instruction()
            done = False
            
            while not done:
                # Process observation with the same processor used in training
                # (assumes obs['image'] is a uint8 H x W x 3 array)
                inputs = processor(
                    images=Image.fromarray(obs['image']),
                    text=instruction,
                    return_tensors="pt"
                ).to(model.device, dtype=torch.bfloat16)
                
                # Predict action from the final layer's last token
                outputs = model(**inputs, output_hidden_states=True)
                action = model.action_head(outputs.hidden_states[-1][:, -1].float())
                
                # Denormalize and execute
                action = denormalize_actions(action.cpu().numpy(), action_stats)
                obs, reward, done, info = env.step(action[0])
            
            if info['success']:
                success_count += 1
    
    success_rate = success_count / num_episodes
    print(f"Success Rate: {success_rate*100:.1f}%")
    return success_rate
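
A success rate measured over 50 episodes carries noticeable statistical uncertainty, which matters when comparing two checkpoints. The sketch below computes a Wilson score interval for the success rate; the helper name is hypothetical, z=1.96 corresponds to an approximate 95% interval, and episodes are assumed to be independent.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial success rate."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return center - half, center + half

low, high = wilson_interval(successes=40, trials=50)
print(f"40/50 successes, 95% interval: {low:.1%} to {high:.1%}")
```

If the intervals of two checkpoints overlap heavily, running more evaluation episodes is usually more informative than reading anything into the difference.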

Behavioural Cloning Metrics

				
def evaluate_bc_metrics(model, val_loader):
    """Evaluate behavioural cloning performance"""
    model.eval()
    device = next(model.parameters()).device
    total_mse = 0
    total_cosine_sim = 0
    n_samples = 0
    
    with torch.no_grad():
        for batch in val_loader:
            # val_loader is not prepared by accelerate, so move tensors manually
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(
                pixel_values=batch['pixel_values'].to(torch.bfloat16),
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                output_hidden_states=True
            )
            pred_actions = model.action_head(outputs.hidden_states[-1][:, -1].float())
            
            # MSE
            mse = F.mse_loss(pred_actions, batch['actions'])
            total_mse += mse.item() * len(batch['actions'])
            
            # Cosine similarity
            cosine_sim = F.cosine_similarity(pred_actions, batch['actions']).mean()
            total_cosine_sim += cosine_sim.item() * len(batch['actions'])
            
            n_samples += len(batch['actions'])
    
    return {
        'mse': total_mse / n_samples,
        'cosine_similarity': total_cosine_sim / n_samples
    }

Step 8: Deployment

Model Export

				
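# Save fine-tuned model (adapter weights) and processor
model.save_pretrained("./finetuned_vla")
processor.save_pretrained("./finetuned_vla")

# For deployment, merge LoRA weights into a fresh copy of the base model
from peft import PeftModel
from transformers import AutoModelForVision2Seq

base_model = AutoModelForVision2Seq.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
)
# The custom action head is not part of the base checkpoint, so attach one
# before loading the adapter (ActionHead is the class from Step 3)
base_model.action_head = ActionHead(
    hidden_dim=base_model.config.hidden_size, action_dim=7
)
merged_model = PeftModel.from_pretrained(base_model, "./finetuned_vla")
merged_model = merged_model.merge_and_unload()
merged_model.save_pretrained("./deployed_model")
processor.save_pretrained("./deployed_model")  # the controller loads both from here

# from_pretrained only restores the base architecture, so keep the head's
# weights in a separate file next to the merged model
torch.save(merged_model.action_head.state_dict(),
           "./deployed_model/action_head.pt")

Real-time Inference

				
class VLAController:
    def __init__(self, model_path, action_dim=7):
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
        )
        self.processor = AutoProcessor.from_pretrained(
            model_path, trust_remote_code=True
        )
        # Rebuild the action head and restore its saved weights
        self.action_head = ActionHead(self.model.config.hidden_size, action_dim)
        self.action_head.load_state_dict(
            torch.load(f"{model_path}/action_head.pt", map_location="cpu")
        )
        self.model.eval().to('cuda')
        self.action_head.eval().to('cuda')
    
    @torch.inference_mode()
    def predict_action(self, image, instruction):
        """Real-time action prediction"""
        # Preprocess (floating-point inputs are cast to the model's dtype)
        inputs = self.processor(
            images=image,
            text=instruction,
            return_tensors="pt"
        ).to('cuda', dtype=torch.bfloat16)
        
        # Predict from the final layer's last token
        outputs = self.model(**inputs, output_hidden_states=True)
        action = self.action_head(outputs.hidden_states[-1][:, -1].float())
        
        # Undo the training-time normalization before sending to the robot
        action = denormalize_actions(action.cpu().numpy(), action_stats)
        return action[0]

# Usage
controller = VLAController("./deployed_model")
action = controller.predict_action(camera_image, "pick up the cup")
robot.execute_action(action)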
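
On real hardware it is common to bound the predicted action before executing it, since a learned policy can occasionally output values outside the range seen in training. The sketch below clamps each dimension to fixed limits; the helper name and the limit values are hypothetical and should be replaced with the limits of your own robot.

```python
def clip_action(action, low, high):
    """Clamp each action dimension to [low[i], high[i]] before execution."""
    if not (len(action) == len(low) == len(high)):
        raise ValueError("action and limits must have the same length")
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]

# Hypothetical limits for a 7-D action: 6 end-effector deltas plus gripper.
LOW = [-0.05] * 6 + [0.0]
HIGH = [0.05] * 6 + [1.0]

raw = [0.02, -0.30, 0.00, 0.10, -0.01, 0.04, 1.7]
safe = clip_action(raw, LOW, HIGH)
print(safe)
```

In the usage example above, this would sit between controller.predict_action and robot.execute_action.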

Resources

OpenVLA
RT-2 Paper and RT-2 website: This is the work that coined the term VLA (Vision-Language-Action model)
PEFT Library
RoboMimic (for dataset handling)

OpenVLA: LeRobot Research Presentation #5 by Moo Jin Kim

Conclusion

Fine-tuning VLA models enables robots to perform specialized tasks with natural language control. This tutorial showed how to adapt a pre-trained model to your specific hardware and tasks, leveraging large-scale pre-training while customizing for your application. Start in simulation, iterate quickly, and move to real hardware gradually as your model improves.