
Physical AI Models

Run and fine-tune state-of-the-art robotics foundation models on Mixtrain. These Vision-Language-Action (VLA) models enable robots to understand instructions, perceive their environment, and generate actions — all from a single model.

Example Models

| Model | Provider | Parameters | Description |
| --- | --- | --- | --- |
| GR00T N1.6 | NVIDIA | — | Foundation model for generalist humanoid robots with dual-system architecture |
| pi0.5 | Physical Intelligence | — | Generalist robot policy with open-world generalization |
| pi0-FAST | Physical Intelligence | — | Autoregressive VLA with FAST action tokenizer |
| SmolVLA | Hugging Face (LeRobot) | 450M | Compact VLA that runs on consumer hardware, even a MacBook |

Quick Start

from mixtrain import Workflow

# Fine-tune a robotics foundation model on your robot's demonstration data
workflow = Workflow("robot-policy-finetune")
result = workflow.run(
    dataset="my-workspace/robot-demos",
    base_model="smolvla-base",
    steps=20000,
)

NVIDIA GR00T N1.6

NVIDIA's Isaac GR00T N1.6 is a Vision-Language-Action model with a dual-system architecture inspired by human cognition:

  • System 1 (Fast): A Diffusion Transformer running at 120 Hz for real-time motor actions.
  • System 2 (Slow): A Vision-Language Model for high-level reasoning at 10 Hz.

GR00T N1.6 is trained on a mix of real-robot trajectories, human videos, and synthetic data. It generalizes across tasks like grasping, bimanual manipulation, and multi-step object handling.
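The timing relationship between the two systems can be sketched as follows. This is an illustrative sketch only (the function and rates below paraphrase the architecture description above; `dual_system_step` is a hypothetical name, not a GR00T API): each slow System 2 planning tick at 10 Hz covers a chunk of fast System 1 actions at 120 Hz.

```python
def dual_system_step(observation, plan_rate_hz=10, act_rate_hz=120):
    """One System 2 planning tick paired with its System 1 action chunk."""
    # System 2 (slow): the VLM reasons about the scene and instruction
    plan = f"plan-for-{observation}"
    # System 1 (fast): the diffusion transformer emits motor actions until
    # the next plan arrives: 120 Hz / 10 Hz = 12 actions per planning tick
    actions_per_plan = act_rate_hz // plan_rate_hz
    return [f"{plan}/action-{i}" for i in range(actions_per_plan)]
```

Each call corresponds to one reasoning step; in a real control loop the two systems run concurrently rather than in lockstep.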

from mixtrain import Workflow

workflow = Workflow("robot-policy-finetune")
result = workflow.run(
    dataset="my-workspace/humanoid-demos",
    base_model="groot-n1.6",
    steps=50000,
)

Physical Intelligence pi0.5

pi0.5 is a generalist robot policy from Physical Intelligence, trained on 10,000+ hours of robot data. It uses knowledge insulation to generalize to entirely new environments not seen during training — like cleaning up a kitchen in a new home.

from mixtrain import Workflow

# Fine-tune for your specific robot and environment
workflow = Workflow("robot-policy-finetune")
result = workflow.run(
    dataset="my-workspace/kitchen-tasks",
    base_model="pi0.5",
    steps=30000,
)

The related pi0-FAST variant uses the FAST action tokenizer, which compresses continuous action chunks into discrete tokens so the policy can be trained as an ordinary autoregressive model.
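The core idea behind FAST-style action tokenization can be sketched in a few lines: transform an action chunk with a DCT along the time axis so most of the energy lands in a few low-frequency coefficients, then quantize. This is a simplified illustration, not Physical Intelligence's implementation (the real tokenizer also applies BPE over the quantized coefficients, which is omitted here):

```python
import numpy as np

def dct_basis(n):
    # orthonormal DCT-II basis: rows index frequency, columns index time
    k, t = np.arange(n)[:, None], np.arange(n)[None, :]
    M = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    M[0] /= np.sqrt(2.0)
    return M

def tokenize(chunk, scale=50.0):
    # chunk: (horizon, action_dim) continuous actions.
    # DCT along time concentrates energy in low frequencies;
    # rounding the scaled coefficients yields small integer tokens.
    return np.round(dct_basis(len(chunk)) @ chunk * scale).astype(int)

def detokenize(tokens, scale=50.0):
    # the basis is orthonormal, so the inverse transform is the transpose
    return dct_basis(len(tokens)).T @ (tokens / scale)
```

Because the transform is orthonormal, detokenizing recovers the original actions up to quantization error, which the `scale` parameter controls.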

Hugging Face SmolVLA

SmolVLA is a 450M parameter VLA from Hugging Face's LeRobot project. Despite its small size, it matches or exceeds larger models in both simulation and real-world tasks. Its asynchronous inference stack enables robots to complete 2x more tasks within fixed time constraints.
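The asynchronous inference idea can be sketched with a producer/consumer pattern: while the robot executes the current action chunk, the next chunk is already being predicted. This is a conceptual sketch, not LeRobot's actual inference stack (`policy_infer` is a stand-in for the SmolVLA forward pass):

```python
import queue
import threading
import time

def policy_infer(obs):
    # stand-in for the SmolVLA forward pass; returns a chunk of actions
    time.sleep(0.01)  # simulated inference latency
    return [f"action-{obs}-{i}" for i in range(4)]

def async_rollout(n_chunks=5):
    """Overlap inference with execution via a bounded queue."""
    chunks = queue.Queue(maxsize=1)

    def producer():
        for step in range(n_chunks):
            chunks.put(policy_infer(step))
        chunks.put(None)  # sentinel: rollout finished

    threading.Thread(target=producer, daemon=True).start()
    executed = []
    while (chunk := chunks.get()) is not None:
        for action in chunk:  # execution overlaps the next inference call
            executed.append(action)
    return executed
```

Decoupling prediction from execution this way hides inference latency behind motion, which is how the stack fits more completed tasks into a fixed time budget.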

from mixtrain import Workflow

# Fine-tune with as few as 50 demonstration episodes
workflow = Workflow("robot-policy-finetune")
result = workflow.run(
    dataset="my-workspace/pick-and-place",
    base_model="smolvla-base",
    steps=20000,
)

SmolVLA is pretrained on 10 million frames from 487 community datasets on the Hugging Face Hub, spanning diverse environments from labs to living rooms.
