
Physical AI Models

Run and fine-tune state-of-the-art robotics foundation models on Mixtrain. These Vision-Language-Action (VLA) models enable robots to understand instructions, perceive their environment, and generate actions — all from a single model.
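As a mental model (all names here are illustrative stand-ins, not part of the Mixtrain API), a VLA policy is a single function mapping an image observation plus a language instruction to a low-level action vector:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    image: List[List[float]]  # camera frame (toy: 2-D grid of pixel intensities)
    instruction: str          # natural-language command

def vla_policy(obs: Observation) -> List[float]:
    """Toy stand-in for a VLA model: one forward pass maps
    (image, instruction) -> joint-space action. A real model
    replaces this body with a neural network."""
    return [0.0] * 7  # e.g. deltas for a 7-DoF arm

action = vla_policy(Observation(image=[[0.0]], instruction="pick up the red cube"))
```

The key point is the single interface: perception, language understanding, and action generation all live behind one forward pass.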

Example Models

Model      | Provider               | Parameters | Description
-----------|------------------------|------------|------------------------------------------------------------------------------
GR00T N1.6 | NVIDIA                 |            | Foundation model for generalist humanoid robots with dual-system architecture
pi0.5      | Physical Intelligence  |            | Generalist robot policy with open-world generalization
pi0-FAST   | Physical Intelligence  |            | Autoregressive VLA with FAST action tokenizer
SmolVLA    | Hugging Face (LeRobot) | 450M       | Compact VLA that runs on consumer hardware, even a MacBook

Quick Start

from mixtrain import Model

# Load a robotics foundation model
model = Model("smolvla-base")

# Fine-tune on your robot's demonstration data
model.train(
    dataset="my-workspace/robot-demos",
    steps=20000
)

NVIDIA GR00T N1.6

NVIDIA's Isaac GR00T N1.6 is a Vision-Language-Action model with a dual-system architecture inspired by human cognition:

  • System 1 (Fast): A Diffusion Transformer running at 120 Hz for real-time motor actions.
  • System 2 (Slow): A Vision-Language Model for high-level reasoning at 10 Hz.

GR00T N1.6 is trained on a mix of real-robot trajectories, human videos, and synthetic data. It generalizes across tasks like grasping, bimanual manipulation, and multi-step object handling.
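Because the two systems run at different rates, System 2's latest plan is reused across many System 1 control ticks. A minimal sketch of that scheduling (pure Python; both "systems" are made-up stand-ins for what are really neural networks):

```python
SYSTEM1_HZ = 120  # fast motor-action loop
SYSTEM2_HZ = 10   # slow vision-language reasoning loop
TICKS_PER_PLAN = SYSTEM1_HZ // SYSTEM2_HZ  # System 2 refreshes every 12 ticks

def system2_reason(tick):
    """Stand-in for the VLM: turns the current observation into a high-level plan."""
    return {"goal": "grasp", "planned_at": tick}

def system1_act(plan, tick):
    """Stand-in for the Diffusion Transformer: emits one motor action per tick."""
    return (plan["goal"], tick)

def control_loop(total_ticks):
    plan, plan_refreshes, actions = None, 0, []
    for tick in range(total_ticks):
        if tick % TICKS_PER_PLAN == 0:           # 10 Hz: re-plan
            plan = system2_reason(tick)
            plan_refreshes += 1
        actions.append(system1_act(plan, tick))  # 120 Hz: act every tick
    return plan_refreshes, actions

refreshes, actions = control_loop(120)  # one simulated second
```

Over one simulated second, System 2 re-plans 10 times while System 1 emits 120 motor commands, each conditioned on the most recent plan.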

model = Model("groot-n1.6")

model.train(
    dataset="my-workspace/humanoid-demos",
    steps=50000
)

Physical Intelligence pi0.5

pi0.5 is a generalist robot policy from Physical Intelligence, trained on 10,000+ hours of robot data. It uses knowledge insulation to generalize to entirely new environments not seen during training, such as cleaning up a kitchen in a new home.

model = Model("pi0.5")

# Fine-tune for your specific robot and environment
model.train(
    dataset="my-workspace/kitchen-tasks",
    steps=30000
)

The pi0-FAST variant is an autoregressive VLA that uses the FAST action tokenizer to compress action chunks into far fewer tokens, speeding up both training and autoregressive decoding.
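FAST tokenizes actions in frequency space: a discrete cosine transform concentrates a smooth trajectory into a few coefficients, which are quantized and then further compressed with byte-pair encoding. A simplified sketch of the transform-and-quantize step (pure Python, omitting the BPE stage; the chunk values are illustrative):

```python
import math

def dct(xs):
    """Orthonormal DCT-II of a 1-D action sequence."""
    n = len(xs)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(xs))
        out.append((math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)) * s)
    return out

def idct(cs):
    """Inverse (DCT-III) of the orthonormal DCT-II above."""
    n = len(cs)
    return [cs[0] * math.sqrt(1 / n)
            + sum(cs[k] * math.sqrt(2 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                  for k in range(1, n))
            for i in range(n)]

def tokenize(actions, step=0.1):
    """Quantize DCT coefficients to integer tokens."""
    return [round(c / step) for c in dct(actions)]

def detokenize(tokens, step=0.1):
    return idct([t * step for t in tokens])

chunk = [0.0, 0.05, 0.1, 0.12, 0.12, 0.1, 0.05, 0.0]  # one joint over 8 timesteps
tokens = tokenize(chunk)
recovered = detokenize(tokens)
```

Because smooth trajectories have most of their energy in low-frequency coefficients, many tokens quantize to zero and the chunk compresses well, while the reconstruction stays close to the original actions.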

Hugging Face SmolVLA

SmolVLA is a 450M parameter VLA from Hugging Face's LeRobot project. Despite its small size, it matches or exceeds larger models in both simulation and real-world tasks. Its asynchronous inference stack enables robots to complete 2x more tasks within fixed time constraints.

model = Model("smolvla-base")

# Fine-tune with as few as 50 demonstration episodes
model.train(
    dataset="my-workspace/pick-and-place",
    steps=20000
)

SmolVLA is pretrained on 10 million frames from 487 community datasets on the Hugging Face Hub, spanning diverse environments from labs to living rooms.
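Asynchronous inference decouples prediction from execution: while the robot executes the current chunk of actions, the policy is already computing the next one. A toy sketch of that producer/consumer pattern with Python threads (the policy and executor here are stand-ins, not LeRobot's actual async stack):

```python
import threading
import queue

CHUNK_SIZE = 4
NUM_CHUNKS = 5

def policy_worker(chunks_out):
    """Stand-in for the VLA: predicts chunks of actions ahead of execution."""
    for chunk_id in range(NUM_CHUNKS):
        chunk = [(chunk_id, step) for step in range(CHUNK_SIZE)]
        chunks_out.put(chunk)  # blocks if the executor falls behind
    chunks_out.put(None)       # sentinel: no more chunks

def executor(chunks_in):
    """Stand-in for the robot control loop: executes actions as they arrive."""
    executed = []
    while (chunk := chunks_in.get()) is not None:
        executed.extend(chunk)  # the policy keeps predicting meanwhile
    return executed

buffer = queue.Queue(maxsize=2)  # small buffer keeps prediction one step ahead
producer = threading.Thread(target=policy_worker, args=(buffer,))
producer.start()
actions = executor(buffer)
producer.join()
```

Since prediction and execution overlap instead of alternating, the robot never idles waiting for the next chunk, which is where the fixed-time throughput gain comes from.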
