# Physical AI Models
Run and fine-tune state-of-the-art robotics foundation models on Mixtrain. These Vision-Language-Action (VLA) models enable robots to understand instructions, perceive their environment, and generate actions — all from a single model.
## Example Models
| Model | Provider | Parameters | Description |
|---|---|---|---|
| GR00T N1.6 | NVIDIA | — | Foundation model for generalist humanoid robots with dual-system architecture |
| pi0.5 | Physical Intelligence | — | Generalist robot policy with open-world generalization |
| pi0-FAST | Physical Intelligence | — | Autoregressive VLA with FAST action tokenizer |
| SmolVLA | Hugging Face (LeRobot) | 450M | Compact VLA that runs on consumer hardware, even a MacBook |
## Quick Start

```python
from mixtrain import Model

# Load a robotics foundation model
model = Model("smolvla-base")

# Fine-tune on your robot's demonstration data
model.train(
    dataset="my-workspace/robot-demos",
    steps=20000
)
```

## NVIDIA GR00T N1.6
NVIDIA's Isaac GR00T N1.6 is a Vision-Language-Action model with a dual-system architecture inspired by human cognition:
- System 1 (Fast): A Diffusion Transformer running at 120 Hz for real-time motor actions.
- System 2 (Slow): A Vision-Language Model for high-level reasoning at 10 Hz.
GR00T N1.6 is trained on a mix of real-robot trajectories, human videos, and synthetic data. It generalizes across tasks like grasping, bimanual manipulation, and multi-step object handling.
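The two systems run at different rates: at 120 Hz and 10 Hz, System 1 takes 12 fast action steps for every System 2 planning step. The following is a minimal sketch of that dual-rate control loop with stubbed `plan` and `act` functions; it is illustrative only and not the actual GR00T implementation or API.

```python
# Dual-system control loop sketch (stubs, not the real GR00T model):
# the slow planner (System 2) ticks at 10 Hz, the fast action head
# (System 1) at 120 Hz, i.e. 12 action steps per latent plan.
SLOW_HZ = 10
FAST_HZ = 120
STEPS_PER_PLAN = FAST_HZ // SLOW_HZ  # 12

def plan(observation):
    # Stub for System 2 (Vision-Language Model): returns a latent plan.
    return {"goal": observation}

def act(latent_plan, step):
    # Stub for System 1 (Diffusion Transformer): returns one motor action.
    return (latent_plan["goal"], step)

def control_loop(observations):
    actions = []
    for obs in observations:                # one System 2 tick
        latent = plan(obs)
        for step in range(STEPS_PER_PLAN):  # twelve System 1 ticks
            actions.append(act(latent, step))
    return actions

actions = control_loop(["pick up the cup"])
print(len(actions))  # 12 fast actions per slow planning step
```

The point of the split is latency hiding: the expensive vision-language reasoning only has to keep up with the 10 Hz planning rate, while the lightweight action head maintains real-time motor control.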
```python
model = Model("groot-n1.6")

model.train(
    dataset="my-workspace/humanoid-demos",
    steps=50000
)
```

## Physical Intelligence pi0.5
pi0.5 is a generalist robot policy from Physical Intelligence, trained on 10,000+ hours of robot data. It uses knowledge insulation to generalize to entirely new environments not seen during training — like cleaning up a kitchen in a new home.
```python
model = Model("pi0.5")

# Fine-tune for your specific robot and environment
model.train(
    dataset="my-workspace/kitchen-tasks",
    steps=30000
)
```

The lighter pi0-FAST variant uses the FAST action tokenizer for faster autoregressive inference.
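The core idea behind action tokenization is to map continuous robot actions into a discrete vocabulary so an autoregressive transformer can predict them token by token. The sketch below uses simple uniform binning to illustrate that idea; it is NOT the DCT-based FAST scheme, just a toy round-trip between continuous values and token ids.

```python
# Toy action tokenizer (uniform binning, NOT the FAST tokenizer):
# maps continuous action values in [-1, 1] to discrete token ids
# and back, so actions can be predicted as a token sequence.
NUM_BINS = 256

def tokenize(actions):
    # Clamp each value to [-1, 1], then map it to a bin id in 0..255.
    ids = []
    for a in actions:
        a = max(-1.0, min(1.0, a))
        ids.append(min(int((a + 1.0) / 2.0 * NUM_BINS), NUM_BINS - 1))
    return ids

def detokenize(ids):
    # Map each bin id back to the center of its bin.
    return [(i + 0.5) / NUM_BINS * 2.0 - 1.0 for i in ids]

ids = tokenize([-1.0, 0.0, 0.73])
recovered = detokenize(ids)
print(ids)        # [0, 128, 221]
print(recovered)  # values within one bin width of the originals
```

Naive per-dimension binning like this produces long, highly redundant token sequences; FAST instead compresses action chunks before tokenizing, which is what makes autoregressive decoding fast enough for control.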
## Hugging Face SmolVLA
SmolVLA is a 450M parameter VLA from Hugging Face's LeRobot project. Despite its small size, it matches or exceeds larger models in both simulation and real-world tasks. Its asynchronous inference stack enables robots to complete 2x more tasks within fixed time constraints.
```python
model = Model("smolvla-base")

# Fine-tune with as few as 50 demonstration episodes
model.train(
    dataset="my-workspace/pick-and-place",
    steps=20000
)
```

SmolVLA is pretrained on 10 million frames from 487 community datasets on the Hugging Face Hub, spanning diverse environments from labs to living rooms.
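The throughput gain from asynchronous inference comes from overlapping prediction with execution: while the robot executes the current action chunk, the next chunk is already being computed. The sketch below shows that pattern with a stubbed policy and Python's standard `threading`/`queue` modules; the function names and chunk sizes are illustrative, not the LeRobot inference stack.

```python
import queue
import threading
import time

# Asynchronous inference pattern sketch (stub policy, not LeRobot):
# a worker thread predicts the next action chunk while the main loop
# executes the current one, hiding inference latency.

def predict_chunk(obs):
    # Stub for policy inference; a real call would run the VLA model.
    time.sleep(0.01)  # simulated inference latency
    return [f"action-{obs}-{i}" for i in range(4)]

def inference_worker(obs_q, chunk_q):
    while True:
        obs = obs_q.get()
        if obs is None:  # shutdown signal
            break
        chunk_q.put(predict_chunk(obs))

obs_q, chunk_q = queue.Queue(), queue.Queue()
worker = threading.Thread(target=inference_worker, args=(obs_q, chunk_q))
worker.start()

executed = []
obs_q.put(0)                   # request the first chunk
for t in range(1, 4):
    chunk = chunk_q.get()      # wait for the ready chunk
    obs_q.put(t)               # request the next chunk in parallel
    executed.extend(chunk)     # "execute" actions while the worker predicts
obs_q.put(None)
worker.join()
executed.extend(chunk_q.get())
print(len(executed))  # 16 actions across 4 chunks
```

With a synchronous loop the robot would sit idle during every `predict_chunk` call; overlapping the two stages is what lets a fixed time budget cover roughly twice as many completed tasks.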