World Models

Post-training for world models

World models learn to simulate environments — predicting future states, understanding physics, and enabling agents to plan. Mixtrain provides the data and evaluation infrastructure to train them at scale.

World models need physics-aware evaluation

Image metrics don't capture what matters for world models. You need to measure long-horizon consistency, action-conditioned prediction accuracy, and physical plausibility — not just perceptual similarity between frames.

Most teams evaluate world models with repurposed video metrics and manual spot-checks. Mixtrain replaces that with structured evaluation built for environment simulation — from single-step prediction through multi-step rollouts to sim-to-real transfer.

Evaluation built for simulation

World models serve two purposes: predicting what happens next and enabling agents to plan. Mixtrain evaluates both with metrics that go beyond reconstruction error.

Prediction

  • Future state accuracy across single-step and multi-step horizons
  • Physics plausibility scoring — conservation laws, collision dynamics, gravity
  • Rollout stability metrics over extended time horizons
  • Action-conditioned prediction accuracy across diverse scenarios
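
To make the single-step versus multi-step distinction concrete, here is a minimal NumPy sketch of per-horizon error for an action-conditioned rollout. It is not Mixtrain's API: `rollout`, `horizon_mse`, and both toy dynamics functions are hypothetical names chosen for illustration, and the "learned" model is simply the true dynamics with a deliberate 5% bias.

```python
import numpy as np

def rollout(step_fn, state0, actions):
    """Roll a one-step model forward under a fixed action sequence."""
    states = []
    state = state0
    for a in actions:
        state = step_fn(state, a)
        states.append(state)
    return np.stack(states)

def horizon_mse(pred, target):
    """Mean squared error at each horizon step; inputs are (T, state_dim)."""
    return np.mean((pred - target) ** 2, axis=1)

# Toy ground truth: a 1-D point with position and velocity; the action is an acceleration.
def true_step(state, action, dt=0.1):
    pos, vel = state
    return np.array([pos + vel * dt, vel + action * dt])

# Toy stand-in for a learned model: same dynamics, but the action effect is 5% too strong.
def model_step(state, action, dt=0.1):
    pos, vel = state
    return np.array([pos + vel * dt, vel + 1.05 * action * dt])

actions = np.ones(50)
state0 = np.array([0.0, 0.0])
truth = rollout(true_step, state0, actions)
pred = rollout(model_step, state0, actions)

errors = horizon_mse(pred, truth)
single_step_error = errors[0]   # error after one predicted step
final_step_error = errors[-1]   # error after the full 50-step rollout
```

A small one-step bias compounds here: the final-step error is far larger than the single-step error, which is why both horizons are worth reporting.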

Planning

  • Trajectory quality evaluation for model-based planning agents
  • Reward prediction accuracy and value function calibration
  • Sim-to-real transfer benchmarks with domain gap analysis
  • Counterfactual reasoning and branching scenario evaluation

What you get

Multi-modal environment datasets

Version and manage environment datasets spanning vision, proprioception, actions, and rewards. Slice by environment, episode, or transition.

Physics-aware evaluation

Evaluate predictions against physical constraints — conservation of energy, rigid body dynamics, contact forces, and object permanence.
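
As one illustration of a physical-constraint check, the sketch below scores how well a predicted trajectory conserves energy for a point mass in free fall. It assumes heights and speeds can be read out of the model's predictions; the function names and the two test trajectories are hypothetical and do not come from Mixtrain.

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def total_energy(height, speed, mass=1.0):
    """Kinetic plus potential energy for a point mass in free flight."""
    return 0.5 * mass * speed ** 2 + mass * G * height

def energy_drift(height, speed, mass=1.0):
    """Largest relative deviation from the initial energy over the trajectory."""
    energy = total_energy(height, speed, mass)
    return np.max(np.abs(energy - energy[0])) / abs(energy[0])

# Analytic free fall from 10 m: energy is conserved up to rounding error.
t = np.linspace(0.0, 1.0, 101)
height_ok = 10.0 - 0.5 * G * t ** 2
speed_ok = G * t

# Hypothetical prediction whose speed is 10% too low for the same heights,
# so total energy leaks away as the object falls.
height_bad = height_ok.copy()
speed_bad = 0.9 * G * t

drift_ok = energy_drift(height_ok, speed_ok)
drift_bad = energy_drift(height_bad, speed_bad)
```

A drift near zero indicates the prediction respects conservation of energy; the biased trajectory drifts by roughly 9% over one second. Real scenes with friction or collisions would need a model of the expected energy loss rather than a flat zero target.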

Long-horizon benchmarks

Test rollout stability over hundreds of steps. Track error accumulation, drift, and compounding prediction failures.
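
Two simple summaries of a long rollout are the step at which error first crosses a tolerance and the rate at which error compounds. The sketch below computes both from a per-step error curve. The helper names, the threshold, and the synthetic error curve are assumptions made for illustration, not Mixtrain metrics.

```python
import numpy as np

def first_divergence_step(errors, threshold):
    """Index of the first step whose error exceeds the threshold, or None."""
    over = np.flatnonzero(np.asarray(errors) > threshold)
    return int(over[0]) if over.size else None

def growth_rate(errors):
    """Slope of log(error) against step index; requires strictly positive errors.

    A positive slope means the error compounds multiplicatively per step.
    """
    errors = np.asarray(errors, dtype=float)
    steps = np.arange(errors.size)
    slope, _ = np.polyfit(steps, np.log(errors), 1)
    return slope

# Synthetic per-step errors that grow 2% per step over a 300-step rollout.
steps = np.arange(300)
errors = 0.01 * 1.02 ** steps

diverged_at = first_divergence_step(errors, threshold=0.1)
rate = growth_rate(errors)
```

On this synthetic curve the error first exceeds 0.1 at step 117, and the fitted rate recovers log(1.02). On real rollouts the curve is noisier, so the fit is better treated as a trend indicator than an exact constant.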

Action-conditioned training

Build training pipelines that condition on action sequences, with support for discrete, continuous, and hierarchical action spaces.

Distributed training

Launch multi-node training jobs with your own scripts. Automatic checkpointing, configurable compute, and full experiment tracking.

Latent space analysis

Visualize and analyze learned representations. Track latent space structure, disentanglement metrics, and representation quality across training.

Sim-to-real evaluation

Benchmark transfer performance with structured domain gap analysis. Compare simulation predictions against real-world trajectories.
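
One basic form of domain gap analysis is a per-dimension error between simulated and real trajectories, scaled by how much the real signal varies. The sketch below assumes the two trajectories are paired and time-aligned, which is often not true in practice. `normalized_gap` and the synthetic data are hypothetical and stand in for whatever comparison a given setup uses.

```python
import numpy as np

def normalized_gap(sim, real):
    """Per-dimension RMSE between paired sim and real trajectories,
    divided by the standard deviation of the real data."""
    rmse = np.sqrt(np.mean((sim - real) ** 2, axis=0))
    return rmse / np.std(real, axis=0)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))  # stand-in for logged real-world states

# Hypothetical simulator: matches dimensions 0 and 1 exactly,
# but carries a constant offset in dimension 2.
sim = real.copy()
sim[:, 2] += 0.5

gap = normalized_gap(sim, real)
worst_dim = int(np.argmax(gap))  # state dimension with the largest gap
```

Breaking the gap down per dimension points to which part of the state the simulator gets wrong, which a single aggregate score would hide.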

Production export

Export models with optimized inference configs, latency baselines, and regression monitoring. ONNX and TensorRT support.

Start building

Train world models that understand physics. Free to start, no credit card required.