Mixtrain Dataset Format v1.0

A dataset format for storing multimodal temporal sequences: synchronized time-indexed data spanning video, sensor signals, actions, and metadata. Each row is one temporal sequence (a robot episode, a video clip, a driving segment, a simulation rollout). The format uses Parquet files with array columns for structured signals, file references for external data, and a lightweight metadata convention. It works locally as standalone Parquet files, or cloud-native on top of open table formats.

Motivation

Working with multimodal data in robotics, video generation, world models, and autonomous driving means dealing with time-synchronized signals (video feeds, sensor readings, action sequences, scalar metrics) that need to stay accessible throughout the entire data lifecycle. Curating and training have fundamentally different access patterns: curating needs point reads, filtering, and browsing individual sequences; training needs high-throughput sequential reads across the entire dataset. Columnar storage means readers only load the columns they need, whether that's scalar metadata for filtering or the full row for training, from the same files.

Robotics

sequence_id  robot_id  task          reward  action          obs_state       video_cam_high
ep_001       go1       pick-n-place  42.3    float[200][12]  float[200][48]  s3://.../1.mp4

World Models

sequence_id  environment  video_input   video_output  action        latent_state
sim_001      kitchen_v2   s3://.../mp4  s3://.../mp4  float[90][6]  float[90][256]

Vision-Language Models

sequence_id  instruction            response             video         image           bbox
vlm_001      "Describe this video"  "A person is pick…"  s3://.../mp4  s3://.../f.png  float[30][4]

Format Specification

A Mixtrain dataset is a collection of Parquet files where each row represents one temporal sequence. All time-indexed data for a sequence lives in that row, as array columns for structured signals and as file references for external data.

Columns fall into three categories:

Scalar Metadata Columns

Standard Parquet/Arrow primitive types. Used for filtering, aggregation, and display.

sequence_id      : string              — unique, required
duration_seconds : float
reward           : float
length           : int                 — number of timesteps
source           : string              — "real", "sim", "augmented"
split            : string              — "train", "val", "test"
label            : string
created_at       : timestamp

All scalar columns are optional except sequence_id. Teams add domain-specific columns as needed (e.g., robot, task, scene, vehicle, prompt, location, operator, success).
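The one hard requirement above (every row has a unique sequence_id) can be checked before writing. The following is a minimal sketch, not part of the format: rows are plain Python dicts standing in for table rows, and check_sequence_ids is a hypothetical helper name.

```python
def check_sequence_ids(rows):
    """Return a list of problems found in the sequence_id column."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        seq_id = row.get("sequence_id")
        # sequence_id is the only required scalar column.
        if not isinstance(seq_id, str) or not seq_id:
            problems.append(f"row {i}: missing sequence_id")
            continue
        # It must also be unique across the dataset.
        if seq_id in seen:
            problems.append(f"row {i}: duplicate sequence_id {seq_id!r}")
        seen.add(seq_id)
    return problems


rows = [
    {"sequence_id": "ep_001", "reward": 42.3, "split": "train"},
    {"sequence_id": "ep_002"},             # other scalar columns are optional
    {"reward": 1.0},                       # invalid: no sequence_id
    {"sequence_id": "ep_001", "split": "val"},  # invalid: duplicate
]
problems = check_sequence_ids(rows)
```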

Temporal Sequence Columns

Each cell contains a full time-indexed signal for that sequence. The Parquet type is list<list<float>> where the outer list is time and the inner list is the signal dimensions. T is the number of timesteps (can vary per sequence).

action           : list<list<float>>   — [T, action_dim]
obs_state        : list<list<float>>   — [T, obs_dim]
camera_pose      : list<list<float>>   — [T, 7] (position + quaternion)
steering         : list<list<float>>   — [T, 1]
latent_state     : list<list<float>>   — [T, latent_dim]
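A nested list type does not by itself force every timestep to have the same number of values, so a writer may want to verify the [T, dim] shape of each cell. A minimal sketch, assuming cells are held as plain Python lists of lists (sequence_shape is a hypothetical helper, not part of the format):

```python
def sequence_shape(cell):
    """Return (T, dim) for a list-of-lists cell; raise ValueError if it is ragged."""
    T = len(cell)
    if T == 0:
        return (0, 0)
    dim = len(cell[0])
    for t, step in enumerate(cell):
        if len(step) != dim:
            raise ValueError(f"timestep {t} has {len(step)} values, expected {dim}")
    return (T, dim)


# Two sequences with the same action_dim but different lengths T.
action_a = [[0.0] * 12 for _ in range(200)]
action_b = [[0.0] * 12 for _ in range(150)]
shapes = [sequence_shape(action_a), sequence_shape(action_b)]
```

T differs between the two cells, which is allowed; only the inner dimension is expected to be constant within a cell.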

File Reference Columns

String columns containing paths or URLs pointing to external files. Each file corresponds to exactly one sequence.

video_cam_high   : string              — path/URL to MP4
video_cam_wrist  : string              — path/URL to MP4
audio            : string              — path/URL to audio file
mcap             : string              — path/URL to MCAP file

Paths with a scheme (s3://, gs://, https://) are used as-is; paths without a scheme are resolved relative to the dataset root.
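The resolution rule can be sketched as follows. This is an illustration under assumptions: resolve_reference is a hypothetical helper, a scheme is detected by the presence of "://", and the handling of absolute local paths (not covered by the rule above) is left to posixpath.join, which returns them unchanged.

```python
import posixpath


def resolve_reference(ref, dataset_root):
    """Resolve a file reference cell against the dataset root."""
    # References with a scheme (s3://, gs://, https://) are used as-is.
    if "://" in ref:
        return ref
    # Everything else is treated as relative to the dataset root.
    return posixpath.join(dataset_root.rstrip("/"), ref)


local = resolve_reference("files/cam_high/ep_00001.mp4", "/data/my_dataset")
remote_root = resolve_reference("files/cam_high/ep_00001.mp4", "s3://bucket/my_dataset/")
absolute = resolve_reference("https://example.com/clips/1.mp4", "/data/my_dataset")
```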

Dataset-Level Metadata

Metadata describes the dataset format and per-column semantics. The content is the same regardless of storage; only the location differs:

{
  "mixtrain": "1.0",
  "column_types": {
    "action": { "type": "trajectory", "dim": 12 },
    "obs_state": { "type": "trajectory", "dim": 48 },
    "video_cam_high": { "type": "video", "fps": 30, "codec": "h264" },
    "video_cam_wrist": { "type": "video", "fps": 30, "codec": "h264" },
    "audio": { "type": "audio" }
  }
}

The mixtrain key is both the format identifier and version. Its presence marks a dataset as Mixtrain format.

column_types only includes columns that have semantics beyond what the Parquet schema provides (file reference types, temporal sequence types, etc.). Regular scalar columns (reward, environment, split) are omitted.

Common types:

Type        Description                                                                 Common extra fields
trajectory  Time-indexed array of structured signals                                    dim, units, coordinate_frame
video       Video file reference                                                        fps, codec, resolution
image       Image file reference                                                        format, resolution
audio       Audio file reference                                                        sample_rate, channels
3d          3D spatial data (point cloud, mesh, depth map, etc.)                        format, coordinate_frame, units
4d          Temporal 3D data (dynamic point clouds, animated meshes, volumetric video)  format, fps, coordinate_frame
embedding   Vector embedding array                                                      dim, model

The type field is an open string, not restricted to the list above. Any file type or data format that a reader knows how to handle is valid (e.g., mcap, zarr, nifti, rerun, hdf5). Readers should ignore types they don't recognize. Additional fields beyond type are optional and type-specific; readers should ignore unrecognized fields.
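The "ignore what you don't recognize" rule might look like this in a reader. This is a sketch only: KNOWN_TYPES describes one hypothetical reader, split_column_types is an illustrative name, and vendor_tag is a made-up extra field used to show that unknown fields pass through harmlessly.

```python
import json

# Types this hypothetical reader knows how to handle.
KNOWN_TYPES = {"trajectory", "video", "image", "audio"}


def split_column_types(metadata):
    """Split column_types into entries the reader handles and columns it skips."""
    handled, skipped = {}, []
    # column_types is optional, so default to an empty mapping.
    for column, spec in metadata.get("column_types", {}).items():
        if spec.get("type") in KNOWN_TYPES:
            handled[column] = spec   # unrecognized extra fields are simply carried along
        else:
            skipped.append(column)   # unknown type: ignore, do not fail
    return handled, skipped


metadata = json.loads("""
{
  "mixtrain": "1.0",
  "column_types": {
    "action": { "type": "trajectory", "dim": 12 },
    "video_cam_high": { "type": "video", "fps": 30, "codec": "h264", "vendor_tag": "x" },
    "mcap": { "type": "mcap" }
  }
}
""")
handled, skipped = split_column_types(metadata)
```

A skipped column is still a valid string column in the Parquet schema; the reader just does not attach any extra behavior to it.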

Synchronization: Video fps is the temporal alignment key. Frame i of a video column corresponds to index i in any temporal sequence column. If cameras run at different frame rates, each video column declares its own fps and the reader handles alignment.
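The format leaves the alignment policy to the reader. One possible policy, shown here purely as an assumption, is to pick the latest frame at or before each timestep, given that all streams start at time zero and that fps values are integers. aligned_frame is a hypothetical helper.

```python
def aligned_frame(i, ref_fps, cam_fps):
    """Index of the latest frame of a cam_fps video at or before timestep i
    of a signal sampled at ref_fps (both assumed to start at t = 0)."""
    # Timestep i occurs at i / ref_fps seconds; frame j occurs at j / cam_fps seconds.
    # Integer arithmetic avoids floating-point rounding at exact boundaries.
    return (i * cam_fps) // ref_fps


same_rate = aligned_frame(10, ref_fps=30, cam_fps=30)   # identity mapping
faster_cam = aligned_frame(10, ref_fps=30, cam_fps=60)  # camera has two frames per timestep
slower_cam = aligned_frame(11, ref_fps=30, cam_fps=15)  # two timesteps share one frame
```

Other policies (nearest frame, interpolation of the signal) are equally compatible with the metadata; the fps values only provide what is needed to compute timestamps.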

Only mixtrain (version) is required. column_types is included when columns have semantics beyond their Parquet type.

Storage Layout

Standalone Parquet (example):

{dataset_root}/
├── data/
│   ├── part-00000.parquet             # Sequence rows (scalar + array columns)
│   ├── part-00001.parquet
│   └── ...
├── files/                              # Referenced files (optional convention)
│   ├── cam_high/
│   │   ├── ep_00001.mp4
│   │   └── ...
│   └── cam_wrist/
│       └── ...
└── mixtrain.json                      # Dataset metadata (standalone only)

Referenced files can live anywhere: co-located under the dataset root (relative paths) or in external storage (absolute URLs). The format doesn't prescribe a directory structure.
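Discovering a standalone dataset from this layout could look like the sketch below. It assumes the example layout above (mixtrain.json at the root, Parquet parts under data/); open_standalone is a hypothetical helper, and the demo creates empty placeholder files rather than real Parquet.

```python
import json
import tempfile
from pathlib import Path


def open_standalone(root):
    """Return (format_version, sorted part file names) for a standalone dataset."""
    root = Path(root)
    meta = json.loads((root / "mixtrain.json").read_text())
    # The presence of the "mixtrain" key marks the dataset as Mixtrain format.
    if "mixtrain" not in meta:
        raise ValueError("mixtrain.json has no 'mixtrain' key")
    parts = sorted(p.name for p in (root / "data").glob("*.parquet"))
    return meta["mixtrain"], parts


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    for name in ("part-00001.parquet", "part-00000.parquet"):
        (root / "data" / name).touch()  # empty placeholders, not real Parquet
    (root / "mixtrain.json").write_text(json.dumps({"mixtrain": "1.0"}))
    version, parts = open_standalone(root)
```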

With a data lake: The data/ directory is managed by the lake (Iceberg manifests, Delta transaction log, etc.). File reference columns point to wherever the files live.

Data Lake Integrations

Mixtrain leverages open data lake formats to add features like schema evolution, time travel, concurrent writes, and partition pruning.

Apache Iceberg

Iceberg is an open table format for large analytic datasets. When Mixtrain data files are managed by Iceberg:

Delta Lake

Delta Lake is an open storage framework with ACID transactions, originally created by Databricks.

When Mixtrain data files are managed by Delta Lake:

Delta Lake does not have an equivalent to Puffin files for custom indexes. The sequence offset index would need to be managed as a separate Delta table or sidecar file.

DuckLake

DuckLake is a lightweight, embedded data lake built on DuckDB. No external metastore required.

When Mixtrain data files are managed by DuckLake:

Tradeoffs & Limitations

Per-sequence rows vs per-timestep rows. Storing one row per sequence makes browsing, filtering, and comparing sequences natural, but means sequences have variable lengths within a batch, frame-level random access requires reading full sequences first, and you can't stream-append timesteps to an in-progress sequence (buffer the full sequence, then write the row). Per-timestep formats (like LeRobot v3) avoid this but make sequence-level operations awkward. In practice, most training already uses fixed-length windows (observation horizon + prediction horizon) sliced from sequences, so the slicing logic lives in the dataloader either way.
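The window slicing mentioned above is short to write in a dataloader. A minimal sketch, assuming a sequence is a Python list of timesteps; slice_windows and the stride parameter are illustrative names, not part of the format.

```python
def slice_windows(sequence, obs_horizon, pred_horizon, stride=1):
    """Cut a variable-length sequence into fixed-length (observation, prediction) pairs."""
    window = obs_horizon + pred_horizon
    pairs = []
    # Sequences shorter than one window produce an empty range and no pairs.
    for start in range(0, len(sequence) - window + 1, stride):
        obs = sequence[start:start + obs_horizon]
        pred = sequence[start + obs_horizon:start + window]
        pairs.append((obs, pred))
    return pairs


# A 10-step sequence with 2 values per timestep.
sequence = [[float(t), float(t) * 2] for t in range(10)]
windows = slice_windows(sequence, obs_horizon=2, pred_horizon=3, stride=5)
```

Because every window has the same length, windows from sequences of different T can be batched together without padding.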

Array columns are opaque to query engines. Parquet predicate pushdown, bloom filters, and statistics work on scalar columns but not on array contents. Query engines like DuckDB can still access array elements via functions (e.g., list_extract), but this requires a full column scan. For typical dataset sizes (tens of thousands of sequences with hundreds of timesteps each), this is fast. For frequently filtered properties at larger scale, extract them as scalar columns.
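Extracting a frequently filtered property as a scalar column can be done once at write time. The sketch below uses plain dicts in place of table rows; max_abs_action is a hypothetical derived column chosen only for illustration.

```python
def max_abs(cell):
    """Largest absolute value anywhere in a [T, dim] list-of-lists cell."""
    return max((abs(v) for step in cell for v in step), default=0.0)


rows = [
    {"sequence_id": "ep_001", "action": [[0.1, -0.5], [0.2, 0.3]]},
    {"sequence_id": "ep_002", "action": [[1.5, 0.0], [-2.0, 0.4]]},
]

# Compute the summary once and store it alongside the array column.
for row in rows:
    row["max_abs_action"] = max_abs(row["action"])

# Later filters touch only the scalar column, never the arrays.
selected = [r["sequence_id"] for r in rows if r["max_abs_action"] > 1.0]
```

Stored as a real Parquet column, such a scalar benefits from column statistics and predicate pushdown, which the array contents do not.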

External file references vs embedded blobs. Parquet doesn't efficiently store large binary data, so video, point clouds, and other large files are stored externally and referenced by path/URL. This means no transactional consistency between the table and referenced files: orphaned files need cleanup, and moving a dataset means moving both. Write the file first, then insert the row, to avoid dangling references. Embedding large files inline (as Lance supports) eliminates this but introduces other tradeoffs: row sizes become highly variable, columnar scan performance degrades when rows contain multi-megabyte blobs, and files can't be accessed independently of the table format. Iceberg's File Format API may offer the best of both worlds in the future (see Appendix).
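The write ordering and the orphan cleanup it implies can be sketched as follows. Everything here is illustrative: a Python list stands in for the table, add_sequence and find_orphans are hypothetical helpers, and the files/cam_high/ path follows the optional layout convention.

```python
import tempfile
from pathlib import Path


def add_sequence(root, table, sequence_id, video_bytes):
    """Write the referenced file first, then insert the row that points at it."""
    rel = f"files/cam_high/{sequence_id}.mp4"
    path = Path(root) / rel
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(video_bytes)  # step 1: the file exists before any row mentions it
    table.append({"sequence_id": sequence_id, "video_cam_high": rel})  # step 2: the row


def find_orphans(root, table, column="video_cam_high"):
    """List files under files/ that no row references (relative paths only)."""
    referenced = {row[column] for row in table}
    files_dir = Path(root) / "files"
    on_disk = {p.relative_to(root).as_posix() for p in files_dir.rglob("*") if p.is_file()}
    return sorted(on_disk - referenced)


with tempfile.TemporaryDirectory() as root:
    table = []
    add_sequence(root, table, "ep_00001", b"fake video bytes")
    # Simulate a crash between step 1 and step 2: a file with no row.
    stray = Path(root) / "files" / "cam_high" / "ep_00002.mp4"
    stray.write_bytes(b"no row references this file")
    orphans = find_orphans(root, table)
```

With this ordering a failure leaves at worst an orphaned file, which a periodic sweep can remove, rather than a row pointing at a file that does not exist.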

Compressed video vs individual frames. Storing video as compressed MP4 (H.264) gives 20-50x storage savings over individual frames, but requires sequential decoding; seeking to an arbitrary frame means decoding from the nearest keyframe. Storing individual frames (as JPEG/PNG in a column or as separate files) gives instant random access at the cost of much larger storage. The format supports either approach.
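The cost of a random seek into compressed video can be estimated with simple arithmetic. This sketch assumes a simplified model that the format does not mandate: a keyframe at frame 0 and then at a fixed interval, with every frame depending only on frames since the previous keyframe.

```python
def frames_to_decode(target_frame, keyframe_interval):
    """Frames a decoder must process to display target_frame, under the
    simplified fixed-interval keyframe model described above."""
    # Decoding starts at the nearest keyframe at or before the target.
    return target_frame % keyframe_interval + 1


on_keyframe = frames_to_decode(30, keyframe_interval=30)     # target is itself a keyframe
worst_case = frames_to_decode(29, keyframe_interval=30)      # last frame before the next keyframe
mid_group = frames_to_decode(45, keyframe_interval=30)       # 15 frames after a keyframe
```

A shorter keyframe interval reduces the worst-case seek cost but also reduces the compression benefit, so the choice depends on how often the workload seeks.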

Implementation Notes

Appendix: Pluggable File Formats

The Mixtrain format is defined at the schema convention level, not the file format level, so the underlying storage can evolve. Apache Iceberg is working on making the File Format API extensible, which would allow alternative file formats like Lance (optimized for random access and blob storage) and Vortex (optimized for scan performance and adaptive encoding) to be used as drop-in replacements for Parquet. The migration would be transparent: no schema or application changes would be required.

References