Mixtrain Dataset Format v1.0

A dataset format for storing multimodal temporal sequences: synchronized time-indexed data spanning video, sensor signals, actions, and metadata. Each row is one temporal sequence (a robot episode, a video clip, a driving segment, a simulation rollout). The format uses Parquet files with array columns for structured signals, file references for external data, and a lightweight metadata convention. It works locally as standalone Parquet files or cloud-native with open table formats.

Motivation

Working with multimodal data in robotics, video generation, world models, and autonomous driving means dealing with time-synchronized signals (video feeds, sensor readings, action sequences, scalar metrics) that need to stay accessible throughout the entire data lifecycle. Curation and training have fundamentally different access patterns: curation needs point reads, filtering, and browsing of individual sequences; training needs high-throughput sequential reads across the entire dataset. Columnar storage serves both from the same files, because readers load only the columns they need, whether that is scalar metadata for filtering or the full row for training.

Robotics

sequence_id	robot_id	task	reward	action	obs_state	video_cam_high
ep_001	go1	pick-n-place	42.3	float[200][12]	float[200][48]	s3://.../1.mp4

World Models

sequence_id	environment	video_input	video_output	action	latent_state
sim_001	kitchen_v2	s3://.../mp4	s3://.../mp4	float[90][6]	float[90][256]

Vision-Language Models

sequence_id	instruction	response	video	image	bbox
vlm_001	"Describe this video"	"A person is pick…"	s3://.../mp4	s3://.../f.png	float[30][4]

Format Specification

A Mixtrain dataset is a collection of Parquet files where each row represents one temporal sequence. All time-indexed data for a sequence lives in that row, as array columns for structured signals and as file references for external data.

Columns fall into three categories:

Scalar Metadata Columns

Standard Parquet/Arrow primitive types. Used for filtering, aggregation, and display.

sequence_id      : string              — unique, required
duration_seconds : float
reward           : float
length           : int                 — number of timesteps
source           : string              — "real", "sim", "augmented"
split            : string              — "train", "val", "test"
label            : string
created_at       : timestamp

All scalar columns are optional except sequence_id. Teams add domain-specific columns as needed (e.g., robot, task, scene, vehicle, prompt, location, operator, success).

Temporal Sequence Columns

Each cell contains the full time-indexed signal for that sequence. The Parquet type is list<list<float>>, where the outer list indexes time and the inner list holds the signal dimensions. T is the number of timesteps and can vary per sequence.

action           : list<list<float>>   — [T, action_dim]
obs_state        : list<list<float>>   — [T, obs_dim]
camera_pose      : list<list<float>>   — [T, 7] (position + quaternion)
steering         : list<list<float>>   — [T, 1]
latent_state     : list<list<float>>   — [T, latent_dim]
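
As an illustration of the [T, dim] convention, the sketch below checks that a cell is rectangular once it has been read back as nested Python lists. The helper name validate_trajectory is hypothetical and not part of the format; how the cell is read from Parquet is left out.

```python
from typing import List, Tuple


def validate_trajectory(cell: List[List[float]], dim: int) -> Tuple[int, int]:
    """Check that a temporal sequence cell has shape [T, dim]; return (T, dim).

    Hypothetical helper: `cell` is one value of a list<list<float>> column,
    already converted to nested Python lists.
    """
    for t, step in enumerate(cell):
        if len(step) != dim:
            raise ValueError(f"timestep {t} has {len(step)} values, expected {dim}")
    return len(cell), dim


# T may differ between sequences, but the inner dimension stays fixed.
short_episode = [[0.0] * 12 for _ in range(3)]
long_episode = [[0.0] * 12 for _ in range(200)]
print(validate_trajectory(short_episode, dim=12))  # (3, 12)
print(validate_trajectory(long_episode, dim=12))   # (200, 12)
```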

File Reference Columns

String columns containing paths or URLs pointing to external files. Each file corresponds to exactly one sequence.

video_cam_high   : string              — path/URL to MP4
video_cam_wrist  : string              — path/URL to MP4
audio            : string              — path/URL to audio file
mcap             : string              — path/URL to MCAP file

Paths with a scheme (s3://, gs://, https://) are used as-is; paths without a scheme are resolved relative to the dataset root.
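
The resolution rule can be sketched in a few lines. The function name resolve_reference and the example paths are illustrative assumptions; the sketch treats any string containing "://" as having a scheme and joins everything else onto the dataset root.

```python
import posixpath


def resolve_reference(ref: str, dataset_root: str) -> str:
    """Resolve one file reference cell against the dataset root.

    Hypothetical helper. References with a scheme are returned unchanged;
    everything else is joined onto the root with forward slashes, which
    also works when the root itself is a URL such as s3://bucket/dataset.
    """
    if "://" in ref:
        return ref
    return posixpath.join(dataset_root.rstrip("/"), ref)


print(resolve_reference("files/cam_high/ep_00001.mp4", "/data/my_dataset"))
print(resolve_reference("s3://bucket/videos/ep_00001.mp4", "/data/my_dataset"))
```

The format text does not say how a scheme-less path with a leading slash should be treated; posixpath.join returns such a path unchanged, which is a choice of this sketch rather than of the format.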

Dataset-Level Metadata

Metadata describes the dataset format and per-column semantics. The content is the same regardless of storage; only the location differs:

Data lake (Iceberg, Delta, DuckLake): stored as a single table property mixtrain containing a JSON string
Standalone Parquet: stored as a mixtrain.json sidecar file alongside the data

{
  "mixtrain": "1.0",
  "column_types": {
    "action": { "type": "trajectory", "dim": 12 },
    "obs_state": { "type": "trajectory", "dim": 48 },
    "video_cam_high": { "type": "video", "fps": 30, "codec": "h264" },
    "video_cam_wrist": { "type": "video", "fps": 30, "codec": "h264" },
    "audio": { "type": "audio" }
  }
}

The mixtrain key is both the format identifier and version. Its presence marks a dataset as Mixtrain format.

column_types only includes columns that have semantics beyond what the Parquet schema provides (file reference types, temporal sequence types, etc.). Regular scalar columns (reward, environment, split) are omitted.

Common types:

Type	Description	Common extra fields
`trajectory`	Time-indexed array of structured signals	`dim`, `units`, `coordinate_frame`
`video`	Video file reference	`fps`, `codec`, `resolution`
`image`	Image file reference	`format`, `resolution`
`audio`	Audio file reference	`sample_rate`, `channels`
`3d`	3D spatial data (point cloud, mesh, depth map, etc.)	`format`, `coordinate_frame`, `units`
`4d`	Temporal 3D data (dynamic point clouds, animated meshes, volumetric video)	`format`, `fps`, `coordinate_frame`
`embedding`	Vector embedding array	`dim`, `model`

The type field is an open string, not restricted to the list above. Any file type or data format that a reader knows how to handle is valid (e.g., mcap, zarr, nifti, rerun, hdf5). Readers should ignore types they don't recognize. Additional fields beyond type are optional and type-specific; readers should ignore unrecognized fields.
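
A reader following these rules might look like the minimal sketch below. It takes the JSON text from either location (the sidecar file or the table property) and keeps only the column types it knows. The function name parse_mixtrain_metadata and the KNOWN_TYPES set are assumptions for illustration; a real reader would choose its own set of supported types.

```python
import json

# Types this particular (hypothetical) reader knows how to handle.
KNOWN_TYPES = {"trajectory", "video", "image", "audio", "3d", "4d", "embedding"}


def parse_mixtrain_metadata(text: str) -> dict:
    """Parse Mixtrain metadata JSON and keep only recognized column types."""
    meta = json.loads(text)
    if "mixtrain" not in meta:
        raise ValueError("missing 'mixtrain' key: not a Mixtrain dataset")
    columns = {}
    # column_types is optional, so fall back to an empty mapping.
    for name, spec in meta.get("column_types", {}).items():
        if spec.get("type") in KNOWN_TYPES:
            columns[name] = spec  # extra fields are kept but never required
        # Unrecognized types are skipped rather than treated as errors.
    return {"version": meta["mixtrain"], "columns": columns}


example = """
{
  "mixtrain": "1.0",
  "column_types": {
    "action": { "type": "trajectory", "dim": 12 },
    "video_cam_high": { "type": "video", "fps": 30, "codec": "h264" },
    "mcap": { "type": "mcap" }
  }
}
"""
parsed = parse_mixtrain_metadata(example)
print(parsed["version"], sorted(parsed["columns"]))  # 1.0 ['action', 'video_cam_high']
```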

Synchronization: Video fps is the temporal alignment key. Frame i of a video column corresponds to index i in any temporal sequence column. If cameras run at different frame rates, each video column declares its own fps and the reader handles alignment.
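
The format leaves the alignment strategy to the reader. One possible approach, sketched here under the assumption that both streams start at the same instant and run at constant frame rates, is to map a frame index to the nearest index in the other stream by timestamp. The function name align_frame_index is hypothetical.

```python
import math


def align_frame_index(i: int, src_fps: float, dst_fps: float) -> int:
    """Map frame i at src_fps to the nearest frame index at dst_fps.

    Assumption (not mandated by the format): both streams share a start
    time and have constant frame rates. Ties round up.
    """
    return math.floor(i * dst_fps / src_fps + 0.5)


# One second into a 30 fps stream lands on frame 60 of a 60 fps stream.
print(align_frame_index(30, src_fps=30, dst_fps=60))  # 60
# Frame 10 of a 60 fps stream lands on frame 5 of a 30 fps stream.
print(align_frame_index(10, src_fps=60, dst_fps=30))  # 5
```

The returned index can fall past the end of the shorter stream, so a caller would still need to clamp it to the destination length.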

Only mixtrain (version) is required. column_types is included when columns have semantics beyond their Parquet type.

Storage Layout

Standalone Parquet (example):

{dataset_root}/
├── data/
│   ├── part-00000.parquet             # Sequence rows (scalar + array columns)
│   ├── part-00001.parquet
│   └── ...
├── files/                              # Referenced files (optional convention)
│   ├── cam_high/
│   │   ├── ep_00001.mp4
│   │   └── ...
│   └── cam_wrist/
│       └── ...
└── mixtrain.json                      # Dataset metadata (standalone only)

Referenced files can live anywhere: co-located under the dataset root (relative paths) or in external storage (absolute URLs). The format doesn't prescribe a directory structure.

With a data lake: The data/ directory is managed by the lake (Iceberg manifests, Delta transaction log, etc.). File reference columns point to wherever the files live.

Data Lake Integrations

Mixtrain leverages open data lake formats to add features like schema evolution, time travel, concurrent writes, and partition pruning.

Apache Iceberg

Iceberg is an open table format for large analytic datasets. When Mixtrain data files are managed by Iceberg:

Schema evolution: Add new columns without rewriting existing data; old sequences get NULL for new columns
Time travel & versioning: Every write creates an immutable snapshot for rollback and reproducibility
Partition pruning: Partition by scalar columns (e.g., environment, task, split) for efficient subset loading
Hidden partitioning: Queries don't need to know the partition scheme; Iceberg prunes transparently
Concurrent writes: Multiple processes (data collection, eval workflows, human labeling) write with ACID guarantees
Interoperability: The same table is readable by DuckDB, Spark, Trino/Presto, PyArrow, and Polars
Pluggable file formats: Iceberg is working on making the File Format API extensible, enabling Lance and Vortex as alternatives to Parquet (see Appendix)
Puffin-based sequence offset index: Iceberg's Puffin files can store custom blobs for sequence-level byte offsets, enabling O(1) random access from cloud storage

Delta Lake

Delta Lake is an open storage framework with ACID transactions, originally created by Databricks.

When Mixtrain data files are managed by Delta Lake:

Versioning: Delta's transaction log provides time travel and audit history
ACID writes: Concurrent readers and writers with optimistic concurrency
Schema evolution: Add/rename columns with Delta's schema evolution support
Spark ecosystem: Native integration with Databricks and Apache Spark
Unity Catalog: Discoverability and governance in Databricks environments

Delta Lake does not have an equivalent to Puffin files for custom indexes. The sequence offset index would need to be managed as a separate Delta table or sidecar file.

DuckLake

DuckLake is a lightweight, embedded data lake built on DuckDB. It requires no external metastore.

When Mixtrain data files are managed by DuckLake:

DuckDB-native: Zero-config, embedded; metadata lives in a DuckDB database
Fast local queries: DuckDB's vectorized execution engine for interactive exploration

Tradeoffs & Limitations

Per-sequence rows vs per-timestep rows. Storing one row per sequence makes browsing, filtering, and comparing sequences natural, but means sequences have variable lengths within a batch, frame-level random access requires reading full sequences first, and you can't stream-append timesteps to an in-progress sequence (buffer the full sequence, then write the row). Per-timestep formats (like LeRobot v3) avoid this but make sequence-level operations awkward. In practice, most training already uses fixed-length windows (observation horizon + prediction horizon) sliced from sequences, so the slicing logic lives in the dataloader either way.
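
The window slicing mentioned above might look like the following sketch in a dataloader. The function name slice_windows and its stride parameter are illustrative assumptions; the sequence is one temporal sequence cell as nested lists.

```python
from typing import Iterator, List, Tuple

Trajectory = List[List[float]]


def slice_windows(
    sequence: Trajectory, obs_horizon: int, pred_horizon: int, stride: int = 1
) -> Iterator[Tuple[Trajectory, Trajectory]]:
    """Yield (observation, prediction) windows of fixed length from one sequence.

    Hypothetical helper. Sequences shorter than obs_horizon + pred_horizon
    yield nothing.
    """
    window = obs_horizon + pred_horizon
    for start in range(0, len(sequence) - window + 1, stride):
        obs = sequence[start : start + obs_horizon]
        pred = sequence[start + obs_horizon : start + window]
        yield obs, pred


# A 5-step, 1-dimensional sequence gives three windows of 2 + 1 steps.
seq = [[float(t)] for t in range(5)]
for obs, pred in slice_windows(seq, obs_horizon=2, pred_horizon=1):
    print(obs, pred)
```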

Array columns are opaque to query engines. Parquet predicate pushdown, bloom filters, and statistics work on scalar columns but not on array contents. Query engines like DuckDB can still access array elements via functions (e.g., list_extract), but this requires a full column scan. For typical dataset sizes (tens of thousands of sequences with hundreds of timesteps each), this is fast. For frequently filtered properties at larger scale, extract them as scalar columns.
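
Extracting a frequently filtered property as a scalar column can be as simple as computing it once at write time. In this sketch, length matches the scalar column listed earlier, while max_abs_action is a made-up column name used only for illustration.

```python
from typing import List


def derive_scalars(action: List[List[float]]) -> dict:
    """Summarize an action trajectory as scalar values to store next to it.

    Hypothetical helper: `max_abs_action` is an illustrative column name,
    not one defined by the format.
    """
    return {
        "length": len(action),
        "max_abs_action": max(
            (abs(value) for step in action for value in step), default=0.0
        ),
    }


print(derive_scalars([[0.1, -0.5], [0.3, 0.2]]))  # {'length': 2, 'max_abs_action': 0.5}
```

With the summary stored as a scalar column, predicate pushdown and column statistics apply to it, and the array column is read only for rows that pass the filter.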

External file references vs embedded blobs. Parquet doesn't efficiently store large binary data, so video, point clouds, and other large files are stored externally and referenced by path/URL. This means no transactional consistency between the table and referenced files: orphaned files need cleanup, and moving a dataset means moving both. Write the file first, then insert the row, to avoid dangling references. Embedding large files inline (as Lance supports) eliminates this but introduces other tradeoffs: row sizes become highly variable, columnar scan performance degrades when rows contain multi-megabyte blobs, and files can't be accessed independently of the table format. Iceberg's File Format API may offer the best of both worlds in the future (see Appendix).
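
The write ordering can be sketched for a local dataset root as follows. The function add_sequence is hypothetical, the in-memory rows list stands in for the actual table insert, and the files/cam_high/ layout is the optional convention rather than a requirement.

```python
import os
import tempfile


def add_sequence(dataset_root: str, sequence_id: str, video_bytes: bytes, rows: list) -> str:
    """Write the referenced file first, then record the row that points to it.

    Hypothetical helper: `rows` is a stand-in for the real table insert.
    """
    rel_path = f"files/cam_high/{sequence_id}.mp4"
    abs_path = os.path.join(dataset_root, rel_path)
    os.makedirs(os.path.dirname(abs_path), exist_ok=True)

    # Write to a temporary name and rename, so a partial file is never
    # visible under the final path.
    tmp_path = abs_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(video_bytes)
    os.replace(tmp_path, abs_path)

    # Only now does a row reference the file.
    rows.append({"sequence_id": sequence_id, "video_cam_high": rel_path})
    return rel_path


with tempfile.TemporaryDirectory() as root:
    rows = []
    ref = add_sequence(root, "ep_00001", b"placeholder bytes", rows)
    print(ref, os.path.exists(os.path.join(root, ref)))
```

If the process fails between the two steps, the result is an orphaned file, which is harmless to readers, rather than a row pointing at a missing file.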

Compressed video vs individual frames. Storing video as compressed MP4 (H.264) gives 20-50x storage savings over individual frames, but decoding is sequential: seeking to an arbitrary frame means decoding forward from the nearest preceding keyframe. Storing individual frames (as JPEG/PNG in a column or as separate files) gives instant random access at the cost of much larger storage. The format supports either approach.
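
To give a feel for the seek cost, the sketch below counts how many frames must be decoded to reach a target frame. It assumes a keyframe at frame 0 and at every fixed interval after that, and it ignores frame reordering; real encoders may place keyframes irregularly. The function name and the interval value are illustrative.

```python
def frames_to_decode(target_frame: int, keyframe_interval: int) -> int:
    """Frames decoded to reach target_frame, starting at the preceding keyframe.

    Assumes a keyframe at frame 0 and then every `keyframe_interval` frames,
    and ignores frame reordering (B-frames).
    """
    return target_frame % keyframe_interval + 1


print(frames_to_decode(30, keyframe_interval=30))  # 1: the target is itself a keyframe
print(frames_to_decode(59, keyframe_interval=30))  # 30: worst case for this interval
```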

Implementation Notes

Use fixed-size inner lists (e.g., list<float32>[12]) for temporal sequence columns to enable zero-copy conversion to tensors
For maximum throughput on dense arrays, consider storing each cell as a BYTE_ARRAY (binary) value holding raw float32 bytes, which eliminates Parquet repetition/definition level overhead
Read at shard level (full Parquet files, not individual rows) for remote storage
Shuffle at shard level, then sequence level within cached shards
Cache Parquet shards locally; first epoch downloads, subsequent epochs read from disk
Video decode is typically the training bottleneck; use hardware decoders (NVDEC) when available
For frame-level training, use a shuffle buffer across sequences; for sequence-level training (transformers, diffusion policies), slice fixed-length windows directly from arrays
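
The two-level shuffle from the notes above can be sketched as a generator over (shard, row) pairs. The function name shuffled_order and the use of per-shard row counts as input are assumptions for illustration; downloading and caching the shards is left out.

```python
import random
from typing import Iterator, List, Tuple


def shuffled_order(shard_sizes: List[int], seed: int) -> Iterator[Tuple[int, int]]:
    """Yield (shard_index, row_index) pairs in a two-level shuffled order.

    Hypothetical helper: shards are visited in random order, and rows are
    shuffled within each shard, so every shard is read in one pass.
    """
    rng = random.Random(seed)
    shard_ids = list(range(len(shard_sizes)))
    rng.shuffle(shard_ids)
    for shard_id in shard_ids:
        rows = list(range(shard_sizes[shard_id]))
        rng.shuffle(rows)
        for row in rows:
            yield shard_id, row


# Three shards holding 3, 2, and 4 sequences.
for shard_id, row in shuffled_order([3, 2, 4], seed=0):
    print(shard_id, row)
```

Using a different seed per epoch changes the order while keeping each shard's reads contiguous, which suits a local shard cache.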

Appendix: Pluggable File Formats

The Mixtrain format is defined at the schema convention level, not the file format level. The underlying storage can evolve. Apache Iceberg is working on making the File Format API extensible, which would allow alternative file formats like Lance (optimized for random access and blob storage) and Vortex (optimized for scan performance and adaptive encoding) to be used as drop-in replacements for Parquet. The migration would be transparent, with no schema or application changes required.