Mixtrain provides dataset management with rich multimodal support. View images, videos, audio, 3D models, and more directly in the UI. Explore data with embedding visualizations and curate using SQL queries.

Quick Start

from mixtrain import Dataset

# Load existing dataset
dataset = Dataset("training-data")
df = dataset.to_pandas()
print(f"Rows: {dataset.row_count}")

# Iterate without loading full dataset
for row in dataset:
    print(row)

# Create from any source
ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})
ds = Dataset.from_huggingface("imdb", split="train")

Creating Datasets

From files

Upload data from local files:

from mixtrain import Dataset

dataset = Dataset.from_file(
    name="training-data",
    file_path="data.parquet",
    description="Training dataset"
)

Supported formats: .parquet, .csv, .tsv

From in-memory data

Create datasets from various Python sources:

from mixtrain import Dataset

# From Python dict
ds = Dataset.from_dict({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# From pandas DataFrame
ds = Dataset.from_pandas(df)

# From Arrow table
ds = Dataset.from_arrow(table)

# From HuggingFace datasets
ds = Dataset.from_huggingface("imdb", split="train")

# From PyTorch dataset
ds = Dataset.from_torch(torch_dataset)

# Save to platform
ds.save("my-dataset", description="My dataset")

Column Types

Column types control rich rendering in the UI (images, video players, audio waveforms, etc.).

Auto-detection (default) — When you call save(), types are automatically inferred from data content by inspecting URLs, file extensions, and value patterns:

from mixtrain import Dataset

# image_url detected as Image, video_url as Video, etc.
ds = Dataset.from_file("data.csv")
ds.save("my-data")
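Inference of this kind can be pictured as matching file extensions in a column's values. A minimal, hypothetical sketch of the idea (this is not mixtrain's actual implementation; `infer_column_type` and the extension sets are illustrative, with type names mirroring the supported types):

```python
# Hypothetical sketch of extension-based column-type inference.
# Not mixtrain's actual code; type names mirror the supported types.

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}
VIDEO_EXTS = {".mp4", ".webm", ".mov"}
AUDIO_EXTS = {".mp3", ".wav", ".flac", ".ogg"}

def infer_column_type(values):
    """Guess a rich type for a column by inspecting its string values."""
    exts = set()
    for v in values:
        if not isinstance(v, str):
            return None  # non-string columns keep their plain type
        dot = v.rfind(".")
        if dot == -1:
            return None
        exts.add(v[dot:].lower())
    if not exts:
        return None
    if exts <= IMAGE_EXTS:
        return "Image"
    if exts <= VIDEO_EXTS:
        return "Video"
    if exts <= AUDIO_EXTS:
        return "Audio"
    return None

print(infer_column_type(["a/cat.jpg", "b/dog.PNG"]))  # Image
print(infer_column_type(["clip.mp4"]))                # Video
```

Explicit `column_types` overrides exist precisely for the cases such heuristics get wrong.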

Explicit overrides — Pass a dict to override specific columns. Auto-detection still runs for the rest:

from mixtrain import Dataset, Image, Video, Audio, Embedding

ds = Dataset.from_file("data.csv")
ds.save("my-data", column_types={
    "photo": Image,        # explicit override
    "sound": Audio,        # explicit override
    # other columns still auto-detected
})

Disable auto-detection — Pass None to skip type inference entirely:

ds.save("my-data", column_types=None)

Supported types: Image, Video, Audio, Model3D, Text, Markdown, JSON, Embedding, MCAP, Rerun

Viewing Multimodal Data

The web UI renders multimodal content directly:

  • Images - Thumbnails with full-size preview
  • Videos - Inline playback with controls
  • Audio - Waveform visualization with playback
  • 3D Models - Interactive 3D viewer
  • Embeddings - Dimensionality-reduced visualizations

Browse datasets in the web UI to view and explore your data visually.

Iterating Over Data

Row-by-row

Stream rows without loading the full dataset:

for row in dataset:
    print(row["text"], row["label"])

Batched

Get batches as columnar dicts:

for batch in dataset.to_batches(size=32):
    texts = batch["text"]  # List of 32 texts
    labels = batch["label"]  # List of 32 labels
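Columnar batching amounts to slicing every column list in parallel. A hypothetical sketch of those semantics (not mixtrain's implementation; `to_batches` here is a stand-in that mimics the columnar-dict shape described above):

```python
def to_batches(columns, size):
    """Yield columnar dict batches of at most `size` rows.

    `columns` maps column name -> list of values; each yielded batch
    keeps the same keys with the values sliced to the batch window.
    """
    n = len(next(iter(columns.values())))
    for start in range(0, n, size):
        yield {name: values[start:start + size]
               for name, values in columns.items()}

data = {"text": ["a", "b", "c", "d", "e"], "label": [0, 1, 0, 1, 0]}
batches = list(to_batches(data, size=2))
print(len(batches))          # 3
print(batches[0]["text"])    # ['a', 'b']
print(batches[-1]["label"])  # [0]
```

Note the last batch is smaller when the row count is not a multiple of the batch size.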

PyTorch Integration

DataLoader

Get a PyTorch DataLoader with zero-copy tensor conversion:

# Unbatched
loader = dataset.to_torch()
for row in loader:
    print(row)

# Batched with tensors
loader = dataset.to_torch(batch_size=32)
for batch in loader:
    features = batch["features"]  # torch.Tensor
    labels = batch["labels"]      # torch.Tensor

Direct tensor conversion

tensors = dataset.to_tensors()
print(tensors["label"])  # tensor([0, 1, 0, 1, ...])

Transformations

All transformations return new datasets (immutable):

ds = Dataset("training-data")

# Shuffle and sample
shuffled = ds.shuffle(seed=42)
sample = ds.sample(100, seed=42)

# Filter and map
positive = ds.filter(lambda x: x["label"] == 1)
with_length = ds.map(lambda x: {**x, "text_len": len(x["text"])})

# Select columns and rows
subset = ds.cols(["text", "label"]).head(100)

# Chain operations
processed = ds.shuffle(42).filter(lambda x: x["score"] > 0.8).head(1000)
processed.save("processed-data")

SQL Queries

Single dataset

# Filter with SQL
filtered = dataset.query("SELECT * FROM data WHERE score > 0.8")

# Aggregations
stats = dataset.query("SELECT label, COUNT(*) as cnt FROM data GROUP BY label")

Multiple datasets

result = Dataset.query_multiple({
    "users": Dataset("users"),
    "orders": Dataset("orders"),
}, "SELECT * FROM users u JOIN orders o ON u.id = o.user_id")
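The query follows standard SQL join semantics, so its behavior can be reproduced with Python's stdlib `sqlite3` for illustration (table and column names are taken from the example above; the sample rows are made up):

```python
import sqlite3

# In-memory tables mirroring the "users" and "orders" datasets.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (user_id INTEGER, item TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "book"), (1, "pen")])

# Same JOIN shape as the query_multiple example.
rows = conn.execute(
    "SELECT u.name, o.item FROM users u "
    "JOIN orders o ON u.id = o.user_id ORDER BY o.item"
).fetchall()
print(rows)  # [('ada', 'book'), ('ada', 'pen')]
```

As with any inner join, users without orders ("bob" here) are dropped from the result.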

Joining Datasets

users = Dataset("users")
orders = Dataset("orders")

# Inner join
joined = users.join(orders, keys="user_id")

# Left outer join
joined = users.join(orders, keys="user_id", join_type="left outer")

# Different key names
joined = users.join(orders, keys="id", right_keys="user_id")

Train/Test Splits

splits = dataset.train_test_split(test_size=0.2, seed=42)

train_loader = splits["train"].to_torch(batch_size=32)
test_loader = splits["test"].to_torch(batch_size=32)
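Conceptually, the split is a seeded shuffle followed by a cut at the test fraction. A hypothetical sketch of those semantics (not mixtrain's implementation; the function name is reused only to mirror the API shape):

```python
import random

def train_test_split(rows, test_size, seed):
    """Shuffle `rows` deterministically with `seed`, then cut off the
    last `test_size` fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    cut = len(shuffled) - n_test
    return {"train": shuffled[:cut], "test": shuffled[cut:]}

splits = train_test_split(list(range(10)), test_size=0.2, seed=42)
print(len(splits["train"]), len(splits["test"]))  # 8 2
```

Fixing the seed makes the split reproducible across runs, which is what makes the train/validation comparison above meaningful.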

Embedding Visualization

Datasets with embedding columns can be visualized for exploration:

  • 2D/3D scatter plots - UMAP/t-SNE projections
  • Clustering - Discover data patterns
  • Similarity search - Find similar items
  • Outlier detection - Identify anomalies

Use embedding visualizations to understand your data distribution and identify curation opportunities.
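Similarity search over an embedding column reduces to nearest-neighbour lookup, typically by cosine similarity. A stdlib-only sketch of that idea (hypothetical helpers, not mixtrain's implementation; real embeddings would be much higher-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, embeddings):
    """Index of the embedding closest to `query` by cosine similarity."""
    return max(range(len(embeddings)), key=lambda i: cosine(query, embeddings[i]))

embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(most_similar([0.9, 0.1], embeddings))  # 0
```

The same distance underlies the clustering and outlier views: nearby points cluster, and points far from every neighbour surface as anomalies.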

Using Datasets in Training

from mixtrain import Dataset

# Load and prepare data
ds = Dataset("training-data")
splits = ds.shuffle(42).train_test_split(test_size=0.2)

# Get PyTorch DataLoaders
train_loader = splits["train"].to_torch(batch_size=32)
val_loader = splits["test"].to_torch(batch_size=32)

# Training loop
for batch in train_loader:
    inputs = batch["features"]  # tensor
    labels = batch["labels"]    # tensor
    # ... training step

Using Datasets in Evaluations

from mixtrain import Model, Dataset

model = Model("flux-pro")
dataset = Dataset("test-prompts")

# Run model on each row
for row in dataset:
    result = model.run({"prompt": row["prompt"]})
    print(result.image.url)
