Mixtrain provides dataset management with rich multimodal support. View images, videos, audio, 3D models, and more directly in the UI. Explore data with embedding visualizations and curate using SQL queries.
## Quick Start

```python
from mixtrain import Dataset

# Load existing dataset
dataset = Dataset("training-data")
df = dataset.to_pandas()
print(f"Rows: {dataset.row_count}")

# Iterate without loading full dataset
for row in dataset:
    print(row)

# Create from any source
ds = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})
ds = Dataset.from_huggingface("imdb", split="train")
```

## Creating Datasets
### From files

Upload data from local files:

```python
from mixtrain import Dataset

dataset = Dataset.from_file(
    name="training-data",
    file_path="data.parquet",
    description="Training dataset"
)
```

Supported formats: `.parquet`, `.csv`, `.tsv`
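If your source data is not yet in one of these formats, you can produce a CSV with the standard library first and then upload it as above. This is just a stdlib sketch; the file name and columns here are illustrative, not part of the SDK:

```python
import csv

# Write row dicts to a CSV file that Dataset.from_file can ingest
rows = [
    {"text": "hello", "label": 0},
    {"text": "world", "label": 1},
]
with open("people.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)
```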
### From in-memory data

Create datasets from various Python sources:

```python
from mixtrain import Dataset

# From Python dict
ds = Dataset.from_dict({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# From pandas DataFrame
ds = Dataset.from_pandas(df)

# From Arrow table
ds = Dataset.from_arrow(table)

# From HuggingFace datasets
ds = Dataset.from_huggingface("imdb", split="train")

# From PyTorch dataset
ds = Dataset.from_torch(torch_dataset)

# Save to platform
ds.save("my-dataset", description="My dataset")
```

## Column Types
Specify column types for rich rendering in the UI:

```python
from mixtrain import Dataset, Image, Video, Audio, Embedding

Dataset.from_file(
    name="my-data",
    file_path="data.csv",
    column_types={
        "image_url": Image,
        "video_url": Video,
        "audio_url": Audio,
        "embedding": Embedding
    }
)
```

Supported types: `Image`, `Video`, `Audio`, `Model3D`, `Text`, `Markdown`, `JSON`, `Embedding`
## Viewing Multimodal Data
The web UI renders multimodal content directly:
- Images - Thumbnails with full-size preview
- Videos - Inline playback with controls
- Audio - Waveform visualization with playback
- 3D Models - Interactive 3D viewer
- Embeddings - Dimensionality-reduced visualizations
Browse datasets in the web UI to view and explore your data visually.
## Iterating Over Data

### Row-by-row

Stream rows without loading the full dataset:

```python
for row in dataset:
    print(row["text"], row["label"])
```

### Batched

Get batches as columnar dicts:

```python
for batch in dataset.to_batches(size=32):
    texts = batch["text"]    # List of 32 texts
    labels = batch["label"]  # List of 32 labels
```
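To make the columnar layout concrete, here is a plain-Python sketch of the shape that a row-to-batch conversion like `to_batches` yields (an illustration of the output format only, not the SDK's implementation):

```python
def to_column_batches(rows, size):
    """Group row dicts into columnar batches: each batch maps a
    column name to a list of up to `size` values."""
    for start in range(0, len(rows), size):
        chunk = rows[start:start + size]
        yield {key: [row[key] for row in chunk] for key in chunk[0]}

rows = [{"text": f"t{i}", "label": i % 2} for i in range(5)]
batches = list(to_column_batches(rows, size=2))
# First batch: {"text": ["t0", "t1"], "label": [0, 1]}
# Last batch holds the 1 leftover row
```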
## PyTorch Integration

### DataLoader

Get a PyTorch DataLoader with zero-copy tensor conversion:

```python
# Unbatched
loader = dataset.to_torch()
for row in loader:
    print(row)

# Batched with tensors
loader = dataset.to_torch(batch_size=32)
for batch in loader:
    features = batch["features"]  # torch.Tensor
    labels = batch["labels"]      # torch.Tensor
```

### Direct tensor conversion

```python
tensors = dataset.to_tensors()
print(tensors["label"])  # tensor([0, 1, 0, 1, ...])
```

## Transformations
All transformations return new datasets (immutable):

```python
ds = Dataset("training-data")

# Shuffle and sample
shuffled = ds.shuffle(seed=42)
sample = ds.sample(100, seed=42)

# Filter and map
positive = ds.filter(lambda x: x["label"] == 1)
with_length = ds.map(lambda x: {**x, "text_len": len(x["text"])})

# Select columns and rows
subset = ds.cols(["text", "label"]).head(100)

# Chain operations
processed = ds.shuffle(42).filter(lambda x: x["score"] > 0.8).head(1000)
processed.save("processed-data")
```

## SQL Queries
### Single dataset

```python
# Filter with SQL
filtered = dataset.query("SELECT * FROM data WHERE score > 0.8")

# Aggregations
stats = dataset.query("SELECT label, COUNT(*) as cnt FROM data GROUP BY label")
```

### Multiple datasets

```python
result = Dataset.query_multiple({
    "users": Dataset("users"),
    "orders": Dataset("orders"),
}, "SELECT * FROM users u JOIN orders o ON u.id = o.user_id")
```
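The semantics of a multi-dataset query match registering each dataset as a table and running ordinary SQL over them. As a self-contained illustration of that join (using the stdlib `sqlite3` module rather than the SDK's engine, with made-up user/order rows):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.execute("CREATE TABLE orders (user_id INTEGER, total REAL)")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])
con.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (1, 3.0)])

rows = con.execute(
    "SELECT u.name, o.total FROM users u JOIN orders o ON u.id = o.user_id"
).fetchall()
# Only "ada" has orders, so both result rows belong to her
```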
## Joining Datasets

```python
users = Dataset("users")
orders = Dataset("orders")

# Inner join
joined = users.join(orders, keys="user_id")

# Left outer join
joined = users.join(orders, keys="user_id", join_type="left outer")

# Different key names
joined = users.join(orders, keys="id", right_keys="user_id")
```

## Train/Test Splits

```python
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_loader = splits["train"].to_torch(batch_size=32)
test_loader = splits["test"].to_torch(batch_size=32)
```
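Conceptually, a seeded split shuffles row indices deterministically and cuts at the test fraction. A minimal plain-Python sketch of that idea (an illustration, not the SDK's implementation):

```python
import random

def train_test_indices(n, test_size=0.2, seed=42):
    """Deterministically split range(n) into disjoint train/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded, so reproducible
    cut = int(n * test_size)
    return {"train": indices[cut:], "test": indices[:cut]}

splits = train_test_indices(100, test_size=0.2, seed=42)
# 80 train indices and 20 test indices, identical on every run
```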
## Embedding Visualization

Datasets with embedding columns can be visualized for exploration:
- Images - Thumbnails with full-size preview
- 2D/3D scatter plots - UMAP/t-SNE projections
- Clustering - Discover data patterns
- Similarity search - Find similar items
- Outlier detection - Identify anomalies

Use embedding visualizations to understand your data distribution and identify curation opportunities.
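For a quick local look at an embedding column before opening the UI, a simple PCA projection to 2D is often enough for a sanity check. A sketch with NumPy (the UI's UMAP/t-SNE views are more sophisticated; the random embeddings here stand in for a real embedding column):

```python
import numpy as np

def pca_2d(embeddings):
    """Project (n, d) embeddings to 2D via PCA (top-2 principal components)."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)           # center each dimension
    # Right singular vectors of the centered data are the principal axes
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T              # (n, 2) coordinates for a scatter plot

rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 16))      # stand-in for dataset embeddings
coords = pca_2d(emb)
```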
## Using Datasets in Training

```python
from mixtrain import Dataset

# Load and prepare data
ds = Dataset("training-data")
splits = ds.shuffle(42).train_test_split(test_size=0.2)

# Get PyTorch DataLoaders
train_loader = splits["train"].to_torch(batch_size=32)
val_loader = splits["test"].to_torch(batch_size=32)

# Training loop
for batch in train_loader:
    inputs = batch["features"]  # tensor
    labels = batch["labels"]    # tensor
    # ... training step
```

## Using Datasets in Evaluations
```python
from mixtrain import Model, Dataset

model = Model("flux-pro")
dataset = Dataset("test-prompts")

# Run model on each row
for row in dataset:
    result = model.run({"prompt": row["prompt"]})
    print(result.image.url)
```

## Next Steps
- Dataset API Reference - Complete SDK documentation
- CLI Reference - Command-line interface