Store and query large-scale datasets using the Apache Iceberg table format.
Setup
1. Add Provider
mixtrain provider add apache_iceberg
Or via SDK:
from mixtrain import MixClient
client = MixClient()
client.create_dataset_provider(
    provider_type="apache_iceberg",
    secrets={
        # Configure your cloud storage credentials
    }
)
Creating Datasets
from mixtrain import Dataset
# Upload a Parquet file
dataset = Dataset.create_from_file(
    name="training-data",
    file_path="data.parquet",
    description="Training dataset"
)
Querying Datasets
from mixtrain import Dataset
dataset = Dataset("training-data")
# Convert to pandas DataFrame
df = dataset.to_pandas()
Features
- Time Travel - Query the table as it existed at an earlier snapshot or timestamp
- Schema Evolution - Add, rename, or drop columns without rewriting existing data files
- ACID Transactions - Commits are atomic, so readers never see a partially written change
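To make these three features concrete, here is a toy in-memory model of a snapshot-based table. It is only a sketch of the idea: `ToyTable`, `Snapshot`, and every method on them are hypothetical names invented for this illustration, and none of it is the mixtrain SDK or Iceberg's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table at one commit (hypothetical type)."""
    snapshot_id: int
    columns: tuple
    rows: tuple  # tuple of row dicts


class ToyTable:
    """Hypothetical in-memory table that keeps every committed snapshot."""

    def __init__(self, columns):
        # Snapshot 0 is the empty table.
        self._snapshots = [Snapshot(0, tuple(columns), ())]

    def current(self):
        return self._snapshots[-1]

    def _commit(self, columns, rows):
        # Publishing a snapshot is a single list append, so readers
        # see either the old state or the complete new state.
        snap = Snapshot(self.current().snapshot_id + 1, columns, rows)
        self._snapshots.append(snap)
        return snap.snapshot_id

    def append(self, new_rows):
        head = self.current()
        # Validate every row first; a bad row aborts the whole commit.
        for row in new_rows:
            unknown = set(row) - set(head.columns)
            if unknown:
                raise ValueError(f"unknown columns: {sorted(unknown)}")
        added = tuple(dict(row) for row in new_rows)
        return self._commit(head.columns, head.rows + added)

    def add_column(self, name):
        # Schema evolution: only the column list changes.
        # Existing rows are reused as-is, not rewritten.
        head = self.current()
        return self._commit(head.columns + (name,), head.rows)

    def scan(self, snapshot_id=None):
        # Time travel: read any earlier snapshot by its id.
        if snapshot_id is None:
            snap = self.current()
        else:
            snap = self._snapshots[snapshot_id]
        # Rows written before a column existed read it as None.
        return [{c: row.get(c) for c in snap.columns} for row in snap.rows]


table = ToyTable(["id", "score"])
v1 = table.append([{"id": 1, "score": 0.95}])
v2 = table.add_column("label")
v3 = table.append([{"id": 2, "score": 0.4, "label": "low"}])

print(table.scan(v1))  # the table as of the first commit
print(table.scan())    # the latest snapshot, with the new column
```

Reading `table.scan(v1)` after later commits returns the table exactly as it was at that commit, while rows written before `label` existed simply read it as `None` in the latest snapshot.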
CLI
# Create a dataset from a Parquet file
mixtrain dataset create my-data data.parquet
# Query
mixtrain dataset query my-data "SELECT * WHERE score > 0.9"
# Show dataset metadata
mixtrain dataset metadata my-data
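The `score > 0.9` filter in the query above can also be applied in pandas once a dataset has been converted with `to_pandas()`. The sketch below uses a small hand-built DataFrame as a stand-in for that result. The `id` column and the sample values are assumptions made for illustration; only the `score` column name comes from the query example.

```python
import pandas as pd

# Stand-in for dataset.to_pandas(); the rows here are made up.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "score": [0.95, 0.42, 0.91],
    }
)

# Keep only the rows whose score exceeds the threshold.
high_score = df[df["score"] > 0.9]
print(high_score)
```

A pandas filter runs only after the data is already in memory, so for large datasets it is probably better to filter with `mixtrain dataset query` before converting.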