Mixtrain Docs

Store and query large-scale datasets using the Apache Iceberg table format.

Setup

1. Add Provider

mixtrain provider add apache_iceberg

Or via SDK:

from mixtrain import MixClient

client = MixClient()
client.create_dataset_provider(
    provider_type="apache_iceberg",
    secrets={
        # Configure your cloud storage credentials
    }
)

Creating Datasets

from mixtrain import Dataset

# Upload a Parquet file
dataset = Dataset.create_from_file(
    name="training-data",
    file_path="data.parquet",
    description="Training dataset"
)

Querying Datasets

from mixtrain import Dataset

dataset = Dataset("training-data")

# Convert to pandas DataFrame
df = dataset.to_pandas()
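
Once the data is a DataFrame, ordinary pandas operations apply. The sketch below uses a hand-built DataFrame as a stand-in for the result of dataset.to_pandas(), so it runs without a Mixtrain account; the id and score columns are assumptions for illustration.

```python
import pandas as pd

# Stand-in for `dataset.to_pandas()`; the columns are hypothetical.
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "score": [0.95, 0.42, 0.91, 0.88],
    }
)

# Keep only rows whose score exceeds 0.9.
high = df[df["score"] > 0.9]
print(high)

# Basic summary statistics for the score column.
print(df["score"].describe())
```

A DataFrame is held in memory, so for very large datasets it may be preferable to narrow the data first, for example with a CLI query, before converting.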

Features

  • Time Travel - Query a table as it existed at an earlier snapshot
  • Schema Evolution - Add, rename, or drop columns without rewriting existing data files
  • ACID Transactions - Writes commit atomically, so readers never see a partially written table
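
To make the snapshot idea behind time travel and atomic commits concrete, here is a toy in-memory model. It is a conceptual sketch only: ToyTable and its methods are hypothetical names, and this is neither how Iceberg is implemented nor a Mixtrain API. Each commit publishes a new immutable snapshot, and a reader can ask for any earlier one.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy append-only snapshot log (illustration only, not Iceberg)."""

    snapshots: list = field(default_factory=list)

    def commit(self, rows):
        """Publish a new snapshot containing all previous rows plus `rows`."""
        previous = self.snapshots[-1] if self.snapshots else ()
        # Build the complete snapshot first, then publish it in one step,
        # so a reader sees either the old state or the new one.
        new_snapshot = previous + tuple(rows)
        self.snapshots.append(new_snapshot)
        return len(self.snapshots) - 1  # id of the new snapshot

    def read(self, snapshot_id=None):
        """Return rows at `snapshot_id`, or at the latest snapshot if None."""
        if not self.snapshots:
            return []
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return list(self.snapshots[snapshot_id])


table = ToyTable()
first = table.commit([{"id": 1, "score": 0.95}])
second = table.commit([{"id": 2, "score": 0.42}])

print(table.read(first))  # rows as of the first commit
print(table.read())       # rows as of the latest commit
```

Because earlier snapshots are never modified, reading an old snapshot id keeps returning the same rows no matter how many commits follow.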

CLI

# Create a dataset from a Parquet file
mixtrain dataset create my-data data.parquet

# Query
mixtrain dataset query my-data "SELECT * WHERE score > 0.9"

# Show dataset metadata
mixtrain dataset metadata my-data
