Store and query datasets using the Delta Lake table format, with ACID transactions.
Setup
1. Add Provider
mixtrain provider add delta

Or via SDK:
from mixtrain import MixClient
client = MixClient()
client.create_dataset_provider(
    provider_type="delta",
    secrets={
        # Configure your cloud storage credentials
    }
)

Creating Datasets
from mixtrain import Dataset
dataset = Dataset.create_from_file(
    name="training-data",
    file_path="data.parquet",
    description="Training dataset"
)

Querying Datasets
from mixtrain import Dataset
dataset = Dataset("training-data")
# Convert to pandas DataFrame
df = dataset.to_pandas()

Features
- ACID Transactions - Full transaction support for reliable data operations
- Time Travel - Query data at any point in history
- Schema Evolution - Add or modify columns without rewriting data
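To make the three features above more concrete, here is a minimal, standard-library-only sketch of a versioned commit log. It is a toy model for illustration, not Delta Lake's on-disk format and not part of the mixtrain API: the class `ToyVersionedTable`, its methods, and the `_toy_log` directory are all hypothetical names. It assumes a single writer, so it shows only the "all or nothing" side of a commit, not concurrency control.

```python
import json
import tempfile
from pathlib import Path


class ToyVersionedTable:
    """Hypothetical toy table backed by an append-only log of JSON commits."""

    def __init__(self, root):
        self.log_dir = Path(root) / "_toy_log"
        self.log_dir.mkdir(parents=True, exist_ok=True)

    def latest_version(self):
        # Only fully published commits end in .json; temp files are ignored.
        versions = [int(p.stem) for p in self.log_dir.glob("*.json")]
        return max(versions) if versions else -1

    def commit(self, rows):
        """Publish a batch of rows as the next version (single writer assumed)."""
        version = self.latest_version() + 1
        final = self.log_dir / f"{version:020d}.json"
        tmp = final.with_suffix(".tmp")
        tmp.write_text(json.dumps(rows))
        # The rename publishes the commit in one step: readers see the whole
        # batch or none of it, never a half-written file.
        tmp.rename(final)
        return version

    def read(self, version=None):
        """Return all rows as of `version` (default: the latest commit)."""
        if version is None:
            version = self.latest_version()
        rows = []
        for v in range(version + 1):
            commit_file = self.log_dir / f"{v:020d}.json"
            rows.extend(json.loads(commit_file.read_text()))
        # Schema evolution: take the union of columns seen so far and fill
        # gaps with None, without rewriting any earlier commit file.
        columns = []
        for row in rows:
            for key in row:
                if key not in columns:
                    columns.append(key)
        return [{c: row.get(c) for c in columns} for row in rows]


with tempfile.TemporaryDirectory() as root:
    table = ToyVersionedTable(root)
    v0 = table.commit([{"id": 1, "label": "cat"}])
    v1 = table.commit([{"id": 2, "label": "dog", "score": 0.9}])  # adds a column
    old_rows = table.read(version=v0)  # time travel: the table as of version 0
    new_rows = table.read()            # latest version, with the new column
```

Reading at `v0` returns only the first row with its original two columns, while reading the latest version returns both rows, with `score` set to `None` for the row written before that column existed. A real Delta table records far more in its log (file-level add and remove actions, schema metadata, checkpoints), so treat this only as a mental model.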
CLI
# Create dataset
mixtrain dataset create my-data data.parquet --provider delta
# Query dataset
mixtrain dataset query my-data "SELECT * LIMIT 100"
# View metadata
mixtrain dataset metadata my-data