Mixtrain Docs

Connect to SQL databases for dataset storage and querying.

Supported Databases

Database     Provider Type
PostgreSQL   postgresql
MySQL        mysql
Snowflake    snowflake
BigQuery     bigquery
Databricks   databricks

Setup

PostgreSQL

# CLI
mixtrain provider add postgresql

# Python
client.create_dataset_provider(
    provider_type="postgresql",
    secrets={
        "host": "localhost",
        "port": "5432",
        "database": "mydb",
        "user": "user",
        "password": "..."
    }
)
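
The example above hard-codes the password, which is convenient for local testing. One common alternative is to read the connection settings from environment variables. The sketch below is an illustration under stated assumptions: the `build_postgres_secrets` helper is hypothetical, and the `PG*` variable names follow the libpq convention rather than anything Mixtrain documents. Only the dictionary keys come from the setup example.

```python
import os


def build_postgres_secrets(env=None):
    """Build a `secrets` dict for a PostgreSQL provider from environment variables.

    Hypothetical helper: the PG* variable names are an assumption (libpq
    convention); only the dictionary keys mirror the setup example.
    """
    env = os.environ if env is None else env

    # Fail early with a clear message instead of sending empty credentials.
    required = ("PGDATABASE", "PGUSER", "PGPASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))

    return {
        "host": env.get("PGHOST", "localhost"),
        "port": env.get("PGPORT", "5432"),  # kept as a string, as in the example
        "database": env["PGDATABASE"],
        "user": env["PGUSER"],
        "password": env["PGPASSWORD"],
    }
```

The returned dictionary could then be passed as the `secrets` argument of `client.create_dataset_provider(...)` in place of the literal values.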

Snowflake

# CLI
mixtrain provider add snowflake

# Python
client.create_dataset_provider(
    provider_type="snowflake",
    secrets={
        "account": "xyz123.us-east-1",
        "user": "user",
        "password": "...",
        "warehouse": "COMPUTE_WH",
        "database": "MYDB"
    }
)

BigQuery

# CLI
mixtrain provider add bigquery

# Python
client.create_dataset_provider(
    provider_type="bigquery",
    secrets={
        "project_id": "my-project",
        "credentials_json": "{...}"
    }
)
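
The `credentials_json` placeholder above suggests that the field carries the text of a service-account key. A minimal sketch of loading that text from a key file follows. It assumes the field expects the raw JSON string, which this page does not confirm, and the `build_bigquery_secrets` helper and its arguments are hypothetical.

```python
import json
from pathlib import Path


def build_bigquery_secrets(project_id, key_path):
    """Read a service-account key file and return a `secrets` dict.

    Hypothetical helper. Assumes `credentials_json` expects the raw JSON
    text of the key file, as the placeholder in the setup example suggests.
    """
    raw = Path(key_path).read_text(encoding="utf-8")

    # Parse once so a truncated or malformed key file fails here,
    # not later when the provider tries to authenticate.
    json.loads(raw)

    return {"project_id": project_id, "credentials_json": raw}
```

Keeping the key in a file outside version control, and reading it only at setup time, avoids pasting credentials into source code.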

Querying

table = client.get_dataset("my-table")

# Run SQL query
df = table.query("SELECT * FROM users WHERE created_at > '2024-01-01'")

# Convert the result to a pandas DataFrame
pandas_df = df.to_pandas()

CLI

# Query dataset
mixtrain dataset query my-table "SELECT * LIMIT 100"

# Show dataset metadata
mixtrain dataset metadata my-table

Best Practices

  1. Use connection pooling for production workloads
  2. Create read replicas for query-heavy operations
  3. Use parameterized queries to prevent SQL injection
  4. Set appropriate timeouts for long-running queries
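
The parameterized-query recommendation can be illustrated with a small, self-contained sketch. This page only shows `query` taking a SQL string and does not say whether it accepts bound parameters, so the example uses Python's standard-library `sqlite3` module purely to demonstrate the principle. The `users` table and the `users_created_after` helper are made up for illustration.

```python
import sqlite3


def users_created_after(conn, cutoff):
    """Return (id, created_at) rows for users created after `cutoff`.

    The `?` placeholder lets the driver bind the value, so `cutoff` is
    never spliced into the SQL text.
    """
    cur = conn.execute(
        "SELECT id, created_at FROM users WHERE created_at > ? ORDER BY id",
        (cutoff,),
    )
    return cur.fetchall()


# Build a small in-memory table to exercise the helper.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO users (id, created_at) VALUES (?, ?)",
    [(1, "2023-12-31"), (2, "2024-02-15"), (3, "2024-06-01")],
)

rows = users_created_after(conn, "2024-01-01")

# A hostile value is bound as plain text, so its OR clause never runs
# as SQL and the 2023 row stays excluded.
injected = users_created_after(conn, "2024-01-01' OR '1'='1")
```

Had the cutoff been concatenated into the SQL string instead, the second call would have turned the filter into `... OR '1'='1'` and returned every row.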
