
Evaluations let you compare outputs from different models side-by-side. The comparison view supports images, videos, 3D models, audio, and text - making it easy to visually assess quality across models.

Overview

An evaluation references columns from your datasets and displays them in a comparison grid. Typically, each column holds a different model's output for the same inputs, or per-row metadata such as latency or cost.

Creating an Evaluation

The easiest way to create an evaluation is from a dataset with column types. Eval.from_dataset() automatically reads the dataset's column types and builds the comparison config:

from mixtrain import Eval

eval = Eval.from_dataset("image-gen-results")

This picks up all typed columns (image, video, audio, text, etc.) from the dataset and creates a side-by-side comparison view.

Selecting specific columns

Use the columns parameter to choose which columns to include and their order:

eval = Eval.from_dataset(
    "image-gen-results",
    name="flux-vs-sdxl",
    columns=["prompt", "flux_output", "sdxl_output"]
)

Manual configuration

For full control, use Eval.create() with an explicit config:

eval = Eval.create(
    name="flux-vs-sdxl",
    config={
        "datasets": [
            {"tableName": "image-gen-results", "columnName": "prompt", "dataType": "text"},
            {"tableName": "image-gen-results", "columnName": "flux_output", "dataType": "image"},
            {"tableName": "image-gen-results", "columnName": "sdxl_output", "dataType": "image"},
        ]
    },
    description="Compare Flux Pro vs SDXL image outputs"
)

The datasets array defines which columns to show in the comparison view:

  • tableName - The dataset containing the data
  • columnName - The column to display
  • dataType - How to render: text, image, video, audio, or 3d
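If you are building this config programmatically, the entries can be assembled from (columnName, dataType) pairs. The helper below is a hypothetical sketch, not part of the mixtrain SDK; it just produces a dict in the shape Eval.create() expects:

```python
# Hypothetical helper (not part of the mixtrain SDK): build the
# "datasets" config from (columnName, dataType) pairs for one table.
VALID_TYPES = {"text", "image", "video", "audio", "3d"}

def make_eval_config(table_name, columns):
    datasets = []
    for column_name, data_type in columns:
        if data_type not in VALID_TYPES:
            raise ValueError(f"unsupported dataType: {data_type}")
        datasets.append({
            "tableName": table_name,
            "columnName": column_name,
            "dataType": data_type,
        })
    return {"datasets": datasets}

config = make_eval_config("image-gen-results", [
    ("prompt", "text"),
    ("flux_output", "image"),
    ("sdxl_output", "image"),
])
```

The resulting config could then be passed straight to Eval.create(name=..., config=config).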

Workflow: Generate and Compare

A typical workflow is to run multiple models on the same inputs, store results in a dataset, then create an evaluation.

from mixtrain import Model, Dataset, Image

# Load test prompts
dataset = Dataset("test-prompts")
prompts = dataset.to_pandas()

# Run both models
flux = Model("flux-pro")
flex = Model("flux2-flex")

results = []
for _, row in prompts.iterrows():
    prompt = row["prompt"]
    results.append({
        "prompt": prompt,
        "fluxpro_output": flux.run(prompt=prompt).image.url,
        "fluxflex_output": flex.run(prompt=prompt).image.url,
    })

# Save results — column types are auto-detected from URLs
import pandas as pd
ds = Dataset.from_pandas(pd.DataFrame(results))
ds.save("image-gen-results")

Then create the evaluation to view results:

from mixtrain import Eval

# Automatically uses column types from the dataset
eval = Eval.from_dataset(
    "image-gen-results",
    name="flux-pro-vs-flex",
    columns=["prompt", "fluxpro_output", "fluxflex_output"]
)

Managing Evaluations

Get an Evaluation

from mixtrain import Eval

eval = Eval("flux-vs-sdxl")
print(eval.config)
print(eval.description)

Update an Evaluation

eval.update(
    description="Updated comparison",
    config={
        "datasets": [
            {"tableName": "image-gen-results", "columnName": "prompt", "dataType": "text"},
            {"tableName": "image-gen-results", "columnName": "fluxpro_output", "dataType": "image"},
            {"tableName": "image-gen-results", "columnName": "fluxflex_output", "dataType": "image"},
            {"tableName": "image-gen-results", "columnName": "dalle_output", "dataType": "image"},
        ]
    }
)

List All Evaluations

from mixtrain import list_evals

for eval in list_evals():
    print(f"{eval.name}: {eval.description}")

Delete an Evaluation

eval.delete()

Supported Data Types

  • text - Plain text, markdown, or code
  • image - Images (PNG, JPEG, WebP, etc.)
  • video - Videos (MP4, WebM, etc.)
  • audio - Audio files (MP3, WAV, etc.)
  • 3d - 3D models (GLB, GLTF)
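The workflow section above noted that column types are auto-detected from URLs. As a rough illustration of that idea (a hypothetical sketch, not the SDK's actual detection logic), one way to map a file URL to one of these data types is by extension:

```python
# Hypothetical sketch (not the SDK's actual implementation): infer a
# dataType for a column value from its file URL's extension.
import os
from urllib.parse import urlparse

EXTENSION_TYPES = {
    ".png": "image", ".jpg": "image", ".jpeg": "image", ".webp": "image",
    ".mp4": "video", ".webm": "video",
    ".mp3": "audio", ".wav": "audio",
    ".glb": "3d", ".gltf": "3d",
}

def guess_data_type(url):
    # Parse the URL so query strings don't confuse the extension check.
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return EXTENSION_TYPES.get(ext, "text")  # fall back to plain text
```

For example, guess_data_type("https://cdn.example.com/out.png") returns "image", while a plain string with no recognized extension falls back to "text".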

Next Steps

  • Datasets - Store model outputs as datasets
  • Models - Run models to generate outputs for comparison
