Evaluations

Compare public and private model outputs side-by-side on a dataset.

Text-to-Image Evaluation

Compare image generation models on the same prompts:

import pandas as pd

from mixtrain import MixFlow, Model, Dataset, Eval


class T2IEvaluation(MixFlow):
    """Compare text-to-image models side-by-side."""

    def run(
        self,
        input_dataset: Dataset,
        models: list[Model] | None = None,
        limit: int = -1,
    ):
        """Run text-to-image evaluation.

        Args:
            input_dataset: Dataset containing prompts
            models: Models to compare (default: flux-pro, stable-diffusion-xl)
            limit: Number of prompts (-1 for all)
        """
        if models is None:
            models = [Model("flux-pro"), Model("stable-diffusion-xl")]

        # Load prompts from dataset
        prompts = input_dataset.to_pandas()
        if limit > 0:
            prompts = prompts.head(limit)

        results = []
        for _, row in prompts.iterrows():
            prompt = row["prompt"]
            result = {"prompt": prompt}

            # Run each model
            for model in models:
                output = model.run({"prompt": prompt})
                result[f"{model.name}_image"] = output.image.url

            results.append(result)

        # Create output dataset and evaluation
        output_dataset = Dataset.create_from_dataframe(
            pd.DataFrame(results),
            name="t2i-comparison"
        )

        return {
            "evaluation": Eval.create(
                name="t2i-eval",
                dataset=output_dataset
            ),
            "dataset": output_dataset
        }
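
If you want to sanity-check the flow before registering it as a workflow, you can call run directly. The sketch below is a minimal example: it assumes a MixFlow subclass can be instantiated with no arguments and run locally, and the prompts and dataset name are placeholders. Dataset.create_from_dataframe and the default models follow the calls shown above.

import pandas as pd

from mixtrain import Dataset

# Build a small prompt dataset for a quick local check
# (prompt text and dataset name are examples only).
prompt_df = pd.DataFrame({"prompt": [
    "a watercolor painting of a lighthouse at dawn",
    "a macro photo of a snowflake on blue wool",
]})
prompts_dataset = Dataset.create_from_dataframe(prompt_df, name="demo-prompts")

# Assumes a MixFlow subclass can be instantiated and run outside the
# workflow runner; check your Mixtrain setup if this differs.
outputs = T2IEvaluation().run(input_dataset=prompts_dataset, limit=2)
evaluation = outputs["evaluation"]
comparison_dataset = outputs["dataset"]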

Video Generation Evaluation

Compare video generation models:

class VideoEvaluation(MixFlow):
    """Compare video generation models side-by-side."""

    def run(
        self,
        prompts: list[str],
        models: list[Model] | None = None,
    ):
        """Run video generation evaluation.

        Args:
            prompts: List of prompts to evaluate
            models: Models to compare (default: hunyuan-video, runway-gen3)
        """
        if models is None:
            models = [Model("hunyuan-video"), Model("runway-gen3")]

        results = Model.batch(
            models=models,
            inputs_list=[{"prompt": p} for p in prompts],
            max_in_flight=10
        )

        # Create the comparison evaluation (see the text-to-image example
        # above for building an output dataset and the full Eval.create call)
        return {"evaluation": Eval.create(...)}

Running Evaluations

# Create evaluation workflow
mixtrain workflow create eval_t2i.py \
  --name t2i-evaluation

# Run with specific models and dataset
mixtrain workflow run t2i-evaluation \
  --input '{"models": ["flux-pro", "dalle-3"], "input_dataset": "my-prompts", "limit": 100}'

Running the workflow will create an evaluation and return a new dataset containing the model outputs. The command above prints a link to the workflow run, where you can find the link to the evaluation in the Mixtrain UI.
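
The same two commands work for other evaluation flows. For example, the video evaluation above could be registered and run as follows; the file name eval_video.py is illustrative, and prompts are passed as a JSON list in --input, just like the other parameters.

# Register the video evaluation workflow (file name is an example)
mixtrain workflow create eval_video.py \
  --name video-evaluation

# Run it with a list of prompts and explicit models
mixtrain workflow run video-evaluation \
  --input '{"prompts": ["a drone shot of a coastline at sunset"], "models": ["hunyuan-video", "runway-gen3"]}'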

Viewing Results

After running the workflow, you can view the evaluation results in the Mixtrain UI:

  1. Go to the workflow run page (this is the link printed by the mixtrain workflow run command).
  2. Find the link to the evaluation in the "Outputs" section.
  3. Click the link to view the evaluation results.

Alternatively, you can find the evaluation directly from your workspace:

  1. Go to the Evaluations tab in your workspace.
  2. Find the evaluation by name.
  3. Click the evaluation to view the results.
