Vision Models

Run and fine-tune open-source vision models on Mixtrain, from vision-language models to segmentation and object detection models.

Example Models

Model	Provider	Parameters	Description
SmolVLM 2	Hugging Face	2.2B	Lightweight multimodal model for image and video understanding
SigLIP 2	Google	—	Visual encoder for image-text matching and classification
SAM 2.1	Meta	—	Segment anything in images and video
PaliGemma 2	Google	3B–28B	Visual question answering, captioning, and OCR

Quick Start

from mixtrain import Model

# Reference the model by its name
model = Model("smolvlm-2")

# Run inference on a single image with a text prompt
result = model.run({
    "image": "my-workspace/images/sample.jpg",
    "prompt": "Describe what you see in this image"
})
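To caption several images with the same prompt, the call above can be wrapped in a small helper. This is a minimal sketch: `describe_images` is a hypothetical name, not part of Mixtrain, and it assumes only the `model.run({...})` call shown in the Quick Start, invoked once per image. It makes no assumption about what `run` returns.

```python
from typing import Any, Dict, Iterable


def describe_images(model: Any, image_paths: Iterable[str], prompt: str) -> Dict[str, Any]:
    """Run the same prompt against each image and collect results by path.

    `model` is any object exposing run(dict), as in the Quick Start.
    This helper is illustrative and not part of the Mixtrain SDK.
    """
    results: Dict[str, Any] = {}
    for path in image_paths:
        # One request per image, using the same input keys as the Quick Start
        results[path] = model.run({"image": path, "prompt": prompt})
    return results


# Example usage (assumes the Quick Start model; the second path is a placeholder):
# from mixtrain import Model
# model = Model("smolvlm-2")
# captions = describe_images(
#     model,
#     ["my-workspace/images/sample.jpg", "my-workspace/images/other.jpg"],
#     "Describe what you see in this image",
# )
```

Because results are keyed by image path, passing the same path twice keeps only the last result.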
