Screenshot to App¶
Drop a screenshot of any UI into the chat panel and a vision-capable AI model rebuilds it as a native Python app. This is Whittl's flagship feature: the thing most other AI coding tools can't do, and the single biggest reason to use Whittl over a browser-based alternative.
What you can drop in¶
- Your own app screenshots. Tired of a command-line tool? Drop a screenshot of the idea, get a GUI.
- Competitor or reference UIs. A login screen, a settings panel, a notification view. The AI extracts layout intent, not pixel-perfect copies.
- Figma mocks or design sketches. Export a PNG, paste it in, iterate on the result in chat.
- Error screenshots or traceback windows. The AI reads the error text and debugs the code.
- Photos of paper sketches. Low-fidelity works; the AI picks up on control placement.
Privacy by default
Images never leave your machine without your action. You pick the backend, you attach the image, and the image goes only to that backend's API (Anthropic, Google, OpenRouter, etc.). There's no middleman server and no silent telemetry.
How to use it¶
Three ways to attach an image:
- Drag and drop — drag any image file directly onto the chat panel.
- Clipboard paste — copy a screenshot (Windows: Win+Shift+S, Linux: depends on your desktop environment), focus the chat input, and press Ctrl+V.
- File attachment button — click the image icon in the chat input to browse.
The attached image appears inline above the chat input with filename and size. You can remove it before sending, or leave it attached while you type your prompt.
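If you're curious how a drop target like this is wired up, the pattern is ordinary Qt and works just as well in your generated apps. A minimal PySide6 sketch (illustrative only; not Whittl's actual chat-panel code):

```python
# Illustrative only: a minimal PySide6 image drop target,
# not Whittl's actual chat-panel implementation.
from PySide6.QtWidgets import QApplication, QLabel
from PySide6.QtCore import Qt

class ImageDropLabel(QLabel):
    def __init__(self):
        super().__init__("Drop an image here")
        self.setAlignment(Qt.AlignmentFlag.AlignCenter)
        self.setAcceptDrops(True)  # opt in to drag-and-drop events

    def dragEnterEvent(self, event):
        # Accept the drag only if it carries file URLs (e.g. a PNG from your file manager)
        if event.mimeData().hasUrls():
            event.acceptProposedAction()

    def dropEvent(self, event):
        for url in event.mimeData().urls():
            path = url.toLocalFile()
            if path.lower().endswith((".png", ".jpg", ".jpeg")):
                self.setText(f"Attached: {path}")

app = QApplication([])
widget = ImageDropLabel()
widget.resize(320, 160)
widget.show()
app.exec()
```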
Then send a prompt like:
- "Rebuild this as a PySide6 desktop app"
- "Make me an app that looks like this screenshot"
- "Turn this login screen into a working app with username validation"
Whittl routes the image plus your prompt to your selected backend. If the backend or model doesn't support vision, you'll see a warning and the image is dropped from the request.
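For a sense of what actually goes over the wire, here is a rough sketch of a vision request in Anthropic's public Messages API format (the model name is an assumption; this is not Whittl's internal code):

```python
# Sketch of a vision request in Anthropic's Messages API format.
# Illustrative of what a backend receives, not Whittl's source.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumption: any vision-capable Claude model works
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
            {"type": "text",
             "text": "Rebuild this as a PySide6 desktop app with the same layout."},
        ],
    }],
)
print(message.content[0].text)
```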
Which models support vision¶
Direct backends¶
| Backend | Default model | Vision support |
|---|---|---|
| Claude API | Opus / Sonnet / Haiku | Yes — all three tiers |
| Gemini API | 2.5 Flash / 2.5 Pro / 3 Flash | Yes — all modern Gemini models |
| DeepSeek API | deepseek-chat | No — text-only. Use DeepSeek-VL2 via OpenRouter for vision. |
| Ollama (local) | Depends on pulled model | Partial — model must support vision (llava, qwen-vl, llama3.2-vision). Wiring complete in v2.4. See the probe sketch below the table. |
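If you're unsure whether your pulled Ollama model actually handles images, a one-off probe against the local REST API settles it. A minimal sketch (assumes Ollama on its default port 11434 with llava pulled):

```python
# Probe a local Ollama model for vision support.
# Assumes Ollama is serving on its default port and `llava` is pulled.
import base64
import json
import urllib.request

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "llava",
    "messages": [{
        "role": "user",
        "content": "Describe this UI in one sentence.",
        "images": [image_b64],  # Ollama takes raw base64, no data: URI prefix
    }],
    "stream": False,
}
request = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    reply = json.load(response)["message"]["content"]

# A text-only model answers from the prompt alone (or errors out);
# a vision model actually describes the screenshot.
print(reply)
```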
OpenRouter backend (most flexibility)¶
Any vision-capable model in OpenRouter's catalog works in Whittl. The Models dialog (opened from the button next to the backend dropdown) marks vision-capable models with a `[Vision]` chip. At the time of writing, the vision-capable set includes:
- Claude (Opus, Sonnet, Haiku) via `anthropic/claude-*`
- GPT-4o and GPT-4o-mini via `openai/gpt-4o*`
- Gemini 2.5 (Pro, Flash, Flash-Lite) via `google/gemini-2.5-*`
- Llama 3.2 Vision via `meta-llama/llama-3.2-*vision-instruct`
- Pixtral (Large and 12B) via `mistralai/pixtral-*`
- Qwen-VL via `qwen/qwen2-vl-*`
- Gemma 3 27B via `google/gemma-3-27b-it`
Filter the Models dialog to Vision to see the live list from your OpenRouter account's catalog.
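All of these models sit behind OpenRouter's OpenAI-compatible endpoint, so one request shape covers the whole list. A hedged sketch using the openai Python client (the model slug and image path are placeholders):

```python
# One request shape for every OpenRouter vision model (OpenAI-compatible API).
# The model slug is a placeholder; swap in any [Vision]-chipped model.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

with open("screenshot.png", "rb") as f:
    data_url = "data:image/png;base64," + base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="google/gemini-2.5-flash-lite",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Rebuild this exact layout as a PySide6 app."},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }],
)
print(response.choices[0].message.content)
```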
Cheapest vision model that still works well
Gemini 2.5 Flash-Lite via OpenRouter is around $0.0375 per million input tokens and produces serviceable apps from typical UI screenshots. Qwen-VL 7B is an open-weights alternative at similar cost through OpenRouter (it's also one of the models Ollama can run locally). Claude Haiku is the cheapest "premium tier" option if you want more design nuance.
What to expect from the output¶
Vision-model code generation is impressive but not magic. Realistic expectations:
What works well¶
- Overall layout. Sidebar + main content, top toolbar + content below, grid of tiles, modal dialogs. The AI consistently gets these right.
- Control type identification. Buttons, dropdowns, checkboxes, tab bars, tree views. Correctly identified across all the vision-capable models.
- Text content. Headings, labels, placeholder text, button labels. Usually transcribed correctly, especially on Claude and Gemini 2.5 Pro.
- Color palette extraction. Approximate — the AI won't nail exact hex codes, but it'll match the overall feel (dark/light, warm/cool, accent color).
What doesn't work well¶
- Pixel-perfect spacing. Padding, margins, and exact alignment need iteration. Expect to send follow-up prompts like "make the left margin wider".
- Icons. The AI sees an icon shape but can't name it. You'll get a generic placeholder. Use the `assets/` folder for your own icons, or ask Whittl to generate them in a follow-up.
- Non-standard controls. A custom-drawn progress ring, a novel gesture, a unique layout pattern. The AI will substitute a generic approximation.
- Multi-screen flows. A screenshot is one screen. Rebuilding a multi-screen app from one image is beyond one generation — break it up.
Recommended workflow¶
- First prompt: capture the gestalt. "Rebuild this as a PySide6 desktop app with the same layout." Don't over-specify. Let the AI make the first pass.
- Run it and look at what's wrong. The gap between the screenshot and the running app is the specification for the next prompt.
- Second prompt: fix structural issues. "The sidebar needs to be wider and the content area should scroll."
- Third prompt: polish details. "Make the buttons match the tan color from my brand palette."
- Add assets. Drop any logos or icons into `assets/` and prompt "use my logo at assets/logo.png in the header." The header code the AI produces for that prompt typically looks like the sketch below.
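For reference, an illustrative sketch of that generated header (not a guaranteed output):

```python
# Illustrative sketch of a generated header that loads assets/logo.png.
from PySide6.QtWidgets import QApplication, QWidget, QHBoxLayout, QLabel
from PySide6.QtGui import QPixmap
from PySide6.QtCore import Qt

app = QApplication([])
header = QWidget()
layout = QHBoxLayout(header)

logo = QLabel()
pixmap = QPixmap("assets/logo.png")  # path relative to the project root
logo.setPixmap(pixmap.scaledToHeight(32, Qt.TransformationMode.SmoothTransformation))

title = QLabel("My App")  # placeholder title
layout.addWidget(logo)
layout.addWidget(title)
layout.addStretch()  # push the logo and title to the left

header.show()
app.exec()
```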
Typical cost for a complete screenshot-to-working-app flow on Qwen3.5-Plus via OpenRouter: around $0.02 total for 3-4 iterations. Claude Sonnet runs closer to $0.15-0.30. Ollama-local runs free if you have the RAM.
Troubleshooting¶
The AI asked 'What screenshot?'
The image may not have been sent. Check that:
- The backend supports vision (see table above).
- The image thumbnail appeared in the chat input before you hit Send. If it didn't, the attach step failed.
- The model you picked is vision-capable. The Models dialog shows `[Vision]` chips.
If you're on OpenRouter with auto routing and the AI keeps asking about the screenshot, switch to an explicit vision model like `google/gemini-2.5-flash-lite` or `openai/gpt-4o-mini`.
The generated app looks nothing like the screenshot
Likely causes, roughly in order:
- Using a non-vision model via OpenRouter. Double-check the `[Vision]` chip. Some models fake-accept the image format but produce generic output.
- Prompt was too vague. "Make an app like this" gives the AI latitude to riff. Try "Rebuild this exact layout as a PySide6 app."
- Image was too small or compressed. Native-resolution screenshots work better than thumbnails. The AI can't read tiny UI text that's been scaled down. A quick resolution check is sketched below.
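To rule out that last cause, check the resolution before attaching. A small sketch using Pillow (an assumption; any image tool works, and the 800px threshold is only a rule of thumb):

```python
# Sanity-check a screenshot's resolution before attaching it.
# Uses Pillow (pip install pillow); any image tool works.
from PIL import Image

img = Image.open("screenshot.png")
width, height = img.size
print(f"{width}x{height}")

# Rule of thumb, not a hard limit: if you can read the UI text at 100% zoom,
# the model usually can too. Heavily downscaled thumbnails lose labels.
if width < 800:
    print("Warning: this looks like a thumbnail; recapture at native resolution.")
```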
Image was 'received' but vision features feel weaker than Claude direct
OpenRouter's vision routing uses the OpenAI multimodal format for every backend. Claude-direct uses Anthropic's native image block format, which is slightly richer for image reasoning. If you care about the highest fidelity on a specific project, use the Claude API backend directly rather than Claude via OpenRouter.
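Concretely, the two image payload shapes differ like this (schematic, trimmed to the image parts):

```python
# Schematic comparison of the two image payload shapes (trimmed).

# OpenRouter (OpenAI multimodal format): the image travels as a data: URL part.
openai_style = {
    "type": "image_url",
    "image_url": {"url": "data:image/png;base64,..."},
}

# Claude API direct (Anthropic native image block): a typed source with media_type.
anthropic_style = {
    "type": "image",
    "source": {"type": "base64", "media_type": "image/png", "data": "..."},
}
```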
What's next¶
- Iterating on Generated Code — the workflow recipe for the "fix this detail" cycle after the first generation
- Choosing a Backend — deeper guide to picking the right backend for vision work
- OpenRouter — all the vision-capable models in one place with pricing and capability chips