
Ollama (local)

Ollama runs AI models on your own hardware. No API key, no per-request cost, no network dependency, no data ever leaving your machine. The tradeoff: you need enough RAM to run a useful model, and quality is capped by the largest model your machine can handle.

For privacy-critical work, offline environments, or just zero-budget usage, Ollama is Whittl's only option.

Installing Ollama

  1. Go to ollama.com.
  2. Download the installer for your OS (Windows or Linux).
  3. Run it. On Windows it installs a system tray service; on Linux it installs a systemd unit.

Verify it's working:

ollama --version
ollama list        # empty at first; you haven't pulled any models yet

Pulling a model

Pick a model from ollama.com/library and pull it:

ollama pull qwen2.5-coder:7b

This downloads several gigabytes; expect the first pull to take 5-15 minutes on a decent connection. Additional models you pull later land in the same local library.

Model                 Size on disk   RAM needed   Quality for Whittl
qwen2.5-coder:7b      ~4.5 GB        10 GB        ⭐⭐⭐⭐   Best general-purpose 7B class
deepseek-coder:6.7b   ~3.8 GB        8 GB         ⭐⭐⭐    Python-focused, lightweight
qwen2.5-coder:14b     ~9 GB          16 GB        ⭐⭐⭐⭐⭐  Significantly better than 7B
codellama:13b         ~7.5 GB        14 GB        ⭐⭐⭐    Meta's coder, older but solid
qwen2.5-coder:32b     ~19 GB         32 GB        ⭐⭐⭐⭐⭐  Top local quality, wants a GPU

For most laptops and modest desktops: qwen2.5-coder:7b is the sweet spot. For 16GB+ RAM: qwen2.5-coder:14b is a major step up. For 32GB+ or a modern GPU: qwen2.5-coder:32b approaches cloud-model quality.
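
Once the pull finishes, confirm it landed (output illustrative; your ID and timestamp will differ):

ollama list
NAME                 ID              SIZE      MODIFIED
qwen2.5-coder:7b     2b0496514337    4.7 GB    2 minutes ago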

Telling Whittl about it

With Ollama running, select Local (Ollama) as the backend in Whittl's chat panel. Whittl auto-detects the Ollama daemon and populates the model dropdown with whatever you've pulled.

No API key to configure. No setup beyond pulling a model.
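
To see exactly what Whittl's dropdown has to work with, you can query the daemon's local HTTP API yourself (default port 11434; /api/tags is Ollama's standard model-listing endpoint, and presumably what Whittl reads):

curl http://localhost:11434/api/tags   # JSON list of every model you've pulled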

RAM and model size

Roughly how much RAM a model needs to run smoothly:

Model parameters   RAM (CPU)   RAM (GPU + VRAM)
3B                 4 GB        2 GB
7B                 8-10 GB     4-6 GB
13B-14B            14-16 GB    8-10 GB
30B-34B            32 GB       20-24 GB
70B+               64+ GB      40+ GB

If you run a model that doesn't fit, it swaps to disk and becomes unusably slow (one token per second instead of 20). Whittl surfaces a clear error if the model fails to load.
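
A rough pre-flight check before pulling something large (a sketch for Linux; compare available memory against the RAM column above):

free -h    # the "available" column should comfortably exceed the model's RAM requirement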

Performance tuning

CPU vs GPU

Ollama auto-detects CUDA (NVIDIA) and some ROCm (AMD) configurations. Check how it's actually running:

ollama ps

Shows the loaded model and what percentage is on GPU vs CPU. 100% GPU is ideal; 50/50 means the model is too big for your GPU and Ollama is splitting. 0% GPU means you're running on CPU, which is much slower.
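
Example output (illustrative; IDs and timings will differ on your machine):

NAME                ID              SIZE      PROCESSOR    UNTIL
qwen2.5-coder:7b    2b0496514337    6.0 GB    100% GPU     4 minutes from now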

First-load latency

Ollama lazy-loads model weights. The first request after startup takes 5-30 seconds just to load the model into memory. Subsequent requests reuse the loaded model (fast).

This means: a fresh Whittl session's first generation on Ollama feels slow even if steady-state is fast. Just wait it out on the first one.
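
If you'd rather pay that cost before you start typing, Ollama loads a model on any request, including an empty one; a generate call with no prompt warms the model and returns immediately (default port assumed):

curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:7b"}'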

Keep the model warm

Ollama evicts unused models after a few minutes of inactivity. If you're iterating, this is fine (each request within 5 min reuses the loaded model). If you walk away for half an hour and come back, first request re-loads.

You can bump the keep-alive via environment variable before launching Ollama:

OLLAMA_KEEP_ALIVE=24h ollama serve

Makes Ollama hold the model for 24 hours. Costs RAM.
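
On Linux, where Ollama runs as a systemd unit, you can make the setting persistent with a standard systemd override (ordinary systemd mechanics, nothing Ollama-specific):

sudo systemctl edit ollama
# in the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=24h"
sudo systemctl restart ollama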

Vision-capable local models

Ollama has vision-capable models (llava:7b, qwen-vl:7b, llama3.2-vision:11b), but Whittl's image-input wiring to Ollama is limited. For reliable vision work today, the cloud backends remain the more dependable path. Gemini 2.5 Flash-Lite's free tier is a reasonable privacy-adjacent compromise: it's cloud-based, but the tier is generous and it doesn't train on your data.

Tool use

Ollama's tool-use support is model-dependent. Models that work well with Whittl's edit_code / tool-use pipeline:

  • qwen2.5-coder:7b / 14b / 32b — reliable tool-use
  • deepseek-coder:6.7b — reliable
  • llama3.2:3b / 11b — decent
  • codellama:13b — hit or miss on complex tools
  • mistral:7b — works but inconsistent

If Ollama's modify pipeline repeatedly fails with tool-format errors, switch to a different model or fall back to full regeneration mode.
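
To smoke-test a model's tool support outside Whittl, you can hand Ollama's chat API a tool definition directly. The get_weather function below is a made-up example; a tool-capable model should answer with a tool_calls entry instead of prose:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5-coder:7b",
  "stream": false,
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'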

Agent Mode on Ollama

Agent Mode's unbounded tool loop is supported on Ollama, but:

  • Tier-A caps apply (20 rounds max)
  • Weaker models may not handle Agent Mode well — the oscillation guard fires early
  • Cost is free, so the tradeoff is just time

For light agentic work on local models, qwen2.5-coder:14b is probably the floor for reliability. Smaller models hit the round cap without completing.

Cost-free but not free

Running Ollama locally costs:

  • Disk space (5-30 GB per model you keep pulled)
  • RAM while running (8-32 GB depending on model)
  • Electricity (sustained CPU inference is a significant power draw, even on low-TDP laptops)
  • Your own time waiting for longer generations

Compare to cloud backends where a month of casual Whittl use costs $1-5. If your hardware is sitting there anyway, Ollama is free. If you'd need to buy more RAM specifically for it, the math differs.

When NOT to use Ollama

  • You need the absolute highest code quality. Local 14B models are behind Claude Sonnet on complex tasks. Not close.
  • You need vision right now. v2.4 fixes this but v2.3 doesn't support local vision.
  • Your hardware is modest. An 8 GB laptop can run a 7B model but generation takes 30-60 seconds per prompt. Usable but slow.
  • You want to iterate fast. Each local generation is slower than cloud. A 20-prompt session feels noticeably longer.

Troubleshooting

No models found in Ollama backend

Two causes:

  1. You haven't pulled any models yet. Run ollama pull qwen2.5-coder:7b.
  2. Ollama isn't running. Start it with ollama serve (Linux) or check the Ollama tray icon (Windows); see the probe below.
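
The probe: hit the daemon's root endpoint (default port 11434); it answers with a plain status string.

curl http://localhost:11434
# prints: Ollama is running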

Ollama is very slow — seconds per token

Model is too big for your RAM and is swapping to disk. Use a smaller model.

GPU isn't being used

Check ollama ps for load distribution. If it says CPU:

  • NVIDIA: ensure the latest CUDA drivers are installed (quick check below)
  • AMD: ROCm support varies by card; some older cards fall back to CPU
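
On NVIDIA hardware, a quick confirmation (nvidia-smi ships with the driver): watch VRAM while Whittl generates; usage should jump by roughly the model's size when it loads.

watch -n 1 nvidia-smi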

Tool use / surgical edit fails consistently

The model you picked isn't great at tool-use format. Try qwen2.5-coder:7b or larger. Or force full regeneration mode via the Modify toggle.

Download fails with exit code 1 — especially for newer models

Almost always means your Ollama install is older than the model you're trying to pull. New model variants (e.g. gemma3n:e4b, qwen3:30b) require a recent Ollama runtime to resolve their manifests; older clients silently 404 on the registry call.

Diagnosis:

  1. In the Ollama desktop app, search for the model name. If it doesn't appear in the search results, your local Ollama doesn't know about it yet.
  2. Run ollama --version in a terminal — compare against the latest at https://ollama.com/download.

Fix: update Ollama from https://ollama.com/download. Your existing pulled models survive the upgrade — ollama list will still show them all. Then retry the download.

Note: Whittl's installer doesn't bundle Ollama; it shells out to whichever ollama is on your PATH. Updating Ollama is a separate step from updating Whittl.
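
Since Whittl resolves ollama from your PATH, it's worth confirming the binary you just updated is the one it will find:

which ollama       # Linux (use `where ollama` on Windows)
ollama --version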

What's next