Ollama (local)¶
Ollama runs AI models on your own hardware. No API key, no per-request cost, no network dependency, no data ever leaving your machine. The tradeoff: you need the RAM to run a useful model, and quality caps at whatever size your machine can handle.
For privacy-critical work, offline environments, or just zero-budget usage, Ollama is Whittl's only option.
Installing Ollama¶
- Go to ollama.com.
- Download the installer for your OS (Windows or Linux).
- Run it. On Windows it installs a system tray service; on Linux it installs a systemd unit.
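On Linux, the download step is the official one-line install script rather than a graphical installer:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```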
Verify it's working:
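```bash
# CLI check
ollama --version
# API check: the daemon answers "Ollama is running" on its default port (11434)
curl http://localhost:11434
```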
Pulling a model¶
Pick a model from ollama.com/library and pull it:
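```bash
# swap in any tag from ollama.com/library
ollama pull qwen2.5-coder:7b
```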
This downloads several gigabytes; the first pull takes 5-15 minutes on a decent connection. Additional models you pull later land in the same local library (`ollama list` shows everything you have).
Recommended models for Whittl¶
| Model | Size on disk | RAM needed | Quality for Whittl |
|---|---|---|---|
| `qwen2.5-coder:7b` | ~4.5 GB | 10 GB | ⭐⭐⭐⭐ Best general-purpose 7B class |
| `deepseek-coder:6.7b` | ~3.8 GB | 8 GB | ⭐⭐⭐ Python-focused, lightweight |
| `qwen2.5-coder:14b` | ~9 GB | 16 GB | ⭐⭐⭐⭐⭐ Significantly better than 7B |
| `codellama:13b` | ~7.5 GB | 14 GB | ⭐⭐⭐ Meta's coder, older but solid |
| `qwen2.5-coder:32b` | ~19 GB | 32 GB | ⭐⭐⭐⭐⭐ Top local quality, wants a GPU |
For most laptops and modest desktops: `qwen2.5-coder:7b` is the sweet spot. For 16 GB+ RAM: `qwen2.5-coder:14b` is a major step up. For 32 GB+ or a modern GPU: `qwen2.5-coder:32b` approaches cloud-model quality.
Telling Whittl about it¶
With Ollama running, select Local (Ollama) as the backend in Whittl's chat panel. Whittl auto-detects the Ollama daemon and populates the model dropdown with whatever you've pulled.
No API key to configure. No setup beyond pulling a model.
RAM and model size¶
Roughly how much RAM a model needs to run smoothly:
| Model parameters | RAM (CPU) | RAM (GPU + VRAM) |
|---|---|---|
| 3B | 4 GB | 2 GB |
| 7B | 8-10 GB | 4-6 GB |
| 13B-14B | 14-16 GB | 8-10 GB |
| 30B-34B | 32 GB | 20-24 GB |
| 70B+ | 64+ GB | 40+ GB |
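A rough rule of thumb behind these numbers (assuming Ollama's default ~4-bit quantization): weights take about 0.5-0.6 bytes per parameter, so a 7B model is roughly 4 GB of weights (matching the ~4.5 GB disk figure above), plus headroom for the KV cache, the runtime, and whatever else your machine is doing.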
If you run a model that doesn't fit, it swaps to disk and becomes unusably slow (one token per second instead of 20). Whittl surfaces a clear error if the model fails to load.
Performance tuning¶
CPU vs GPU¶
Ollama auto-detects CUDA (NVIDIA) and some ROCm (AMD) configurations. Check how it's actually running:
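```bash
ollama ps
# the PROCESSOR column reports the split, e.g. "100% GPU" or "48%/52% CPU/GPU"
```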
Shows the loaded model and what percentage is on GPU vs CPU. 100% GPU is ideal; 50/50 means the model is too big for your GPU and Ollama is splitting. 0% GPU means you're running on CPU, which is much slower.
First-load latency¶
Ollama lazy-loads model weights. The first request after startup takes 5-30 seconds just to load the model into memory. Subsequent requests reuse the loaded model (fast).
This means: a fresh Whittl session's first generation on Ollama feels slow even if steady-state is fast. Just wait it out on the first one.
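If you'd rather not absorb that wait inside Whittl, you can pre-load the model before opening the app. A request with no prompt loads the weights without generating anything (this is the pre-load trick from Ollama's FAQ):

```bash
# loads qwen2.5-coder:7b into memory without generating any tokens
curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:7b"}'
```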
Keep the model warm¶
Ollama evicts unused models after a few minutes of inactivity. If you're iterating, this is fine (each request within 5 min reuses the loaded model). If you walk away for half an hour and come back, first request re-loads.
You can bump the keep-alive via environment variable before launching Ollama:
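```bash
# Linux: set before starting the server; the default eviction is about 5 minutes
export OLLAMA_KEEP_ALIVE=24h
ollama serve
# Windows: set OLLAMA_KEEP_ALIVE as a user environment variable,
# then quit and restart the tray app
```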
Makes Ollama hold the model for 24 hours. Costs RAM.
Vision-capable local models¶
Ollama has vision-capable models (llava:7b, qwen-vl:7b, llama3.2-vision:11b), but Whittl's image-input wiring to Ollama is limited. For reliable local vision work today, the cloud backends remain the more dependable path; Gemini 2.5 Flash-Lite's free tier is a reasonable compromise, since it's cloud-based but generous and doesn't train on your data.
Tool use¶
Ollama's tool-use support is model-dependent. Models that work well with Whittl's edit_code / tool-use pipeline (a standalone way to test a model follows the list):
- qwen2.5-coder:7b / 14b / 32b — reliable tool-use
- deepseek-coder:6.7b — reliable
- llama3.2:3b / 11b — decent
- codellama:13b — hit or miss on complex tools
- mistral:7b — works but inconsistent
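To sanity-check a model's tool calling outside Whittl, you can hit Ollama's /api/chat endpoint directly. A minimal sketch; the `get_weather` tool here is a throwaway placeholder, not anything Whittl uses:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5-coder:7b",
  "stream": false,
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
```

A model with solid tool support replies with a `tool_calls` entry naming `get_weather`; a weak one answers in plain prose instead.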
If Ollama's modify pipeline repeatedly fails with tool-format errors, switch to a different model or fall back to full regeneration mode.
Agent Mode on Ollama¶
Agent Mode's unbounded tool loop is supported on Ollama, but:
- Tier-A caps apply (20 rounds max)
- Weaker models may not handle Agent Mode well — the oscillation guard fires early
- Cost is free, so the tradeoff is just time
For light agentic work on local models, qwen2.5-coder:14b is probably the floor for reliability. Smaller models hit the round cap without completing.
Cost-free but not free¶
Running Ollama locally costs:
- Disk space (5-30 GB per model you keep pulled)
- RAM while running (8-32 GB depending on model)
- Electricity (CPU inference is significant on low-TDP laptops)
- Your own time waiting for longer generations
Compare to cloud backends where a month of casual Whittl use costs $1-5. If your hardware is sitting there anyway, Ollama is free. If you'd need to buy more RAM specifically for it, the math differs.
When NOT to use Ollama¶
- You need the absolute highest code quality. Local 14B models are behind Claude Sonnet on complex tasks. Not close.
- You need vision right now. v2.4 fixes this but v2.3 doesn't support local vision.
- Your hardware is modest. An 8 GB laptop can run a 7B model but generation takes 30-60 seconds per prompt. Usable but slow.
- You want to iterate fast. Each local generation is slower than cloud. A 20-prompt session feels noticeably longer.
Troubleshooting¶
No models found in Ollama backend
Two causes:
- You haven't pulled any models yet. Run `ollama pull qwen2.5-coder:7b`.
- Ollama isn't running. Start it with `ollama serve` (Linux) or check the Ollama tray icon (Windows).
Ollama is very slow — seconds per token
Model is too big for your RAM and is swapping to disk. Use a smaller model.
GPU isn't being used
Check `ollama ps` for the load distribution. If it says CPU:
- NVIDIA: ensure latest CUDA drivers installed
- AMD: ROCm support varies by card; some older cards fall back to CPU
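On NVIDIA, a quick driver-side sanity check (standard CUDA tooling, independent of Ollama):

```bash
nvidia-smi   # errors out if the driver stack isn't installed correctly
```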
Tool use / surgical edit fails consistently
The model you picked isn't good at the tool-call format. Try `qwen2.5-coder:7b` or larger, or force full regeneration mode via the Modify toggle.
Download fails with exit code 1 — especially for newer models
Almost always means your Ollama install is older than the model you're trying to pull. New model variants (e.g. `gemma3n:e4b`, `qwen3:30b`) require a recent Ollama runtime to resolve their manifests; older clients silently 404 on the registry call.
Diagnosis:
- In the Ollama desktop app, search for the model name. If it doesn't appear in the search results, your local Ollama doesn't know about it yet.
- Run `ollama --version` in a terminal and compare against the latest release at https://ollama.com/download.
Fix: update Ollama from https://ollama.com/download. Your existing pulled models survive the upgrade; `ollama list` will still show them all. Then retry the download.
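After updating, confirm the runtime and your library in one go:

```bash
ollama --version   # should now match the latest release
ollama list        # previously pulled models are still here
```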
Note: Whittl's installer doesn't bundle Ollama; it shells out to whichever `ollama` is on your PATH. Updating Ollama is a separate step from updating Whittl.
What's next¶
- Choosing a Backend — comparison including Ollama's role
- Multi-file Projects — smart routing matters more on local models (smaller context windows)
- Performance Tuning — Ollama-specific tuning options