
Ollama (local)

Ollama runs AI models on your own hardware. No API key, no per-request cost, no network dependency, no data ever leaving your machine. The tradeoff: you need enough RAM to run a useful model, and quality is capped by the largest model your machine can handle.

For privacy-critical work, offline environments, or just zero-budget usage, Ollama is Whittl's only option.

Installing Ollama

  1. Go to ollama.com.
  2. Download the installer for your OS (Windows or Linux).
  3. Run it. On Windows it installs a system tray service; on Linux it installs a systemd unit.

Verify it's working:

ollama --version
ollama list        # empty at first; you haven't pulled any models yet

Pulling a model

Pick a model from ollama.com/library and pull it:

ollama pull qwen2.5-coder:7b

This downloads several gigabytes; expect the first pull to take 5-15 minutes on a decent connection. Additional models you pull later land in the same local library.

Model                 Size on disk   RAM needed   Quality for Whittl
qwen2.5-coder:7b      ~4.5 GB        10 GB        ⭐⭐⭐⭐   Best general-purpose 7B class
deepseek-coder:6.7b   ~3.8 GB        8 GB         ⭐⭐⭐    Python-focused, lightweight
qwen2.5-coder:14b     ~9 GB          16 GB        ⭐⭐⭐⭐⭐  Significantly better than 7B
codellama:13b         ~7.5 GB        14 GB        ⭐⭐⭐    Meta's coder, older but solid
qwen2.5-coder:32b     ~19 GB         32 GB        ⭐⭐⭐⭐⭐  Top local quality, wants a GPU

For most laptops and modest desktops: qwen2.5-coder:7b is the sweet spot. For 16GB+ RAM: qwen2.5-coder:14b is a major step up. For 32GB+ or a modern GPU: qwen2.5-coder:32b approaches cloud-model quality.
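
Once the pull finishes, confirm it landed (output illustrative; your ID and timestamp will differ):

ollama list
NAME                 ID              SIZE      MODIFIED
qwen2.5-coder:7b     2b0496514337    4.7 GB    2 minutes ago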

Telling Whittl about it

With Ollama running, select Local (Ollama) as the backend in Whittl's chat panel. Whittl auto-detects the Ollama daemon and populates the model dropdown with whatever you've pulled.

No API key to configure. No setup beyond pulling a model.
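
To see exactly what Whittl's dropdown has to work with, you can query the daemon's local HTTP API yourself (default port 11434; /api/tags is Ollama's standard model-listing endpoint, and presumably what Whittl reads):

curl http://localhost:11434/api/tags   # JSON list of every model you've pulled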

RAM and model size

Roughly how much RAM a model needs to run smoothly:

Model parameters   RAM (CPU)   RAM (GPU + VRAM)
3B                 4 GB        2 GB
7B                 8-10 GB     4-6 GB
13B-14B            14-16 GB    8-10 GB
30B-34B            32 GB       20-24 GB
70B+               64+ GB      40+ GB

If you run a model that doesn't fit, it swaps to disk and becomes unusably slow (one token per second instead of 20). Whittl surfaces a clear error if the model fails to load.
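
A rough pre-flight check before pulling something large (a sketch for Linux; compare available memory against the RAM column above):

free -h    # the "available" column should comfortably exceed the model's RAM requirement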

Performance tuning

CPU vs GPU

Ollama auto-detects CUDA (NVIDIA) and some ROCm (AMD) configurations. Check how it's actually running:

ollama ps

Shows the loaded model and what percentage is on GPU vs CPU. 100% GPU is ideal; 50/50 means the model is too big for your GPU and Ollama is splitting. 0% GPU means you're running on CPU, which is much slower.
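
Example output (illustrative; IDs and timings will differ on your machine):

NAME                ID              SIZE      PROCESSOR    UNTIL
qwen2.5-coder:7b    2b0496514337    6.0 GB    100% GPU     4 minutes from now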

First-load latency

Ollama lazy-loads model weights. The first request after startup takes 5-30 seconds just to load the model into memory. Subsequent requests reuse the loaded model (fast).

This means: a fresh Whittl session's first generation on Ollama feels slow even if steady-state is fast. Just wait it out on the first one.
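
If you'd rather pay that cost before you start typing, Ollama loads a model on any request, including an empty one; a generate call with no prompt warms the model and returns immediately (default port assumed):

curl http://localhost:11434/api/generate -d '{"model": "qwen2.5-coder:7b"}'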

Keep the model warm

Ollama evicts unused models after a few minutes of inactivity. If you're iterating, this is fine (each request within 5 min reuses the loaded model). If you walk away for half an hour and come back, first request re-loads.

You can bump the keep-alive via environment variable before launching Ollama:

OLLAMA_KEEP_ALIVE=24h ollama serve

Makes Ollama hold the model for 24 hours. Costs RAM.
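
On Linux, where Ollama runs as a systemd unit, you can make the setting persistent with a standard systemd override (ordinary systemd mechanics, nothing Ollama-specific):

sudo systemctl edit ollama
# in the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_KEEP_ALIVE=24h"
sudo systemctl restart ollama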

Vision-capable local models

Ollama has vision-capable models (llava:7b, qwen-vl:7b, llama3.2-vision:11b), but Whittl's image-input wiring to Ollama is limited. For reliable vision work today, the cloud backends remain the more dependable path. Gemini 2.5 Flash-Lite's free tier is a reasonable privacy-adjacent compromise: it's cloud-based, but the tier is generous and it doesn't train on your data.

Tool use

Ollama's tool-use support is model-dependent. Models that work well with Whittl's edit_code / tool-use pipeline:

  • qwen2.5-coder:7b / 14b / 32b — reliable tool-use
  • deepseek-coder:6.7b — reliable
  • llama3.2:3b / 11b — decent
  • codellama:13b — hit or miss on complex tools
  • mistral:7b — works but inconsistent

If Ollama's modify pipeline repeatedly fails with tool-format errors, switch to a different model or fall back to full regeneration mode.
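
To smoke-test a model's tool support outside Whittl, you can hand Ollama's chat API a tool definition directly. The get_weather function below is a made-up example; a tool-capable model should answer with a tool_calls entry instead of prose:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5-coder:7b",
  "stream": false,
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'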

Agent Mode on Ollama

Agent Mode's unbounded tool loop is supported on Ollama, but:

  • Tier-A caps apply (20 rounds max)
  • Weaker models may not handle Agent Mode well — the oscillation guard fires early
  • Cost is free, so the tradeoff is just time

For light agentic work on local models, qwen2.5-coder:14b is probably the floor for reliability. Smaller models hit the round cap without completing.

Cost-free but not free

Running Ollama locally costs:

  • Disk space (5-30 GB per model you keep pulled)
  • RAM while running (8-32 GB depending on model)
  • Electricity (sustained CPU inference is a significant power draw, even on low-TDP laptops)
  • Your own time waiting for longer generations

Compare to cloud backends where a month of casual Whittl use costs $1-5. If your hardware is sitting there anyway, Ollama is free. If you'd need to buy more RAM specifically for it, the math differs.

When NOT to use Ollama

  • You need the absolute highest code quality. Local 14B models are behind Claude Sonnet on complex tasks. Not close.
  • You need vision right now. v2.4 fixes this but v2.3 doesn't support local vision.
  • Your hardware is modest. An 8 GB laptop can run a 7B model but generation takes 30-60 seconds per prompt. Usable but slow.
  • You want to iterate fast. Each local generation is slower than cloud. A 20-prompt session feels noticeably longer.

Troubleshooting

No models found in Ollama backend

Two causes:

  1. You haven't pulled any models yet. Run ollama pull qwen2.5-coder:7b.
  2. Ollama isn't running. Start it with ollama serve (Linux) or check the Ollama tray icon (Windows); see the probe below.
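
The probe: hit the daemon's root endpoint (default port 11434); it answers with a plain status string.

curl http://localhost:11434
# prints: Ollama is running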

Ollama is very slow — seconds per token

Model is too big for your RAM and is swapping to disk. Use a smaller model.

GPU isn't being used

Check ollama ps for load distribution. If it says CPU:

  • NVIDIA: ensure the latest CUDA drivers are installed (quick check below)
  • AMD: ROCm support varies by card; some older cards fall back to CPU
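
On NVIDIA hardware, a quick confirmation (nvidia-smi ships with the driver): watch VRAM while Whittl generates; usage should jump by roughly the model's size when it loads.

watch -n 1 nvidia-smi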

Tool use / surgical edit fails consistently

The model you picked isn't great at tool-use format. Try qwen2.5-coder:7b or larger. Or force full regeneration mode via the Modify toggle.

Download fails with exit code 1 — especially for newer models

Almost always means your Ollama install is older than the model you're trying to pull. New model variants (e.g. gemma3n:e4b, qwen3:30b) require a recent Ollama runtime to resolve their manifests; older clients silently 404 on the registry call.

Diagnosis:

  1. In the Ollama desktop app, search for the model name. If it doesn't appear in the search results, your local Ollama doesn't know about it yet.
  2. Run ollama --version in a terminal — compare against the latest at https://ollama.com/download.

Fix: update Ollama from https://ollama.com/download. Your existing pulled models survive the upgrade — ollama list will still show them all. Then retry the download.

Note: Whittl's installer doesn't bundle Ollama; it shells out to whichever ollama is on your PATH. Updating Ollama is a separate step from updating Whittl.
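
Since Whittl resolves ollama from your PATH, it's worth confirming the binary you just updated is the one it will find:

which ollama       # Linux (use `where ollama` on Windows)
ollama --version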

What's next