AI
Builder Hub
How to Run Local LLMs with Claude Code – Installing Qwen3.5 and llama.cpp
buildAI2026-03-1610 min

How to Run Local LLMs with Claude Code – Installing Qwen3.5 and llama.cpp

Complete guide from Unsloth: install llama.cpp, download Qwen3.5-35B-A3B, start llama-server, and connect it to Claude Code for a fully offline, free AI coding agent.

You can use Claude Code — Anthropic's AI coding agent — with a model running entirely on your local machine, no API account required, no cloud costs. This guide is based directly on Unsloth AI's official documentation and walks through every step from installation to running your first task.

The model used in this guide: Qwen3.5-35B-A3B — a Mixture-of-Experts (MoE) model that is compact, fast, and well-suited for coding agents.

Running Qwen3.5 local LLM with Claude Code and llama.cpp

Architecture: llama-server (local) → OpenAI-compatible endpoint → Claude Code agent


Architecture Overview

The entire setup works through this flow:

  1. llama.cpp — an open-source framework for running LLMs on personal computers (Mac, Linux, Windows)
  2. llama-server — serves the model via HTTP, exposing an OpenAI-compatible API endpoint on port 8001
  3. Claude Code — redirect the ANTHROPIC_BASE_URL environment variable to your local server instead of Anthropic's cloud

Result: Claude Code works exactly as normal, but is actually calling Qwen3.5 running on your machine.


Step 1: Choose the Right Model

Unsloth recommends several Qwen3.5 variants depending on your VRAM:

ModelVRAM neededSpeedNotes
Qwen3.5-35B-A3B~24GBFastestRecommended for RTX 4090
Qwen3.5-27BLess~2x slowerIf not enough VRAM for 35B
Qwen3.5-9B / 4B / 2BVery littleFastFor lower-spec machines

💡 Qwen3-Coder-Next is also an excellent choice for coding tasks if you have sufficient VRAM.


Step 2: Install llama.cpp

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git-all -y

git clone https://github.com/ggml-org/llama.cpp

cmake llama.cpp -B llama.cpp/build \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_CUDA=ON   # Change to OFF if no GPU

cmake --build llama.cpp/build --config Release -j --clean-first \
  --target llama-cli llama-mtmd-cli llama-server llama-gguf-split

cp llama.cpp/build/bin/llama-* llama.cpp

macOS (Apple Silicon): Use -DGGML_CUDA=OFF — Metal support is enabled automatically, no additional flags needed.


Step 3: Download the Qwen3.5 Model

pip install huggingface_hub hf_transfer

hf download unsloth/Qwen3.5-35B-A3B-GGUF \
  --local-dir unsloth/Qwen3.5-35B-A3B-GGUF \
  --include "*UD-Q4_K_XL*"
  # Use "*UD-Q2_K_XL*" for Dynamic 2-bit if you need to save VRAM

Unsloth uses UD-Q4_K_XL quantization — the best balance between file size and accuracy.


Step 4: Start llama-server

Run this in a separate terminal (use tmux or open a new window):

./llama.cpp/llama-server \
  --model unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --alias "unsloth/Qwen3.5-35B-A3B" \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --port 8001 \
  --kv-unified \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn on --fit on \
  --ctx-size 131072

Key parameters:

  • --temp 0.6 --top-p 0.95 --top-k 20 — Qwen's recommended sampling parameters for thinking mode
  • --cache-type-k q8_0 --cache-type-v q8_0 — KV cache quantization for reduced VRAM usage. Do not use f16 — Qwen3.5 loses accuracy with f16 KV cache
  • --fit on — auto-offload when exceeding VRAM
  • --ctx-size 131072 — 128K token context window

To disable thinking mode (can improve speed for coding):

--chat-template-kwargs "{\"enable_thinking\": false}"

Step 5: Connect Claude Code

Install Claude Code

npm install -g @anthropic-ai/claude-code

Configure (Linux / Mac)

export ANTHROPIC_BASE_URL="http://localhost:8001"
export ANTHROPIC_API_KEY="sk-no-key-required"

To persist across new terminals, add these lines to ~/.bashrc or ~/.zshrc.

Configure (Windows PowerShell)

$env:ANTHROPIC_BASE_URL="http://localhost:8001"
$env:CLAUDE_CODE_ATTRIBUTION_HEADER=0

Skip the Login Screen

If Claude Code still prompts you to sign in on first run, add this to ~/.claude.json:

{
  "hasCompletedOnboarding": true,
  "primaryApiKey": "sk-dummy-key"
}

Run Claude Code

cd your-project-folder
claude

To run without approval prompts for every command (use with caution):

claude --dangerously-skip-permissions

⚠️ Critical Fix: Claude Code Running 90% Slower

This is the most important issue flagged in Unsloth's documentation. Claude Code recently started prepending an Attribution Header to every request, which invalidates the KV Cache — causing inference speed to drop by up to 90%.

Fix: Edit ~/.claude/settings.json:

cat > ~/.claude/settings.json

Paste the following, then press Enter and Ctrl+D to save:

{
  "env": {
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
  }
}

⚠️ Important: Using export CLAUDE_CODE_ATTRIBUTION_HEADER=0 in the terminal does NOT work. The fix must be applied through settings.json.


VS Code / Cursor Extension

Claude Code can also run directly inside your editor:

  • Install the Claude Code extension from VS Code Marketplace (Ctrl+Shift+X → search "Claude Code")
  • Add "claudeCode.disableLoginPrompt": true to settings.json to skip the login screen
  • Ensure ANTHROPIC_BASE_URL is set in your environment before opening VS Code

Practical Usage Tips

When to use Thinking Mode?

  • Thinking mode excels at complex multi-step reasoning tasks (architecture decisions, difficult debugging)
  • Disable thinking (enable_thinking: false) for routine coding tasks to increase speed

Not enough VRAM?

  • Reduce --ctx-size to 32768 or 65536
  • Use a lighter quantization: *UD-Q2_K_XL* instead of *UD-Q4_K_XL*
  • Switch to Qwen3.5-9B or 4B

Verify the server is running:

curl http://localhost:8001/v1/models

Conclusion

With llama.cpp + Qwen3.5 + Claude Code, you have a fully local AI coding agent — no internet required, no API fees, and stable enough for real production projects.

Key points to remember:

  • Use --cache-type-k q8_0 --cache-type-v q8_0, never f16 KV cache with Qwen3.5
  • Fix the Attribution Header via ~/.claude/settings.json, not via export
  • Start with Qwen3.5-35B-A3B on an RTX 4090; fall back to 27B or smaller if VRAM is limited

Full reference: Unsloth AI Documentation