Chapter 23 / 40
AI on-prem
The Framework Desktop in the server room runs AI models locally. When Rani, Paul, or Jack ask the system to do something — draft a letter, look up a customer, generate a brief — the AI runs at Gennevilliers, not at a cloud provider. Everything is on the hardware Sodimo already owns.
Status: Blocked — models are selected and tested in sodimo/harness (the OS harness repo). The Framework Desktop is the prerequisite to go live.
Three local tiers
The system picks the right tier for each request. The user does not have to choose.
Speed tier — the default for most requests. Fast enough that a reply lands in under 30 seconds.
Used by:
- Morning briefs (Rani, Paul, Jack)
- Pre-call briefs
- Churn alerts
- Daily prospecting list
- Routine queries
Quality tier — for precision-critical synthesis where longer response time is acceptable.
Used by:
- Monthly board report for Michel
- Contract portfolio analysis
- Visit-debrief synthesis across a week of calls
Ensemble tier — for the highest-stakes analysis. Three models generate independent assessments, each is challenged by a devil’s advocate, then the outputs are synthesised into one. Takes the longest.
Used by:
- Opportunity analysis on a deal or renewal
- High-stakes deal review before a contract is signed
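The ensemble flow described above (independent assessments, a devil's-advocate challenge on each, then one synthesis) can be sketched as follows. This is a minimal illustration, not the sodimo/harness implementation: the function name `run_ensemble` and the idea of modelling each model as a plain prompt-to-text callable are assumptions made for the example.

```python
from typing import Callable, List

# Assumption: every model is represented as a callable taking a prompt and returning text.
Model = Callable[[str], str]


def run_ensemble(prompt: str, models: List[Model], advocate: Model, synthesiser: Model) -> str:
    """Hypothetical sketch of the ensemble tier: assess, challenge, synthesise."""
    # Step 1: each model produces its own assessment of the same prompt.
    assessments = [model(prompt) for model in models]

    # Step 2: the devil's advocate challenges each assessment separately.
    challenged = [
        f"ASSESSMENT:\n{a}\nCHALLENGE:\n{advocate('Argue against: ' + a)}"
        for a in assessments
    ]

    # Step 3: a single synthesis pass merges the challenged assessments into one output.
    return synthesiser("\n\n".join(challenged))
```

Because every model is called once and the advocate once per assessment, the tier costs at least twice as many model calls as a single-model tier plus the synthesis pass, which is consistent with it taking the longest.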
Model selection within each tier is managed in sodimo/harness and is tested against real Sodimo workloads before being rolled out. The tier is the contract with the user; the specific model behind it can be swapped without changing how anyone works.
The runtime underneath
The local models run under llama.cpp, routed through llama-swap so one process can host several models and swap between them on demand. There is one runtime on the box, tested against the exact AMD GPU, pinned to a known-good version. Upstream releases do not auto-update; a version bump is a deliberate edit to sodimo/dotfiles.
Image · ghcr.io/mostlygeek/llama-swap:vulkan — Vulkan-only posture. The image does not ship a ramalama binary; llama-swap execs /app/llama-server directly. ROCm-forward-compat vestiges (/dev/kfd, HSA_OVERRIDE_GFX_VERSION=11.0.0) remain on the container for future re-enable and are tracked in sodimo/dotfiles#14.
Model inventory (post 2026-04-22):
| Alias | Model | Role |
|---|---|---|
| local-task | qwen3-4b (Vulkan GGUF) | default chat, tool use |
| local-heavy | gpt-oss-120b unsloth UD-Q8_K_XL (2-shard GGUF, 65k ctx, reasoning_effort=high) | heavy reasoning; deferred until harness#11 kernel cmdline lands + 100 GB GGUF downloaded |
| local-coder | qwen3-coder-30b (TTL=600s, heavy group) | code tasks |
| local-embeddings | qwen3-embedding-8b (TTL=0) | RAG embeddings (see OpenWebUI) |
Strix Halo runtime posture
Every model cmd: block in llama-swap.yaml carries the kyuz0 mandatory Strix-Halo flag set. Rationale per flag:
| Flag | Why |
|---|---|
| --no-mmap | Strix Halo iGPU wants contiguous VRAM; mmap fragments it |
| -fa on | flash attention |
| -ngl 999 | offload every layer to GPU |
| --batch-size 4096 --ubatch-size 512 | Strix-Halo-tuned throughput |
| --cache-type-k q8_0 --cache-type-v q8_0 | Q8 KV cache, ~2x effective context |
| --jinja | Jinja-templated chat format |
| --direct-io --cache-prompt --cache-reuse 256 | prompt cache |
| --threads 12 | Strix Halo has 16 cores; 12 for inference leaves 4 for host |
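One way to keep every cmd: block aligned with the table above is to generate or lint the command line from a single flag list. The sketch below is illustrative only: the flags are copied from the table, while `build_cmd`, `missing_flags`, the model path, and the port are hypothetical names and values, not part of sodimo/dotfiles.

```python
from typing import List

# Flag set copied from the table above, one token per list element.
STRIX_HALO_FLAGS: List[str] = [
    "--no-mmap",
    "-fa", "on",
    "-ngl", "999",
    "--batch-size", "4096", "--ubatch-size", "512",
    "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
    "--jinja",
    "--direct-io", "--cache-prompt", "--cache-reuse", "256",
    "--threads", "12",
]


def build_cmd(model_path: str, port: int) -> List[str]:
    """Assemble a llama-server argv carrying the full flag set (path and port are placeholders)."""
    return ["/app/llama-server", "--model", model_path, "--port", str(port), *STRIX_HALO_FLAGS]


def missing_flags(cmd: str) -> List[str]:
    """Return the mandatory flag names that do not appear in a cmd: string."""
    tokens = cmd.split()
    # Only flag names (tokens starting with "-") are checked, not their values.
    return [f for f in STRIX_HALO_FLAGS if f.startswith("-") and f not in tokens]
```

A lint step of this kind only checks that each flag name is present; it does not verify the values, so a changed `--threads` count would still pass.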
The operational card (ports, env, mounts, exact image digest) lives in Quadlet reference → llama-swap. This chapter owns the why; ch38 owns the wiring.
Self-proxy rule. In llama-swap.yaml, every model’s proxy: must target http://127.0.0.1:NNNN, never http://llama-swap:NNNN. llama-swap and the spawned llama-server share a netns — using the container hostname forces an unnecessary netavark DNS round-trip that 502s under the on-box adguard DNAT.
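The self-proxy rule lends itself to a mechanical check. The sketch below assumes the YAML has already been parsed into a dictionary of model name to settings (the parsing step and the function name `bad_proxies` are assumptions for illustration); it flags any model whose proxy is missing or does not point at 127.0.0.1 with an explicit port.

```python
from typing import Dict, List
from urllib.parse import urlparse


def bad_proxies(models: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the names of models whose proxy does not target http://127.0.0.1:NNNN."""
    offenders = []
    for name, entry in models.items():
        # A missing proxy key is treated as a violation in this sketch.
        url = urlparse(entry.get("proxy", ""))
        if url.scheme != "http" or url.hostname != "127.0.0.1" or url.port is None:
            offenders.append(name)
    return offenders
```

Run against a config where one entry uses the container hostname, the function returns just that entry's name, which makes it suitable as a pre-deploy check.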
Vulkan driver trade-off. The mostlygeek image ships RADV. kyuz0 recommends AMDVLK for prompt-heavy / long-context workloads. Upgrade path = custom gateway image layering amdvlk-2025.Q2.1.rpm. Deferred.
kyuz0 reference. Strix-Halo toolbox pinned at commit 1421e870…; full interpolation notes in sodimo/dotfiles/docs/kyuz0-toolbox.md, resync procedure in docs/resync-runbook.md.
Smoke benchmark — 2026-04-22
End-to-end through OpenWebUI → LiteLLM → llama-swap → llama-server on the dev-box Framework Desktop (same silicon as the target harness):
- local-task (qwen3-4b): 53 tok/s on RADV Vulkan, 11.5 s cold, 0.08–0.20 s warm.
- local-heavy (gpt-oss-120b UD-Q8_K_XL, 16 k ctx): 46.6 tok/s steady-state on RADV Vulkan, ~25 s cold load + 4.3 s first-200-tok decode, 2.7 s warm, peak ~60 GiB GTT. Ran at 16 k ctx (reduced from 65 k) pending the harness#11 kernel-cmdline deploy. Full run notes: sodimo/dotfiles/docs/gpt-oss-120b-smoke.md.
Cross-reference: OpenWebUI.
Escalation policy and usage counter
Default routing
Every AI invocation routes to the on-prem stack first. The alias-to-tier mapping is fixed:
| Alias | Tier | Default use |
|---|---|---|
| local-task | Speed | Chat, tool-use, short drafts |
| local-heavy | Quality | Reasoning, synthesis, board reports |
| local-coder | Quality | Code generation and review |
| cloud-heavy | Cloud (Claude Opus) | Opt-in escalation only |
cloud-heavy is not a fallback the system reaches on its own; it is a deliberate per-invocation opt-in. The caller must pass escalate: "cloud-heavy" in the tool invocation. Nothing escalates silently.
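The opt-in rule can be expressed as a small resolution function. This is a sketch under assumptions: `resolve_alias` and its parameters are hypothetical names, and only the alias table and the escalate: "cloud-heavy" value come from this chapter.

```python
from typing import Optional

# Alias-to-tier mapping for the local stack, copied from the table above.
LOCAL_ALIASES = {"local-task": "Speed", "local-heavy": "Quality", "local-coder": "Quality"}


def resolve_alias(requested: str, escalate: Optional[str] = None) -> str:
    """Return the alias to dispatch to; the cloud is reachable only through the explicit flag."""
    if escalate == "cloud-heavy":
        return "cloud-heavy"
    if escalate is not None:
        raise ValueError(f"unknown escalation target: {escalate!r}")
    if requested not in LOCAL_ALIASES:
        # Asking for cloud-heavy directly, without the flag, is rejected rather than honoured.
        raise ValueError(f"not a local alias: {requested!r}")
    return requested
```

Rejecting a bare request for cloud-heavy, rather than quietly serving it, keeps the flag as the single place where an escalation can originate.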
Escalation triggers
Four conditions warrant explicit escalation to cloud-heavy:
- Explicit flag: the skill, agent, or tool invocation carries escalate: "cloud-heavy". This is the primary trigger; the others are documentation, not automation.
- Context-length overflow: the assembled prompt exceeds the local model’s context window (65 k tokens for local-heavy; shorter for the other tiers). The caller is responsible for detecting this before dispatch.
- Task-class quality regression: a specific task class (e.g., legal-contract review) has been characterised as requiring cloud-grade output and is flagged in the skill manifest. The skill sets escalate: "cloud-heavy" unconditionally for that class.
- Regulatory requirement: audit deliverables or legal opinions where the operator needs a documented, cloud-provider-issued model version for chain-of-custody reasons.
No other trigger is recognised. “It felt slow” or “the local answer seemed short” are not escalation triggers; they are feedback for the model-selection process in sodimo/harness.
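Since the caller is responsible for detecting context overflow before dispatch, a caller-side helper might look like the sketch below. The names (`escalation_flag`, `reply_budget`, `manifest_requires_cloud`) are hypothetical, and the exact integer behind "65 k" is an assumption; the windows for the other aliases are not stated in this chapter, so they are left out.

```python
from typing import Optional

# "65 k tokens" per the text above; the exact integer value is an assumption.
CONTEXT_WINDOW_TOKENS = {"local-heavy": 65_000}


def escalation_flag(alias: str, prompt_tokens: int, reply_budget: int = 0,
                    manifest_requires_cloud: bool = False) -> Optional[str]:
    """Return "cloud-heavy" when a documented trigger applies, otherwise None."""
    # Task-class trigger: the skill manifest flags the class unconditionally.
    if manifest_requires_cloud:
        return "cloud-heavy"
    window = CONTEXT_WINDOW_TOKENS.get(alias)
    if window is None:
        raise ValueError(f"no context window recorded for alias {alias!r}")
    # Context-length trigger: prompt plus reserved reply space must fit the window.
    if prompt_tokens + reply_budget > window:
        return "cloud-heavy"
    return None
```

The return value is meant to be passed straight through as the escalate field of the invocation, so the explicit flag stays the only mechanism that actually changes the route.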
Usage counter
Every request that routes through llama-swap emits a run_ledger row via the Worker ledger_write tool (see ch42 What the AI can access). The row schema matches Principle 2 exactly:
| Field | Value for local runs |
|---|---|
| model | alias (local-task, local-heavy, local-coder) |
| provider | local |
| tokens_in | prompt token count from llama-server |
| tokens_out | completion token count |
| latency_ms | wall-clock duration from dispatch to first token |
| cost_eur | 0 (on-prem runs carry no per-token cost) |
| cost_eur_if_cloud | counterfactual: what the same prompt+completion would have cost at Claude Opus pricing |
The cost_eur_if_cloud column is the savings number. Summed across all local runs, SUM(cost_eur_if_cloud) - SUM(cost_eur) is the headline. One SQL query against run_ledger.
Cloud escalation rows record the actual cost in cost_eur; cost_eur_if_cloud equals cost_eur for those rows (no counterfactual saving on a cloud run).
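The row rules and the savings query can be tried out against an in-memory SQLite table standing in for the real ledger. Everything beyond the column names is an assumption for illustration: the per-token prices are placeholders (this chapter does not state Claude Opus pricing), the provider value "cloud" for escalated rows is a guess, and the helper names are hypothetical.

```python
import sqlite3

# Placeholder prices in EUR per million tokens; NOT real Claude Opus pricing.
PRICE_IN_EUR_PER_MTOK = 10.0
PRICE_OUT_EUR_PER_MTOK = 50.0


def cloud_cost_eur(tokens_in: int, tokens_out: int) -> float:
    """Counterfactual cost of a prompt+completion at the placeholder cloud prices."""
    return (tokens_in * PRICE_IN_EUR_PER_MTOK + tokens_out * PRICE_OUT_EUR_PER_MTOK) / 1_000_000


def ledger_row(model, provider, tokens_in, tokens_out, latency_ms, actual_cost_eur=0.0):
    """Build one run_ledger row following the rules above."""
    if provider == "local":
        # Local run: zero actual cost, counterfactual recorded as the saving.
        cost, if_cloud = 0.0, cloud_cost_eur(tokens_in, tokens_out)
    else:
        # Cloud run: both columns carry the actual cost, so the row adds no saving.
        cost = if_cloud = actual_cost_eur
    return (model, provider, tokens_in, tokens_out, latency_ms, cost, if_cloud)


def open_ledger() -> sqlite3.Connection:
    """Create an in-memory stand-in for the run_ledger table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE run_ledger (model TEXT, provider TEXT, tokens_in INTEGER, "
        "tokens_out INTEGER, latency_ms INTEGER, cost_eur REAL, cost_eur_if_cloud REAL)"
    )
    return conn


def savings_eur(conn: sqlite3.Connection) -> float:
    """The headline number: SUM(cost_eur_if_cloud) - SUM(cost_eur), 0 for an empty ledger."""
    row = conn.execute(
        "SELECT COALESCE(SUM(cost_eur_if_cloud) - SUM(cost_eur), 0) FROM run_ledger"
    ).fetchone()
    return row[0]
```

Because cloud rows store the same value in both cost columns, they cancel out in the difference, and the query needs no WHERE clause on provider.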
Surface
The cumulative counter (total runs, total prompt tokens, total completion tokens, implied cloud cost saved) is displayed on the launchpad tile (ch61 Launchpad) and as a rolling 7-day chart on a progress-adjacent page. Exact surface placement is an open operator call; tracked as a follow-up.
Grafana / Prometheus / K3s-style metric export is deferred. The ledger already holds the data; a Grafana-backed dashboard would require a separate D1-to-Prometheus bridge with no additional analytical value at v1. Revisit tracked as Pivot 5a.
Pointer. → Principle 2, ch15 Design principles for the routing rationale; ch42 What the AI can access for ledger_write schema; ch55 D-188 (local-first routing), D-189 (usage counter).
When the system escalates to cloud
Two triggers send a single request to Claude (Anthropic) instead of the local stack:
- The customer’s outstanding balance is over €5,000 — higher stakes, higher bar for the draft.
- The local model flags low confidence on its own output.
In both cases, the escalation is automatic, logged, and visible on the cost dashboard. Cloud escalation uses Claude only, by design. No other cloud provider is wired in.
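The two triggers listed in this section reduce to a single predicate. The sketch below is illustrative only: `auto_escalates` and its parameter names are hypothetical, and how the local model reports low confidence is not specified here, so it is passed in as a boolean.

```python
# Threshold from the text above: balances over EUR 5,000 escalate.
BALANCE_THRESHOLD_EUR = 5_000


def auto_escalates(outstanding_balance_eur: float, low_confidence: bool) -> bool:
    """True when either trigger in this section applies to a single request."""
    # "Over" is read as strictly greater than the threshold.
    return outstanding_balance_eur > BALANCE_THRESHOLD_EUR or low_confidence
```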
Where the AI runs — and why
Running the AI on the Framework Desktop means the data never leaves Gennevilliers. Customer information, AR balances, contract details, and email content stay inside the building. The hardware is a one-time cost that Sodimo already absorbed; day-to-day AI use has no per-request bill attached.
Target: 70% of AI requests answered locally. The dashboard tracks this in real time.
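The target is a simple ratio of local requests to all requests. A minimal sketch, assuming each request is identified by its provider value ("local" for on-prem runs, as in the ledger schema) and using hypothetical function names:

```python
from typing import Iterable

# Target from the text above: 70% of AI requests answered locally.
LOCAL_SHARE_TARGET = 0.70


def local_share(providers: Iterable[str]) -> float:
    """Fraction of requests whose provider is "local"; 0.0 when there are no requests."""
    values = list(providers)
    if not values:
        return 0.0
    return sum(1 for p in values if p == "local") / len(values)


def meets_target(providers: Iterable[str]) -> bool:
    """True when the local share reaches the 70% target."""
    return local_share(providers) >= LOCAL_SHARE_TARGET
```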