Chapter 6 / 40

Three design principles

Three principles run through every tool and wiring decision in the rest of this manual. They are the reason the stack looks the way it looks, and the reason some obvious-sounding alternatives were ruled out. A reader who reads only this chapter should still understand why the system is shaped this way.

Status: Live — every chapter from here forward is written to these three principles. If a later chapter appears to contradict one of them, the principle wins and the chapter is wrong.

Principle 1 — Conceptual fork over upstream adoption (static at handoff)

The system Sodimo inherits must be functional and static on the day the engagement ends. Every component is either a pinned-tag adoption of upstream or a Sodimo-owned conceptual fork; nothing chases :latest.
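The "nothing chases :latest" rule can be checked mechanically. The following is a minimal sketch, not part of the Sodimo tooling: unpinned_reasons and FLOATING_TAGS are hypothetical names, and the list of floating tags is an assumption. It reads the text of a Quadlet .container unit and reports any Image= line without a fixed tag, plus any AutoUpdate= setting.

```python
# Hypothetical pin checker for Quadlet .container units (illustration only).
FLOATING_TAGS = {"latest", "stable", "edge", "main"}  # assumed list, extend as needed


def unpinned_reasons(unit_text: str) -> list[str]:
    """Return the reasons a unit could drift after handoff (empty list = pinned)."""
    reasons = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, section headers, and anything that is not key=value.
        if not line or line[0] in "#;[" or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "AutoUpdate":
            reasons.append(f"AutoUpdate={value} enables auto-pull")
        elif key == "Image":
            if "@sha256:" in value:
                continue  # digest reference, cannot drift
            # A tag can only appear after the last path segment;
            # an earlier colon belongs to a registry port.
            last_segment = value.rsplit("/", 1)[-1]
            _, sep, tag = last_segment.partition(":")
            if not sep:
                reasons.append(f"{value} has no tag, so it resolves to latest")
            elif tag in FLOATING_TAGS:
                reasons.append(f"{value} uses the floating tag '{tag}'")
    return reasons
```

A unit containing Image=registry.example/app:1.4.2 (a made-up image name) returns an empty list; the same unit with :latest and AutoUpdate=registry returns two reasons.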

Why. After handoff, Sodimo has no full-time engineer. Every upstream breaking change between then and the day someone is hired becomes a liability. The team should hire deliberately, not urgently.

How to apply. When designing a new component, classify it first:

Ask: does Sodimo differentiate here? If no → adopt upstream, pinned to an image tag in a Podman Quadlet unit. If yes → conceptual fork: lift a known-good snapshot, scope down to what Sodimo actually uses, and evolve independently.
Ask: does it touch a Sodimo-specific integration point (run_ledger, Sodiwin black-box pull, MCP tool surface)? If yes → the glue layer is forked even when the commodity underneath is adopted.
Pin the version everywhere — image tag in the quadlet unit, or vendored source at a known-good snapshot. No Renovate, no auto-pull, no timer upgrades.
Scope down aggressively — disable every upstream capability Sodimo does not use; port only the functions Rani, Paul, and Michel actually touch.
Apply the future-component test: “if the engineer vanishes and upstream ships a breaking change next month, does anything at Sodimo break?” The answer must be no.
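The two classification questions above can be written down as a small decision function. This is a sketch under stated assumptions: the type names (Component, Decision, Strategy) and the classify function are hypothetical, and the real classification is a human judgment recorded in the decisions annex, not code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strategy(Enum):
    ADOPT_PINNED = "adopt upstream at a pinned tag"
    CONCEPTUAL_FORK = "Sodimo-owned conceptual fork"


@dataclass(frozen=True)
class Component:
    name: str
    differentiates: bool             # does Sodimo differentiate here?
    touches_integration_point: bool  # run_ledger, Sodiwin black-box pull, MCP tool surface


@dataclass(frozen=True)
class Decision:
    core: Strategy            # how the component itself is sourced
    glue: Optional[Strategy]  # how its glue layer is sourced, None if there is none


def classify(component: Component) -> Decision:
    """Apply the two questions in order; the glue answer is independent of the core."""
    core = Strategy.CONCEPTUAL_FORK if component.differentiates else Strategy.ADOPT_PINNED
    glue = Strategy.CONCEPTUAL_FORK if component.touches_integration_point else None
    return Decision(core=core, glue=glue)
```

A commodity component that writes to run_ledger comes out as Decision(core=ADOPT_PINNED, glue=CONCEPTUAL_FORK): the commodity is adopted, the glue is forked.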

Pointer. → ch38 Quadlet reference for the adopt cases; ch51 The repos for the forked surface; ch55 Annex decisions D-178 for the precedent list.

Principle 2 — Token accounting across every run

Every AI invocation in the Sodimo stack emits a usage record to a single append-only ledger. Regardless of surface (interactive chat, scheduled agent, email auto-reply, MCP tool call), the stack can at any moment produce the full list of every token Sodimo’s AI has consumed, local and cloud: when, why, and by whom.

Why. The core economic story of the engagement is that local AI saves money relative to cloud AI. That claim is only valid if it can be proven. Without end-to-end accounting the stack has an anecdote, not a number. The ledger is also a long-horizon data asset: it tells future engineers exactly when local AI pays off at SMB scale.

How to apply. For every new quadlet, MCP tool, or skill, answer one question before it ships: how does this emit a run record? If the answer is “it doesn’t,” the component is not ready.

Every run gets a row in D1 table run_ledger, append-only, timestamped.
Minimum schema: run_id, ts, surface, user_id, agent_id, model, provider, tokens_in, tokens_out, tokens_cached, latency_ms, outcome, cost_eur, cost_eur_if_cloud.
cost_eur_if_cloud carries the narrative: it records what the same run would have cost on the top-tier cloud model. Summed over all runs, cost_eur_if_cloud minus cost_eur is the headline savings number.
Surface a chapter-level dashboard on the changelog site: tokens consumed, actual cost, counterfactual cost, savings.
Queryable by surface/model/user/date; exportable to CSV for the savings study.
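The ledger requirements above can be sketched with Python's built-in sqlite3 module as a local stand-in for D1, which is SQLite-based. The column names come from the minimum schema; the column types, the nullable agent_id, the trigger-based append-only enforcement, and the helper names (open_ledger, append_run, savings_by, headline_savings) are assumptions for illustration, not the locked schema of D-179.

```python
import sqlite3

COLUMNS = (
    "run_id", "ts", "surface", "user_id", "agent_id", "model", "provider",
    "tokens_in", "tokens_out", "tokens_cached", "latency_ms", "outcome",
    "cost_eur", "cost_eur_if_cloud",
)

SCHEMA = """
CREATE TABLE run_ledger (
    run_id            TEXT PRIMARY KEY,
    ts                TEXT NOT NULL,            -- ISO 8601 UTC (assumed format)
    surface           TEXT NOT NULL,
    user_id           TEXT NOT NULL,
    agent_id          TEXT,                     -- assumed nullable for interactive chat
    model             TEXT NOT NULL,
    provider          TEXT NOT NULL,
    tokens_in         INTEGER NOT NULL,
    tokens_out        INTEGER NOT NULL,
    tokens_cached     INTEGER NOT NULL DEFAULT 0,
    latency_ms        INTEGER NOT NULL,
    outcome           TEXT NOT NULL,
    cost_eur          REAL NOT NULL,
    cost_eur_if_cloud REAL NOT NULL
);
-- Append-only: reject any UPDATE or DELETE at the database level.
CREATE TRIGGER run_ledger_no_update BEFORE UPDATE ON run_ledger
BEGIN SELECT RAISE(ABORT, 'run_ledger is append-only'); END;
CREATE TRIGGER run_ledger_no_delete BEFORE DELETE ON run_ledger
BEGIN SELECT RAISE(ABORT, 'run_ledger is append-only'); END;
"""


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def append_run(conn: sqlite3.Connection, row: dict) -> None:
    """Insert one run; row must carry a value for every name in COLUMNS."""
    placeholders = ", ".join(f":{c}" for c in COLUMNS)
    conn.execute(
        f"INSERT INTO run_ledger ({', '.join(COLUMNS)}) VALUES ({placeholders})", row
    )
    conn.commit()


def savings_by(conn: sqlite3.Connection, dimension: str) -> list:
    """Rows of (dimension value, actual cost, counterfactual cost, savings)."""
    if dimension not in ("surface", "model", "user_id"):
        raise ValueError(f"unsupported dimension: {dimension}")
    return conn.execute(
        f"SELECT {dimension}, SUM(cost_eur), SUM(cost_eur_if_cloud), "
        f"SUM(cost_eur_if_cloud) - SUM(cost_eur) "
        f"FROM run_ledger GROUP BY {dimension} ORDER BY {dimension}"
    ).fetchall()


def headline_savings(conn: sqlite3.Connection) -> float:
    """Counterfactual cloud cost minus real cost, summed over every run."""
    (value,) = conn.execute(
        "SELECT COALESCE(SUM(cost_eur_if_cloud) - SUM(cost_eur), 0.0) FROM run_ledger"
    ).fetchone()
    return value
```

The triggers make the append-only property a database guarantee rather than a convention. Whether the production table enforces it the same way is not stated in this chapter.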

Default is local; cloud is escalation. The routing rule that makes the savings number meaningful: every AI invocation defaults to the on-prem llama-swap stack (local-task for chat and tool-use, local-heavy for reasoning, local-coder for code). Cloud (Claude Opus via a cloud-heavy alias) is opt-in per-invocation, not the starting point. Starting at Opus and falling back to local inverts the economics of the ledger. A daily Fattal-commentary run lands on local-heavy by default; a legal-contract review that needs Claude Opus is explicitly flagged in the tool invocation (escalate: "cloud-heavy"). The ledger captures both: cost_eur is 0 for local runs, cost_eur_if_cloud records the counterfactual. That delta is the savings number. See ch41 AI on-prem for the full escalation-trigger enumeration and the usage-counter surface (D-188, D-189).
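The routing rule and the two cost columns can be sketched as follows. The alias names (local-task, local-heavy, local-coder, cloud-heavy) come from the text above. Everything else is an assumption for illustration: the task-kind labels, the function names, and the premise that a cloud run's real cost equals its counterfactual because cloud-heavy is itself the top-tier reference. Cloud prices are passed in as parameters because this chapter does not state them.

```python
from dataclasses import dataclass
from typing import Optional

# Task-kind labels are hypothetical; the aliases are the ones named in this chapter.
LOCAL_ALIASES = {
    "chat": "local-task",
    "tool-use": "local-task",
    "reasoning": "local-heavy",
    "code": "local-coder",
}
CLOUD_ALIAS = "cloud-heavy"


@dataclass(frozen=True)
class Route:
    alias: str
    provider: str  # "local" or "cloud" (assumed labels)


def route(task_kind: str, escalate: Optional[str] = None) -> Route:
    """Default to a local alias; go to cloud only on an explicit per-invocation flag."""
    if escalate == CLOUD_ALIAS:
        return Route(CLOUD_ALIAS, "cloud")
    if escalate is not None:
        raise ValueError(f"unknown escalation target: {escalate}")
    if task_kind not in LOCAL_ALIASES:
        raise ValueError(f"unknown task kind: {task_kind}")
    return Route(LOCAL_ALIASES[task_kind], "local")


def run_costs(
    chosen: Route,
    tokens_in: int,
    tokens_out: int,
    cloud_eur_per_mtok_in: float,
    cloud_eur_per_mtok_out: float,
) -> tuple[float, float]:
    """Return (cost_eur, cost_eur_if_cloud) for one run."""
    counterfactual = (
        tokens_in * cloud_eur_per_mtok_in + tokens_out * cloud_eur_per_mtok_out
    ) / 1_000_000
    # Local runs are booked at 0; cloud runs are assumed to cost the counterfactual.
    actual = counterfactual if chosen.provider == "cloud" else 0.0
    return actual, counterfactual
```

Because there is no code path from a local alias to the cloud alias other than the escalate argument, a run can only land on cloud when the caller asked for it, which is what keeps the delta between the two columns meaningful.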

Pointer. → ch36 OpenWebUI and ch42 MCP tools for emission call sites; ch41 AI on-prem for the escalation policy; ch55 D-179 for the schema lock, D-188 for local-first routing, D-189 for the usage counter.

Principle 3 — Cloudflare is the single MCP surface

The Sodimo stack has exactly one MCP endpoint: a Cloudflare Worker called sodimo-core, fronted by Cloudflare Access with Google Workspace as the identity provider. There is no on-prem MCP server. Humans reach on-prem services through native UIs over Cloudflare Access or Tailscale; code reaches the harness through pull-based wiring: the Worker writes to a Cloudflare Queue, and a small systemd service on the harness drains it. The harness exposes no inbound port for agent traffic.
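The shape of the pull-based wiring can be sketched in memory. QueueStandIn is a hypothetical stand-in for the Cloudflare Queue and does not model its real API; drain_once, the message fields, and the action names are likewise assumptions. The point of the sketch is the direction of control: the harness side only ever calls pull, and nothing calls into the harness.

```python
from collections import deque
from typing import Callable


class QueueStandIn:
    """In-memory stand-in for the queue that sits between Worker and harness."""

    def __init__(self) -> None:
        self._messages: deque = deque()

    def send(self, message: dict) -> None:
        # Worker side: enqueue a requested side effect and return immediately.
        self._messages.append(message)

    def pull(self, batch_size: int) -> list[dict]:
        # Harness side: an outbound request initiated by the puller.
        batch = []
        while self._messages and len(batch) < batch_size:
            batch.append(self._messages.popleft())
        return batch


def drain_once(
    queue: QueueStandIn,
    handlers: dict[str, Callable[[dict], None]],
    batch_size: int = 10,
) -> tuple[int, list[dict]]:
    """Pull one batch and dispatch by action name.

    Returns the number of handled messages and the messages that had no handler.
    """
    handled, unhandled = 0, []
    for message in queue.pull(batch_size):
        handler = handlers.get(message.get("action"))
        if handler is None:
            unhandled.append(message)
            continue
        handler(message)
        handled += 1
    return handled, unhandled
```

In this shape the systemd service would call drain_once in a loop or on a timer. The harness needs only outbound connectivity to the queue, which is why no inbound port is required.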

Why. The on-prem MCP inventory walk turned up empty — every candidate had a better path (human UI, admin over SSH, UI wrapper) or was out of scope. A single Cloudflare surface also sidesteps the Claude.ai OAuth-drop bug affecting self-hosted MCP behind tunnels, and avoids committing to any 2026 MCP gateway tooling, which is too young to pin statically.

How to apply.

New agent-callable capability → add a tool to the sodimo-core Worker. Do not stand up a second MCP endpoint.
New agent-to-harness side effect → the Worker writes to a queue; a small puller on the harness drains it and acts. The harness opens no inbound port.
New human access to an on-prem service → front it with Cloudflare Tunnel + Cloudflare Access (Google Workspace IdP). This is an HTTP-access surface, not an MCP surface, and carries no agent traffic.
Symmetric application — the load-bearing clause. The same pull-based shape applies to anything running on the harness. Paperclip agents do not shortcut to local sendmail even though they physically could; they call the Worker email_send tool exactly the way Claude.ai does. Fedora → Cloudflare → Fedora is a feature, not a cost.
Four properties depend on the symmetric clause and are permanently lost if any producer bypasses the Worker: a unified run ledger, a single sender-authorization policy, one observability schema, and centralized rate-limiting with a dead-letter queue. Local co-location is a physical fact, not an architectural permission to bypass the contract.
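Why the symmetric clause is load-bearing can be shown with a toy model. WorkerStandIn is hypothetical and greatly simplified: it models sender authorization, per-caller rate limiting, and a ledger row per call, and leaves out the dead-letter queue. The caller names claude.ai and paperclip come from the text; the addresses, the limit, and the outcome strings are made up.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkerStandIn:
    """Toy model of the single entry point every producer must call."""

    allowed_senders: frozenset
    max_calls_per_caller: int
    ledger: list = field(default_factory=list)
    queue: deque = field(default_factory=deque)
    _calls: dict = field(default_factory=dict)

    def email_send(self, caller: str, sender: str, to: str, body: str) -> str:
        # Every attempt counts toward the caller's limit, including rejected ones.
        count = self._calls.get(caller, 0) + 1
        self._calls[caller] = count

        if sender not in self.allowed_senders:
            outcome = "rejected_sender"  # one sender-authorization policy
        elif count > self.max_calls_per_caller:
            outcome = "rate_limited"     # one rate limiter
        else:
            outcome = "queued"
            self.queue.append({"action": "email_send", "from": sender, "to": to, "body": body})

        # One ledger and one record shape, whoever the caller is.
        self.ledger.append({"caller": caller, "tool": "email_send", "outcome": outcome})
        return outcome
```

An on-harness agent and Claude.ai hit the same checks and leave the same kind of ledger row. A producer that called local sendmail directly would send the message without ever touching allowed_senders, _calls, or ledger, so none of the properties would hold for that message.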

Pointer. → ch42 MCP tools for the tool surface; ch31 Network and ch33 Cloudflare for the tunnel mechanics; ch44 Paperclip for the symmetric-application worked example; ch55 D-180 for the hybrid-architecture supersession.

How the three interact

Principle 1 (own the code) makes Principle 2 (emit a run record) natural: if Sodimo owns the code, emission is written in from day one. Principle 3 (one MCP surface) makes Principle 2 enforceable: every AI-driven side-effect routes through the Worker, which is where the ledger write happens. Principle 1 and Principle 3 close the loop on gateway tooling: the only plausible reason to adopt a young upstream tool on the harness would be an MCP gateway, and Principle 3 says there is no job for one to do.

These three are the spine. The rest of the manual is how they land on real hardware, real code, and real people.