# Tallie AI — Full Blog Corpus

> An AI control layer for finance, operations, and sales teams that
> runs on customer infrastructure with the model the customer chooses.
> This file concatenates every blog post on tallie.ai into a single
> markdown document for one-shot LLM ingestion. Companion to
> https://tallie.ai/llms.txt.

Source: https://tallie.ai/blog
Posts: 10
Generated: 2026-04-22T10:14:57.145Z

---

# CubeSandbox Lands — the Other Half of Customer-Controlled AI

> Two weeks ago, the open-source agent stack was missing a credible in-your-environment sandbox. With Tencent's CubeSandbox release, the gap is closed — and customer-controlled finance AI just became a much shorter procurement conversation.

Published: 2026-04-22
Updated: 2026-04-22
Author: Archie Norman (Founder, Tallie AI)
Category: Architecture
Tags: cubesandbox, agent-sandbox, microvm, kernel-isolation, customer-controlled, open-source, finance-ai
Canonical URL: https://tallie.ai/blog/cubesandbox

[Tencent Cloud open-sourced CubeSandbox](https://github.com/TencentCloud/CubeSandbox) — a KVM-backed MicroVM sandbox built specifically for AI agents to execute code in. It is Apache 2.0, sub-60ms cold start, under 5MB of overhead per sandbox, and — the part that matters most for procurement — a drop-in replacement for the E2B SDK that until now has been the dominant *closed-source* option for this layer.

If you read [yesterday's note on Kimi K2.6](/blog/kimi-k2-6), this is the other shoe dropping. K2.6 made it credible to run a *frontier-class model* inside your own environment. CubeSandbox makes it credible to run the *agent's code execution sandbox* inside your own environment too. Same month, same architectural direction, both Apache-licensed. The customer-controlled finance-AI story just stopped having a hole in the middle.
## What a sandbox does, and why it has been the awkward layer

When an AI agent does anything more interesting than answering a question — runs a Python script over a CSV, executes a query against a warehouse extract, opens a browser to scrape a CRM, automates a spreadsheet — it has to do it *somewhere*. That somewhere needs three properties:

1. **Isolation strong enough that the worst the agent can do is destroy its own sandbox**, not the host, not other tenants, not your network.
2. **Fast enough that you can spin one up per task** without the latency dominating the agent's run.
3. **Cheap enough that you can run thousands in parallel** when an agent fans out into sub-tasks.

Until very recently, the three properties were a pick-two. Docker containers gave you 2 and 3 cheaply, but the isolation was the polite-fiction version (shared host kernel, well-documented escape patterns). Traditional VMs gave you real isolation, but the boot time and memory overhead made per-task sandboxes uneconomic. E2B and a few other commercial offerings solved the trilemma — for a price, in their cloud, with your data crossing their perimeter.

For most consumer AI products that trade-off has been fine. For finance — where the agent is reading the trial balance, the bank file, or the customer contract — sending all of that through a third-party sandbox provider has been the same procurement blocker as sending prompts to a closed model API. CISOs do not love it, and increasingly will not approve it.

CubeSandbox is the first credible open-source answer to that trilemma: KVM hardware isolation per sandbox, sub-60ms cold start, sub-5MB overhead, eBPF-based network policy. And it runs wherever you can run a KVM-enabled Linux host — your VPC, your on-prem GPU box, your sovereign-cloud tenancy.

## Why E2B-compatibility is the actual headline

Buried in the README is the line that matters most for procurement: **drop-in E2B SDK compatibility**.
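To make "drop-in" concrete, here is a minimal sketch of what the cut-over looks like at the configuration level. The variable name `SANDBOX_BASE_URL` and both endpoints are illustrative assumptions, not the actual E2B or CubeSandbox configuration keys; check each project's README for the real names.

```python
import os

# Start from a clean slate for the illustration.
os.environ.pop("SANDBOX_BASE_URL", None)

def sandbox_endpoint() -> str:
    # "SANDBOX_BASE_URL" is a hypothetical configuration key used for
    # illustration only. Default points at the hosted vendor's API.
    return os.environ.get("SANDBOX_BASE_URL", "https://sandbox-vendor.example/api")

# Before cut-over: the agent code talks to the hosted vendor.
assert sandbox_endpoint() == "https://sandbox-vendor.example/api"

# Cut-over: same SDK, same agent code, new endpoint inside your perimeter.
os.environ["SANDBOX_BASE_URL"] = "https://cubesandbox.internal.example"
assert sandbox_endpoint() == "https://cubesandbox.internal.example"
```

The point is not the three lines of Python; it is that the diff between "vendor cloud" and "our VPC" is a config value, which is what makes a shadow run and a Tuesday cut-over realistic.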
Swap one URL environment variable, and the same agent code keeps working. That changes the migration economics in a way that is easy to underrate. In normal infrastructure terms, replacing a sandbox vendor is a re-platforming project — different SDK, different lifecycle hooks, different file-system model, different network semantics. Three months minimum, often six. With drop-in compatibility, it becomes a config change you can run in shadow for a week and cut over on a Tuesday.

It is the sandbox-layer analogue of what per-task model routing did for the model layer. If your agent platform was wired directly to E2B, this matters to you in the same way that K2.6 matters if your agent platform was wired directly to OpenAI.

## What this completes, alongside K2.6

Stack the two releases together and the customer-controlled finance AI argument becomes *concretely buildable* end-to-end:

- **The model** runs in your environment (open-weights frontier model — K2.6 or one of its peers).
- **The execution sandbox** runs in your environment (open-source MicroVM sandbox — CubeSandbox).
- **The control plane** runs in your environment (your orchestrator, your audit log, your skills definitions).
- **The data** never leaves your environment in the first place (warehouse, ledger, ops systems read in place).

A month ago, that bullet list required at least one closed-source vendor in the data path. Today it does not. That is a material change in what "customer-controlled AI" means as a procurement statement, not just as a pitch-deck slide.

Which is exactly the architecture we built Tallie around — the precise reason a finance estate should not be locked to a single model, a single sandbox, or a single deployment posture. The market keeps producing events that vindicate that bet. K2.6 was one. CubeSandbox is the next. There will be more.

## The honest caveats

This is news commentary, not a sales pitch. The caveats matter.
- **KVM-only, x86_64-only.** CubeSandbox needs a KVM-enabled x86_64 Linux host. No ARM (yet), no macOS dev environments without WSL, no shared-tenancy environments where you cannot get to the hypervisor. For a lot of enterprise infrastructure that is fine. For some it is a real constraint.
- **"Drop-in E2B compatibility" is rarely 100%.** It is usually 95%, with the last 5% being whatever your agent depends on at the edges. Plan a real shadow-test window before any cutover. Drop-in is a starting point, not a finish line.
- **Operational maturity.** A KVM/MicroVM/eBPF stack is more moving parts than a Docker container. The team operating this needs to be comfortable with kernel-level Linux, virtualisation, and network namespaces. "Set and forget" it is not.
- **Vendor-trust questions.** Tencent Cloud is a Chinese cloud provider, and a procurement team in some sectors will have the same conversation about CubeSandbox they had about Kimi K2.6. Apache 2.0 source code is *materially easier* to evaluate than open weights — it is auditable line-by-line — but the conversation still happens. Have an answer ready.
- **Production-at-scale claims.** Tencent says CubeSandbox is "validated at scale in Tencent Cloud production." That is their claim. Until the open-source community has run it in anger for a few months, treat the published cold-start and density numbers as the floor of what to validate, not the ceiling.

None of those kill the thesis. They are the work that makes adopting open-source agent infrastructure a serious engineering exercise — which is precisely what it should be.

## What we'd do this month

If you are running a Tallie deployment, two practical moves:

1. **Stand up CubeSandbox in a non-production environment** alongside whatever sandbox layer your agent currently uses. Run a representative workload in shadow for a week. Compare cold-start latency, density per node, sandbox isolation behaviour under load, and total cost per agent-hour. Vendor benchmarks do not capture the things that bite you in production.
2. **For any workload currently blocked on "we cannot send agent code execution to a third-party sandbox provider,"** this is the layer to evaluate now. Combined with an open-weights model running on the same hardware, you have a route to keeping the entire agent stack inside your perimeter.

If you are not running a Tallie deployment, the procurement question to put to your AI vendor is the same shape as last month's: can your platform run on this sandbox, in our environment, with this model, this week? If the answer requires a roadmap conversation, the answer is no.

## The pattern, again

K2.6 last week. CubeSandbox this week. The model layer and the execution layer of the open-source agent stack each took a serious step forward, in the same month, both fully self-hostable, both with Apache-class licenses on the parts that matter for procurement.

The teams whose architecture treats *both* "which model" and "which sandbox" as routing decisions will absorb both releases over a long lunch. The teams whose stack was wired into a single closed model and a single closed sandbox have a longer week ahead of them.

That is the bet. The market keeps making the case for us.

---

# Agentic Finance Workflows on SunSystems: A Forward-Deployed Pattern

> SunSystems is exactly the kind of system where agentic finance workflows pay off — a stable, structured ledger sitting underneath brittle, swivel-chair processes. Here is the pattern we use to layer agents on top of it without breaking anything.
Published: 2026-04-21
Updated: 2026-04-21
Author: Archie Norman (Founder, Tallie AI)
Category: Implementation
Tags: sunsystems, infor, ssc, agentic-finance, erp, month-end, implementation
Canonical URL: https://tallie.ai/blog/agentic-workflows-on-sunsystems

## TL;DR

- SunSystems is a near-perfect substrate for agentic finance: a stable XML payload API (SunSystems Connect), a fully-typed component surface, and a ledger model that has not changed shape in twenty years.
- The right unit of work is a `skill` mapped to one or more SSC components — not a free-form prompt that 'calls SunSystems'. Skills make every agent action reviewable, scoped, and reproducible against a given Business Unit.
- Mutation safety on a ledger API is the same problem as warehouse SQL safety, with one extra rule: every write goes through `ValidateOnly` first, every Journal Import carries a deterministic `MethodContext`, and every payload is logged against the identity of the originating user.
- The deployment posture that actually clears procurement is the agent runtime sitting next to SunSystems — same VPC or on-prem segment — talking to SSC over the loopback, with the LLM call routed wherever the customer wants it routed.

Most of the agentic-finance conversation in 2026 is happening on top of NetSuite or some Snowflake-shaped data lake. That is not where most finance teams actually live. A surprising number of them live on SunSystems — Infor's mid-market ledger that quietly runs the books for a long tail of hospitality groups, NGOs, real-estate operators, multi-currency professional services firms, and shipping companies. Most of these teams have a stable ledger, an unhappy operations team, and a `Transfer Desk` window left open on someone's second monitor.

We have done a few of these now. The pattern is consistent enough to write down. (For the cloud-native sibling of this post — same shape, different plumbing — see [Agentic Finance Workflows on Xero](/blog/agentic-workflows-on-xero).)
## Why SunSystems is a good substrate for agents

The instinct, when you hear "AI for SunSystems," is sympathy. SunSystems is old. The UI is older. There is a `.NET` thick client involved somewhere. Surely the *modern* stack is the place to start.

That instinct is wrong. SunSystems is, in practice, an unusually good substrate for agentic workflows, for three structural reasons:

1. **The integration layer is fully typed and payload-driven.** SunSystems Connect (SSC) is an XML-payload API that exposes the entire ledger surface — every component you can drive in the UI is reachable as a method call against the same component. Each component has a small set of methods (`Add`, `Update`, `Delete`, `Query`, plus component-specific ones like `Import` and `ValidateOnly`), and each method has an explicit, documented payload schema. For an agent, this is closer to a well-designed SDK than to "an ERP integration."
2. **The ledger model has been stable for twenty years.** Chart of accounts, journal lines with up to ten analysis dimensions, business units, budgets, allocation rules, value labels — the shape has not meaningfully changed since the 5.x line. Skills built against the SSC component surface today will still work after the next upgrade, because Infor maintains backward compatibility on the payload shape with religious discipline.
3. **Security is record-level and pre-existing.** Data Access Groups, Miscellaneous Permissions, Business Unit Administration — the role model SunSystems shipped before "principle of least privilege" was a phrase a vendor would use is exactly the role model an agent should run inside. We do not need to invent an authorisation system. We need to honour the one already in production.

A modern agent runtime talking to a 25-year-old ledger sounds like a mismatch. It is not. It is a stable, structured back end finally getting an interface that can talk to its actual users — through their words, on the screen they are already looking at.
## The wrong way to do this

The tempting first move is "let the LLM call SunSystems." A model with tool access, a tool that wraps the SSC HTTP endpoint, and a system prompt that says "be careful." Demos beautifully. Survives no procurement cycle. Falls over the first time a user pastes a journal description that contains the word "delete."

There are three failure modes that show up in week one:

- **Unbounded mutation surface.** The model can, in principle, emit a payload against any component. The blast radius of a single confused turn includes the chart of accounts, supplier records, allocation rules, and the actual ledger.
- **No reproducibility.** A free-form tool call is a one-off. There is no artifact a controller can review, edit, version, or hand to an auditor. "What did the AI do last quarter?" has no answer.
- **No respect for SunSystems' own controls.** If the agent connects as a single service account, every Data Access Group and Miscellaneous Permission you have spent a decade configuring is bypassed. This is the part that ends the procurement.

The fix is not "a more careful prompt." It is the same shift we apply elsewhere: stop letting the model decide *what kind of action* is happening, and only let it decide *the parameters within an action*. The thinking is the same as the four-layer model we use for [LLM-generated SQL against a warehouse](/blog/warehouse-sql-safety) — extended one extra step, because here the agent can mutate the system of record, not just read from it.

## Skills, not free-form tool calls

The unit of work in our SunSystems deployments is a **skill**: a versioned, code-reviewed artifact that describes one finance operation, the SSC component(s) it uses, the parameters the agent is allowed to set, and the validation that runs before any payload reaches `Import`.
![Skill execution flow: the agent proposes a payload, the skill runs SSC ValidateOnly, the response is shown to the user, and only on user approval does the skill issue the real Import.](/blog/diagrams/sunsystems-skill-flow.png)

A skill is not a prompt. It is a small bundle, deployed alongside the agent runtime, that contains:

- **The component method it wraps.** For example: `AccountAllocations.Update` for re-coding analysis on existing transactions, or `Journal.Import` for posting a correcting journal.
- **A scoped parameter schema.** The skill declares which fields the agent is allowed to populate (e.g. `AnalysisCode3`, `Description`, `AccountRange`) and which are fixed by the skill itself (e.g. the `MethodContext` block, the `BusinessUnit`, the `JournalType`, the suspense account).
- **A `ValidateOnly` step.** Every mutation skill emits its payload against the component's `ValidateOnly` (or, for Journal, the explicit `ValidateOnly` method) before the real `Import`. The validation response is fed back to the agent, which has to acknowledge it before the real call is allowed.
- **A user identity contract.** The user element on every payload carries the SunSystems identity of the *person* the agent is acting on behalf of — never a service account. If that user does not have permission for the operation, SunSystems rejects it. We do not have to build a parallel permissions layer; we just have to honour the existing one.
- **A deterministic audit log line.** Every skill execution writes a structured record: who asked, which skill ran, what payload was sent, what `ValidateOnly` said, what `Import` returned, and which model produced any free-text fields.

This is the part where SunSystems actually helps. The component model is small enough — a few dozen genuinely useful components for a typical deployment — that authoring skills is a tractable, week-of-work exercise rather than an indefinite engineering programme.
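In runtime terms, the bundle above reduces to a gate. The sketch below is hypothetical scaffolding, not Tallie's implementation: real skills emit SSC XML payloads against a live SunSystems endpoint, and every name here is illustrative. What it shows is the control flow that matters: the model only supplies parameters, validation runs first, a human approves, and only then does the import fire.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SkillResult:
    status: str
    detail: str

def run_mutation_skill(
    build_payload: Callable[[dict], dict],         # skill author's code; model fills allowed params only
    validate_only: Callable[[dict], SkillResult],  # stand-in for the SSC ValidateOnly call
    do_import: Callable[[dict], SkillResult],      # stand-in for the SSC Import call
    approve: Callable[[SkillResult], bool],        # a human reviewing the validation, never the model
    params: dict,
) -> SkillResult:
    payload = build_payload(params)
    validation = validate_only(payload)
    if validation.status != "ok":
        # Surfaced verbatim to the user; never silently retried.
        return validation
    if not approve(validation):
        return SkillResult("rejected", "user declined after reviewing validation")
    # The only path to Import runs through validation and approval.
    return do_import(payload)
```

The asymmetry is the point: the skill author fixes the component, the method, and the `MethodContext`; the model decides nothing except the values of the declared parameters.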
Most of the ones a finance team needs already exist by the time we are talking to them.

## The four workflows that pay off first

Across the engagements, four shapes show up over and over. They are the ones we recommend starting with:

### 1. Re-coding analysis on already-posted transactions

The single most common operational ask: "we mis-coded a chunk of transactions to the wrong project / cost centre / dimension; can we fix them without reposting." This is what `LedgerAnalysisUpdate` and `AccountAllocations.Update` exist for, and SunSystems users with the rights have always been able to do it via Transfer Desk. The friction is *finding* the transactions to update — usually via a Q&A pass against the ledger, an export to Excel, a manual review, and a re-import.

The skill version, end-to-end:

- A project lead asks: "move all March travel expenses for the legal team off project `LEG-22` and onto `LEG-23` — that work was rebilled."
- The agent runs a `Journal.Query` filtered by `AccountRange = 6700-6799`, `AnalysisCode1 = LEG`, `AnalysisCode2 = LEG-22`, period `202503`, and presents the 47 matched lines back to the user with the proposed `AnalysisCode2` change inline.
- The user spots two lines that should not move (a re-billable expense already invoiced under `LEG-22`) and excludes them with a click.
- The skill issues a single bounded `LedgerAnalysisUpdate` against the remaining 45 lines, with a `ControlTotal` that guarantees the count and value match what was approved.

The skill physically cannot operate outside the user's Business Unit, the Budget Code is fixed by the skill author, and every line touched carries the user's identity in the audit trail. What used to be a 90-minute Excel-and-Transfer-Desk ritual becomes a two-minute conversation. The agent did the boring part — the search, the materialisation, the bounded update. The control regime did not change.

### 2. Driving period-end allocation runs

Allocation rules in SunSystems are powerful and unloved.
The rules exist; nobody remembers which ones to run, in which order, against which period. The `AllocationRun` component is built for exactly this orchestration, and a skill can wrap it.

A typical engagement: a hospitality group with eleven properties and a shared services overhead pool. Every month-end a controller has to run six allocation rules in a specific order — first the head-office overhead split, then the regional manager apportionment, then the FX hedge cost reallocation — against each of three Business Units, in dry-run mode first, then for real. The current process is a handwritten checklist taped to a monitor.

The skill version: the controller types "run the standard Q1 month-end allocations for Properties UK, Properties EU, Group Services." The agent walks the documented sequence, fires `AllocationRun` in dry-run mode for each, surfaces the proposed entries with totals per allocation, and waits for approval before issuing the real run. If a rule errors (a missing source-account balance, a denominator that resolved to zero), the agent surfaces the SSC error message verbatim, not a paraphrase.

The skill is mutation-light because the allocation rules themselves are unchanged — the agent is firing pre-existing engines, not writing journal lines from scratch. The win is removing the "did I run rule 4 before rule 5 against EU?" anxiety from the close.

### 3. Validated correcting journals from natural language

The "post a journal" use case is the one that makes auditors nervous. The trick is to never let the model post anything; only let it *propose* something. The flow is:

1. The agent gathers the proposed journal as an SSC `Journal` payload.
2. The skill runs `ValidateOnly` against SunSystems and returns the response — every error, every warning, every substituted value — verbatim to the user.
3. The user, looking at exactly what SunSystems said, approves or rejects the post.
4. Only on approval does the skill issue the real `Import`, with a `MethodContext` that the *skill author* defined: posting type, suspense account, layout code, balancing options. Not the model.

A concrete shape: a finance manager at an NGO needs to reclassify £18,400 of grant income that was posted against the wrong donor analysis code in March. She types "draft a correcting journal moving the £18.4k from donor `D-2024-FCO` to `D-2024-FCDO` for March, narrative 'donor recoded — see ticket FIN-1184'." The agent builds a balancing two-line journal payload, runs `ValidateOnly`, and surfaces SunSystems' response: the proposed lines, the GBP total, the period, the analysis substitutions, and one warning that `D-2024-FCDO` is flagged as restricted-fund — would she like the skill to also tag the recipient analysis code as restricted? She approves; the skill `Import`s the journal with the controller's pre-fixed `MethodContext` and writes the SSC response (including the assigned `JournalNumber`) into the audit log.

The model never decides whether a journal is valid. SunSystems does. The model is in charge of phrasing the question.

### 4. Drillable narratives for management accounts

The lightest-mutation workflow and often the most popular: take a `Journal.Query` or `AccountBalance` result, fold in the Business Unit's analysis dimensions, and produce a narrative — by department, by project, by analysis code — that a finance lead can scan in two minutes instead of building a pivot for thirty.

A typical case: a multi-site operations director wants a Monday morning summary of "what moved in week 16 across the regions."
The skill pulls `AccountBalance` for the operating cost accounts across each region's Business Unit, joins to the regional analysis dimension, computes period-on-period and budget-vs-actual deltas, and produces a one-page narrative: "EU region trade spend up £42k week-on-week, driven by three Tier-1 promotional campaigns in DE; UK region utilities down £8k vs last week as the new contract took effect from week 14; Asia region travel costs above budget by £11k, attributable to the Singapore conference (project `MKT-SGAPAC`)."

Every figure in that narrative is a click away from the underlying SSC payload — the journal lines, the analysis codes, the user who posted them. The agent did not invent the numbers. It composed them.

## The deployment posture that actually clears procurement

SunSystems is, almost by definition, not internet-facing. It runs in a private hosted environment or on-prem; SSC sits behind the same firewall as the database. Any integration architecture that involves "let our cloud reach into your network" stops at the security review.

We support two deployment shapes, and both honour the same control regime:

- **Customer-perimeter deployment.** The agent runtime runs inside the customer's own perimeter — same VPC, same datacentre segment, sometimes literally the same Windows Server hosting the SunSystems application service — talking to SSC over the loopback or a private VLAN. The model call is the only thing that crosses the boundary, and it does not have to: for sensitive workloads we route to a self-hosted open-weights model on the same infrastructure as the agent, with no prompt or response ever leaving the perimeter.
- **Tallie-managed cloud warehouse.** For teams that want the operating model without operating the infrastructure, the agent runtime and a SunSystems-aware warehouse run inside Tallie's managed cloud — segregated per customer, with SSC traffic over a dedicated private link to the customer's SunSystems environment. The customer still picks the LLM: their own OpenAI / Anthropic / Bedrock contract, a self-hosted open-weights model in their cloud, or a model hosted by us on their behalf. Routing is per-task and reversible.

In both shapes, three things remain the same:

- **The LLM is a customer choice.** Per task. Reversible. Including "use ours, not yours."
- **Skills, prompts, and the run log are owned by the customer** and stored where they want them stored — their object storage, ours, or both.
- **Every SSC payload still carries the user's identity** and is subject to their existing Data Access Groups and permissions.

The deployment shape changes; the control regime does not. This is the consistency that matters. "Customer-controlled AI" is not a deployment claim — it is a claim about who decides where the agent runs, which model it routes to, and how it touches data. Both deployment shapes preserve all three decisions for the customer.

## What this looks like on day 90

A finance team using SunSystems with this pattern in place is not doing AI. They are doing finance, slightly faster. The visible changes are small:

- A pane next to Transfer Desk, in their existing SunSystems hosted environment, where they can ask for transactions, propose corrections, draft journals, and run allocations in plain English.
- Every action they take with that pane shows up in their existing SunSystems audit, against their existing user identity, scoped to their existing Business Unit.
- Their CISO has a one-page architecture diagram that contains the words "no data egress," "read-only by default," and "every mutation goes through `ValidateOnly`." Their auditor has the run log.

The general ledger is still in SunSystems. The chart of accounts has not moved. The integrations to Cognos, the bank statement loaders, and the consolidations engine still work exactly as they did. Nothing was rebuilt.
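The run log the auditor gets does not need to be exotic. A hedged sketch of the shape, one structured record per skill execution; field names are illustrative, not a fixed Tallie schema:

```python
import json
from datetime import datetime, timezone

def audit_record(user, skill, payload, validate_response, import_response, model):
    # One structured line per skill execution: who asked, which skill ran,
    # what was sent, what the ledger said at each step, and which model
    # produced any free-text fields. All field names are illustrative.
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user,                        # the person's own ledger identity
        "skill": skill,                      # versioned, e.g. "name@semver"
        "payload": payload,
        "validate_only": validate_response,  # recorded verbatim
        "import": import_response,           # recorded verbatim
        "model": model,                      # which LLM wrote the free text
    }, sort_keys=True)

line = audit_record(
    user="j.smith",
    skill="ledger.analysis_update@2.1.0",
    payload={"lines": 45, "analysis_code_2": "LEG-23"},
    validate_response={"status": "ok"},
    import_response={"status": "posted"},
    model="self-hosted-open-weights",
)
assert json.loads(line)["user"] == "j.smith"
```

The design choice worth copying is that the record stores what the system of record said, verbatim, next to who approved it; "what did the AI do last quarter?" becomes a query, not an investigation.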
That is what "AI for SunSystems" looks like when it is taken seriously: not a replacement for the ERP, not a wrapper that ignores its controls, but a thin agent layer that finally lets a structured, twenty-year-old back end be operated through a 2026 interface — without breaking the procurement, the audit, or the upgrade path.

> **Building on SunSystems and want a closer look at this pattern?** This is the kind of engagement we are explicitly built for: forward-deployed engineers, customer-controlled AI runtime, skills authored against your own component surface and your own Business Units. [Get early access](/#join) and we will set up an architecture call.

---

# Agentic Finance Workflows on Xero: A CFO and Project-Lead Pattern

> Xero is the cloud ledger most growth-stage CFOs and project-led businesses actually run on. Here is how we layer agents on top — for the CFO who needs cash, runway, and margin answers in minutes, and the PM tracking live budget variance across a portfolio of projects — without ever owning the keys.

Published: 2026-04-21
Updated: 2026-04-22
Author: Archie Norman (Founder, Tallie AI)
Category: Implementation
Tags: xero, accounting-api, agentic-finance, cfo, project-management, budgets, oauth, implementation
Canonical URL: https://tallie.ai/blog/agentic-workflows-on-xero

## TL;DR

- Xero is a near-perfect substrate for agentic finance for the inverse reasons SunSystems is: a fully-typed REST surface, scoped OAuth 2.0 per write capability, webhooks for event-driven triggers, idempotent writes, and multi-tenant access from a single connection.
- The right unit of work is a `skill` mapped to one or more Accounting API endpoints — with a fixed OAuth scope set, an `Idempotency-Key` on every write, and a `Status: DRAFT` first-pass for anything that posts to the ledger. Free-form tool calls against Xero do not survive App Partner review.
- The five workflows that pay off first for a CFO and a project-led team are: a Monday-morning cash-and-runway briefing, live project P&L with budget variance, customer and project portfolio profitability, multi-entity consolidation across a single OAuth grant, and AR follow-up framed as a working-capital lever.
- Customer-controlled AI on a cloud-native ledger means a different architecture from on-prem: refresh tokens stored in the customer's secret store, the agent runtime in the customer's VPC or Tallie's segregated cloud, the LLM call routed per-task — and the App Partner data-handling story written down before a single skill ships.

We have written about layering agents on top of [SunSystems](/blog/agentic-workflows-on-sunsystems) — a private-hosted, XML-payload, on-prem-shaped ledger. Xero is the other end of the spectrum: cloud-native, REST-shaped, OAuth-mediated, multi-tenant from day one, and overwhelmingly the system of record for the long tail of growth-stage CFOs, project-led services and construction businesses, ecommerce brands, hospitality groups, and the accounting practices that serve them.

It is also the system where the highest-value agent workflows are not the ones AI vendors lead with. The bookkeeping pitches — bank rec, bill capture — are real and have already been spoken for. The space that has been left empty is the one a CFO actually lives in: cash and runway answers in minutes instead of Sunday evenings, live project P&L instead of month-end PDFs, customer-level profitability that is current rather than retrospective, group consolidation that runs on demand instead of in week one of next month.

This post is a sketch of the pattern we use to deliver those workflows on Xero — for a CFO, a finance manager, and a project lead — without the agent ever owning the keys, breaking the OAuth contract, or surviving past a `disconnect` from the connected-apps screen. It looks the same as the SunSystems pattern in shape and very different in plumbing.
## Why Xero is a good substrate for agents

The instinct on Xero is the inverse of SunSystems: people assume cloud-native + REST = trivial integration, and stop thinking. Xero rewards a more careful read. There are four structural reasons Xero is unusually well-suited to agentic finance:

1. **The Accounting API is fully typed and scope-segmented.** Every meaningful object — `Invoice`, `BankTransaction`, `ManualJournal`, `Contact`, `Account`, `TrackingCategory`, `Payment`, `BatchPayment`, `CreditNote`, `Quote`, `PurchaseOrder` — has a documented, versioned schema, an explicit lifecycle (`DRAFT` → `SUBMITTED` → `AUTHORISED` → `PAID`/`VOIDED`/`DELETED` for invoices; `DRAFT` → `POSTED` for journals), and a separate read scope and write scope. Granting an agent `accounting.transactions.read` lets it see invoices and bank lines; it physically cannot create one. That is exactly the granularity an agent capability model needs.
2. **Writes are idempotent by design.** Every mutating call accepts an `Idempotency-Key` header — a UUID the client generates and Xero remembers for 24 hours. Replay the same key, get the same result; you can hammer the API on a flaky network and never produce a duplicate journal. For an agent runtime that retries on failure, this is the difference between safe and dangerous.
3. **Webhooks turn the system of record into an event source.** Subscribe to `INVOICE` or `CONTACT` events and Xero will push a signed (HMAC-SHA256) notification on create or update, scoped per tenant. An agent runtime no longer has to poll; it can react to "invoice authorised" or "contact updated" and run the right skill at the right time.
4. **Multi-tenant is native.** A single OAuth 2.0 grant can authorise an app against many Xero organisations at once, each addressed by a `Xero-tenant-id` header.
For accounting practices running 50, 200, or 800 client organisations, this turns "I want to sweep every client for unallocated bank lines" from an integration project into a `for tenant in tenants:` loop. You can build the same patterns we built for SunSystems on Xero — and you get cheaper composability, real-time triggers, and free retry safety, in exchange for losing the "everything stays inside one VPC" story. That trade is worth understanding before designing skills against it. ## The wrong way to do this The tempting first move on Xero is "give the LLM a tool that wraps `POST /Invoices` and a system prompt that says 'be careful.'" Demos beautifully. Survives no App Partner review. Falls over the first time a user says "raise an invoice for May" and the model raises twelve, because the retry policy kicked in and there was no `Idempotency-Key`. The CFO-shaped failure modes are quieter and more dangerous. The model produces a beautiful cash narrative — anchored on a balance that was correct yesterday but ignores a £180k bank settlement that hit overnight, because nobody told the agent to refresh first. The model claims a project is on budget — using a `TrackingCategory` that was renamed last quarter and now silently excludes 30% of the costs. The model writes a board-pack paragraph that compares actuals to a budget version that was superseded six weeks ago, because the `Budgets` endpoint returns multiple versions and the skill picked the first one. There are six failure modes that show up in week one: - **Over-broad OAuth scopes.** The app requests `accounting.transactions` (the write scope) for everything, regardless of which skill is being run. A controller looking at the consent screen sees "this app can create, edit, and delete invoices, bills, manual journals, and bank transactions across the whole org" and quite reasonably denies it. The agent now has the keys to a kingdom it never needed to enter.
- **Unbounded mutation surface.** Without per-skill capability scoping, the model can in principle emit a `POST` against any endpoint the connection has scope for. The blast radius of one confused turn includes the chart of accounts, contact records, and the live ledger. - **No `DRAFT` first-pass.** Free-form tool calls go straight to `Status: AUTHORISED` because the demo script wanted to show "AI posted an invoice." Real customers find out their first AI-generated invoice was sent to a customer with the wrong tax rate, and the recovery is a credit note plus a phone call. - **No `Idempotency-Key`.** A network blip retries the call. Xero, with no idempotency key to deduplicate against, accepts it. Now there are two invoices, two emails, and one annoyed customer. - **Stale read.** The agent answers "what is our cash position?" against a cached `BankSummary` that is six hours old; the actual answer is materially different because a wages run cleared overnight. CFOs detect this within a week and never trust the system again. - **Silent dimension drift.** The agent uses a `TrackingOption` name in its prompt or skill config; an admin renames the option in Xero; the skill keeps emitting the old name; the project P&L now silently drops everything tagged with the new one. The number on the page is wrong by 30% and looks plausibly right. The fix is not "a more careful prompt." It is the same shift we apply elsewhere: stop letting the model decide *what kind of action* is happening, and only let it decide *the parameters within an action*. The thinking is the same as the four-layer model we use for [LLM-generated SQL against a warehouse](/blog/warehouse-sql-safety) — extended by Xero's specific defences (scope per skill, idempotency per write, `DRAFT` per post). 
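Two of those defences — the deterministic `Idempotency-Key` and the `DRAFT`-first write — are mechanical enough to sketch directly. A minimal illustration; the namespace constant and helper names are ours for the example, not Xero SDK identifiers:

```python
import uuid

# Hypothetical fixed namespace for this deployment's skill runs
# (any stable UUID works; derived here from an example URL).
SKILL_RUN_NS = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/skill-runs")

def idempotency_key(run_id: str, op_index: int) -> str:
    """Deterministic UUID per (skill run, operation index): a retry re-sends
    the exact same Idempotency-Key, so Xero returns the original result
    inside its 24-hour window instead of creating a duplicate."""
    return str(uuid.uuid5(SKILL_RUN_NS, f"{run_id}:{op_index}"))

def draft_first(payload: dict) -> dict:
    """Force the first-pass write to DRAFT whatever the model proposed.
    AUTHORISED is only ever sent by a second, human-approved call."""
    return {**payload, "Status": "DRAFT"}
```

The approval step then re-posts with `Status: AUTHORISED` under a fresh key, so the draft write and the authorising write are deduplicated independently.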
## Skills, not free-form tool calls The unit of work in our Xero deployments is a **skill**: a versioned, code-reviewed artifact that describes one finance operation, the Xero endpoints it calls, the OAuth scopes it requires, the parameters the agent is allowed to set, and the review gate that runs before any write reaches `AUTHORISED`. A Xero skill is a small bundle, deployed alongside the agent runtime, that contains: - **The endpoints it touches.** For example: `GET /BankTransactions` + `GET /Invoices` + `POST /BankTransactions` for a bank reconciliation skill, or `POST /Invoices` + `POST /Attachments` for an AP intake skill. - **The exact OAuth scope set.** The bank-rec skill above runs with `accounting.transactions.read` + `accounting.contacts.read` + `accounting.transactions` (the write scope, only because of the final reconciliation step). The AR follow-up skill runs with `accounting.reports.read` + `accounting.contacts.read` only — it physically cannot post anything. - **A scoped parameter schema.** The skill declares which fields the agent is allowed to populate (e.g. `LineItems`, `Reference`, `DueDate`, `Tracking`) and which are fixed by the skill (e.g. `BrandingThemeID`, `Status: DRAFT` on initial post, `LineAmountTypes`). - **A `DRAFT` review gate.** Every Invoice, Bill, Credit Note, and Manual Journal is created with `Status: DRAFT` first. Xero applies its own tax calculation, GL coding, and validation against the chart of accounts, and the materialised draft is shown back to the user. Only on explicit approval does the skill `POST` again with `Status: AUTHORISED`. - **An `Idempotency-Key` policy.** Every write carries a deterministic UUID derived from the skill run id and the operation index, so retries within 24 hours are safe. - **A per-tenant call budget.** Xero enforces 60 calls/minute and 5,000 calls/day per tenant (10,000 for partner apps); the runtime respects this so a runaway skill cannot starve other skills against the same org. 
- **A user identity contract.** The OAuth grant is held against a specific human user; their Xero permissions are the agent's permissions. A staff user with "Read Only" cannot, via the agent, mutate anything — the API rejects it server-side. - **A deterministic audit log line.** Every skill execution writes a structured record: who asked, which skill ran, which tenant it ran against, what scopes were used, what payload was sent, what `DRAFT` came back, what the user approved, what `AUTHORISED` returned, the `Idempotency-Key` used, and which model produced any free-text fields. Authoring a Xero skill is a half-day exercise once the pattern is in place. The Accounting API is small enough — a few dozen genuinely useful endpoints for any one customer — that the catalogue gets covered quickly. ## The five workflows that pay off first The Xero workflows where agents earn their keep are not the ones marketed loudest. They are the ones a CFO or a project lead would build themselves if they had unlimited time. The five below are the shapes we see compound the hardest across our engagements — they go from "Sunday-evening Notion doc" or "month-end PDF" to a live answer the same week the OAuth grant is signed. The bookkeeping layer (bank reconciliation, AP intake from Hubdoc, AR sweeps) is real and we ship it — but it is the floor, not the ceiling, and we have written it up briefly at the end. The CFO and project workflows are where the curve changes. ### 1. The CFO's Monday-morning briefing The single most-asked-for workflow we hear from growth-stage CFOs on Xero. Most of them spend Sunday evening — or 06:00 Monday — pulling the same numbers into the same Notion page or board-pack template. None of them want to. The numbers all live in Xero already; the friction is that no single screen composes them, and every one needs context the standard reports do not provide. 
The skill version, end-to-end: - **Trigger.** Cron at 06:00 Monday in the org's timezone, or webhook on bank line settlement so the briefing always reflects what cleared overnight. - **Reads.** `BankSummary` across every operating account (multi-currency normalised to the org's base currency at today's rate), `ProfitAndLoss` MTD and YTD against the active `Budgets` version, `AgedReceivables` and `AgedPayables` summaries, the last 13 weeks of `BankTransactions` for trend, the live `Contacts` with overdue balance > org threshold, the org's tracked `BankFeeds` for any account flagged as "operating cash". - **Computations.** Cash position across all accounts. 13-week cash forecast composed from authorised AR (timed by each contact's historical days-to-pay), authorised AP (timed by each supplier's payment terms), recurring payroll/rent/loan-repayment lines, and the remaining budget commitments for the period. Gross and net margin trend vs the previous four weeks. DSO, DPO, and the cash conversion cycle. Top five MoM movements, called out with the specific transactions that drove them. Runway in months at the current trailing-three-month burn. - **Output.** A one-page narrative briefing the CFO reads in four minutes. Every figure is a Xero deep link — click "AR aged > 60 days at £180k" and you are in the Xero contact view, filtered to the right two contacts, in the right currency, signed in as your own user. A worked example. A series-B SaaS CFO with £4.2M ARR and a 22-month cash runway used to spend Sunday evening rebuilding the same picture by hand. Their Monday briefing now reads: *"Cash £6.8M (down £210k WoW, in line with the bi-monthly payroll cycle). MTD revenue £342k against £395k budget — 87% of run-rate target, gap concentrated in two stalled enterprise renewals. Top spend movement: AWS bill up £18k vs prior month, driven by a 30% increase in RDS storage on the staging account (link to bill, link to contact). 
AR aged > 60 days at £180k concentrated across two enterprise customers — Customer X £120k, Customer Y £60k — Customer X chase email already drafted by the AR follow-up skill, awaiting your approval. Runway 21.4 months at trailing-three-month burn — down 0.3 months on prior week."* The CFO reads it on the train. The Monday board call starts with the answer instead of the prep. ### 2. Live project P&L and budget variance The workflow that earns the agents their place on a project-led business — agencies, construction firms, professional services, anyone running 20+ live engagements where the difference between profit and loss is whether anyone notices a cost overrun before week eight. The standard Xero answer for project tracking is the two `TrackingCategories` (most teams use them as project + department) and, for teams on the right plan, the Xero Projects API with timesheets and per-project estimates. Both surface raw data. Neither composes a live, ranked, traffic-light view of the portfolio. The skill, per project: - **Reads.** All `Invoices` tagged with the project's `TrackingOption` for revenue. All `Bills` tagged with the same option for direct cost. `TimeEntries` from the Xero Projects API — converted to cost at each user's loaded rate — for labour cost when timesheets are the source of truth. The `Budgets` endpoint for the project's current-version budget allocation across revenue, cost, and margin. The project's `Estimate` from Xero Projects, if present. - **Dimension hygiene.** The skill resolves `TrackingOption` by ID, not by name; option renames in Xero never silently break the variance number. If a tagged transaction's option ID has been deleted (rather than renamed), it surfaces in an "orphaned tagging" tray rather than disappearing from the totals. - **Computations.** Budgeted revenue vs actual revenue with % consumed. Budgeted direct cost vs actual direct cost with % consumed. Gross margin in absolute and % terms vs the budgeted margin. 
Timeline progress (days elapsed vs project duration) overlaid on cost progress so a project burning 90% of budget in 50% of timeline is flagged. Unbilled revenue: time entries logged but not yet invoiced, plus milestone deliverables passed but unbilled. - **Surface.** Every active project as one row in a portfolio view, traffic-light coloured: green (on track), amber (>75% of budget consumed faster than timeline), red (>90% consumed or already over). Click any row for the underlying Xero invoices, bills, time entries, and the proposed corrective actions ("raise £18k of unbilled time on milestone 4", "investigate £6k subcontractor bill double-tagged across projects A and C"). A worked example. An 80-person creative agency CFO running 80 active client projects: in the first weekly run, the skill flagged 12 projects where actual costs were over 90% of budget with under 60% of the work complete (the scope-creep early-warning), and 6 projects where revenue should have been recognised but no invoice had been raised — £94k of unbilled revenue surfaced and converted into Stripe-ready invoices that week. By month three, the average time from "project crosses 75% of budget" to "PM intervention" had dropped from eleven days (catching it in the month-end PDF) to one (catching it in Tuesday's portfolio view). ### 3. Customer and project portfolio profitability The workflow that turns "we know revenue per customer" into "we know contribution margin per customer." Xero gives you the first cheaply; almost no SMB or mid-market does the second, and the answers are often surprising in directions that change the business. The skill: - **Trigger.** Monthly, after the prior period is locked. Or on demand for board-prep season. - **Reads.** Twelve months of `Invoices` (revenue) grouped by `Contact` and, where applicable, by project `TrackingOption`. Twelve months of `Bills` allocated to customer-serving cost centres or projects. 
Direct cost-of-service estimates from a customer-defined cost-allocation model (loaded labour, infra per seat, support contact-rate × volume — the model is part of the skill config and version-controlled like any other artefact). For agencies and PS firms with Xero Projects, the per-project labour cost from `TimeEntries`. - **Computations.** Revenue per customer (and per project). Contribution margin per customer once true cost-of-service is modelled. Pareto curve — top 10 customers as a % of revenue, of margin, of receivables risk. Customer concentration risk (top-3 share, HHI). Bottom-quartile customers ranked by negative margin — the loss-leaders the company is paying to serve. Top-quartile flagged for expansion. - **Surface.** A ranked customer list with a margin column nobody had before, drilling into the contributing invoices, bills, and time entries. A flagged action list: "renegotiate", "graceful churn", "expand", "investigate". A worked example. A six-person services business with £2.1M revenue: skill ranked 47 customers by contribution margin and identified 11 in the bottom quartile collectively contributing **−£42k margin** (loss-making once true cost-of-service was modelled — three of them had been on the books for years, accepted as "good logos"). Three were renegotiated upward, three were graceful churns, the rest were re-scoped. Net effect after 90 days: +£68k annualised contribution and a 4-point improvement in blended gross margin. Nothing about the data was new — it was always in Xero. The composition was new. ### 4. Multi-entity consolidation for group CFOs Xero's multi-tenant model — one OAuth grant authorising access to many `Xero-tenant-id` orgs — is built for accounting practices, but it is the same primitive that lets a group CFO with a dozen OpCos run consolidation as a workflow rather than a project. 
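Mechanically, that primitive is small. A sketch of the one-grant, many-tenants loop — the `fetch` callback stands in for whatever HTTP client the runtime uses, and is not a Xero SDK call:

```python
from typing import Callable

def tenant_headers(access_token: str, tenant_id: str) -> dict:
    """One bearer token from the single OAuth grant; the org is selected
    per call via the Xero-tenant-id header."""
    return {
        "Authorization": f"Bearer {access_token}",
        "Xero-tenant-id": tenant_id,
        "Accept": "application/json",
    }

def sweep(tenants: list[str], access_token: str,
          fetch: Callable[[dict], dict]) -> dict[str, dict]:
    """Run the same read against every connected org, keyed by tenant."""
    return {t: fetch(tenant_headers(access_token, t)) for t in tenants}
```

A practice-wide sweep is then literally the `for tenant in tenants:` loop mentioned earlier, with the per-tenant rate limits respected between iterations.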
The skill: - **Trigger.** Weekly cadence for the group flash, on demand for board prep, or on a `period locked` webhook from any constituent OpCo. - **Reads.** For each tenant in the group's connected list (one OAuth grant, one consent screen, many tenants), pull `ProfitAndLoss` and `BalanceSheet` for the period, `BankSummary` for cash, the tenant's chart of accounts so the skill can map to the group reporting chart, and a customer-maintained intercompany contact-pair table (which contacts in OpCo A are intercompany counterparties of which contacts in OpCo B). - **Computations.** Per-entity P&L, BS, and cash position normalised to the group reporting chart. Intercompany elimination using the contact-pair table — with mismatches (OpCo A's intercompany receivable does not equal OpCo B's intercompany payable for the same period) flagged as a warning, not silently netted. Group-level consolidated P&L, BS, and cash position. Per-entity contribution to group margin. Per-entity working capital trend with cash buffer days. Region or business-line cuts via consistent tracking categories applied across OpCos. - **Surface.** A consolidated view, with every group-level number clickable down to the originating OpCo and the underlying transactions. Variance flags ranked by materiality. The intercompany mismatch tray. A worked example. A 14-OpCo hospitality group whose monthly close used to take two weeks — manual export from each entity, manual elimination in a master spreadsheet, manual narrative for the board pack. The skill now produces the consolidated view at T+2 business days, with each line clickable down to the originating tenant. Two OpCos consistently operating below the group's 30-day cash buffer are flagged automatically every Tuesday — the CFO walks into the operating review with the answer instead of the prep work, and the conversation is about what to do about it rather than about what the number is. ### 5. 
AR follow-up as a working-capital lever Aged receivables follow-up is often pitched as a clerk's task. For a CFO, it is a working-capital instrument — the difference between a 28-day and a 38-day DSO on a £6M annual revenue book is roughly £165k of permanently tied-up cash. We frame the AR skill as a CFO dashboard with chase actions underneath, not the other way round. The skill: - **Trigger.** Daily AR sweep at 07:00, or webhook on `INVOICE` crossing the overdue threshold. - **Reads.** `AgedReceivablesByContact` from the Reports endpoint. For each overdue contact, twelve months of `Invoices` (paid and unpaid) and `Payments` against them, plus any `CreditNotes` and contact `Notes` indicating disputes. - **Computations.** Per-contact: average historical days-to-pay, current overdue exposure, dispute risk, payment-behaviour segment (prompt-payer slipped, chronic-late payer, dispute, large-balance). For the CFO view: total overdue balance, expected cash inflow next 30 days if every overdue invoice is collected on its historical pattern, total in active follow-up, total in dispute, the DSO impact of clearing the dispute pile. - **Actions.** Drafts a follow-up per segment in the org's voice with the specific invoices, due dates, and a payment link if Stripe/GoCardless are connected. Prompt-payer slips get a soft nudge; chronic-late accounts get a firm reminder with a payment plan offer; disputes route to the account owner with full context. The user approves a batch; the skill sends through the org's existing send domain, records the activity against the contact in Xero `Notes`, and schedules the next escalation. The CFO does not sit in the chase queue. They see a working-capital pane: *"AR overdue £312k. If we collect on historical patterns: £198k inflow next 30 days, £74k next 60. £40k stuck in dispute (3 contacts) — releasing it would cut DSO by 2.4 days. 
Active chases: 17 emails ready for review."* The skill never marks an invoice as paid — that stays a human action. What it does is turn a 90-minute weekly clerk task into a 10-minute CFO decision and a one-click batch approve. ### Bookkeeping workflows that compound underneath These do not headline a board pack, but they are the layer on which every CFO and PM workflow above depends — they are the reason the briefing in workflow 1 is trustworthy at 06:00 Monday. - **Bank reconciliation co-pilot.** Xero's stock "suggestions" engine is a string match. Our skill reasons over the contact's payment history, the open invoice list, the GL code that contact's bills usually go to, and the tracking category implied by the project the line clearly relates to. A two-person agency we shipped this for went from 4 hours/week of bank reconciliation to 25 minutes, with a 96% auto-approve rate after three weeks of contact-history learning. - **AP intake beyond Hubdoc.** Hubdoc captures bills; it does not code them, check them against supplier history, or flag the duplicates it routinely misses. The skill resolves the supplier against `Contacts`, applies the supplier's default account code and tracking category, queries the last 90 days of ACCPAY invoices for duplicate-bill risk, and posts a fully-coded `Status: DRAFT` ACCPAY invoice with the original PDF attached for review. - **Period-close hygiene with a final lock.** A structured, runnable checklist replacing the forty-item Google Doc: pre-close sweep (unreconciled lines, draft items, suspense balances, FX revaluations, depreciation runs), adjusting `DRAFT` `ManualJournal` proposals with calculation evidence, the standard report pack with variance narrative, and finally `POST /Organisation` with `PeriodLockDate` set so post-close edits require an audited unlock. A six-person finance team using this cut their close from nine business days to four. 
- **Multi-tenant advisor sweeps.** For accounting practices: the same skill across every tenant the OAuth grant authorises, in parallel, respecting per-tenant rate limits. Unallocated bank lines older than 14 days, draft invoices past issue date, drafted-and-forgotten manual journals, contacts missing default codes, GST/VAT inconsistencies, periods unlocked despite filed returns — all aggregated into one practice-wide dashboard with one-click jump into each client's Xero. ## The deployment posture that actually clears procurement Xero is, by definition, internet-facing. The integration architecture question is not "how does the agent reach Xero" — that is `https://api.xero.com` from anywhere — but "where do the OAuth refresh tokens, the agent's working data, and the LLM call live." We support two deployment shapes, both honouring the same control regime: - **Customer-cloud deployment.** The agent runtime runs in a small VPC inside the customer's own cloud account (AWS, GCP, Azure). Xero refresh tokens are stored in their secret manager (Secrets Manager, Secret Manager, Key Vault), encrypted with their KMS key. Outbound calls from the runtime are restricted to `api.xero.com` and the LLM endpoint of their choice. The model call is the only thing leaving the VPC, and for sensitive workloads we route to a self-hosted open-weights model in the same VPC, with no prompt or response ever leaving. - **Tallie-managed cloud.** The agent runtime and the per-customer working datastore run inside Tallie's managed cloud, segregated per customer. Refresh tokens are encrypted at rest under a KMS key the customer holds and can rotate; we never see the unwrapped token. The customer still picks the LLM: their own OpenAI / Anthropic / Bedrock contract, a self-hosted open-weights model in their cloud, or a model hosted by us on their behalf. Routing is per-task and reversible. 
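In both shapes the runtime also enforces the per-tenant call budget described earlier (Xero's 60 calls/minute and 5,000 calls/day). A sketch with an injectable clock for testability; a production version would also honour `Retry-After` on 429 responses:

```python
import collections

class TenantBudget:
    """Per-tenant call budget: N calls per rolling minute, M per rolling day."""
    def __init__(self, per_minute: int = 60, per_day: int = 5000):
        self.per_minute = per_minute
        self.per_day = per_day
        self.calls = collections.defaultdict(list)  # tenant -> call timestamps

    def allow(self, tenant: str, now: float) -> bool:
        """Record and permit the call if both windows have headroom."""
        ts = self.calls[tenant]
        in_minute = sum(1 for t in ts if now - t < 60)
        in_day = sum(1 for t in ts if now - t < 86400)
        if in_minute >= self.per_minute or in_day >= self.per_day:
            return False
        ts.append(now)
        return True
```

A runaway skill hits `allow(...) == False` and queues instead of starving sibling skills against the same org.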
In both shapes, four things remain the same: - **The OAuth scopes are the union of what enabled skills require, and nothing more.** The consent screen the user signs is the contract; we cannot widen it without re-consent. - **The LLM is a customer choice.** Per task. Reversible. Including "use ours, not yours." - **Skills, prompts, the run log, and the materialised drafts are owned by the customer**, stored where they want them — their object storage, ours, or both. - **Disconnect from Xero's connected apps screen and every skill stops working.** The kill switch is the user's, not ours. This is the part that matters for App Partner review and for any controller that has read a SOC 2 report end-to-end. "Customer-controlled AI" on a cloud-native ledger is a different architecture from on-prem, but the control claim — *the customer decides where the agent runs, which model it routes to, which scopes it holds, and how it touches data* — is identical. ## What this looks like on day 90 A finance team using Xero with this pattern in place is not doing AI. They are doing finance, slightly faster, with the senior people spending more time on the questions only they can answer. The visible changes line up with the people who actually open Xero: - **The CFO** opens a briefing pane on Monday morning instead of rebuilding it on Sunday night. Cash, runway, margin, top movements, AR risk, and the things that need their attention this week — composed from live Xero data, every figure click-through. The board pack is written from the same skill outputs and is "approve, edit, send", not "open spreadsheet, paste, format, repeat." - **The project leads** open a portfolio view that updates as bills and invoices land, not as the month-end PDF arrives. Projects edge into amber early enough to do something about it. Unbilled work surfaces on a Tuesday, not on the close call. 
- **The bookkeeper** approves batched reconciliations, drafted bills, and a chase queue that has already been segmented — and reserves their attention for the 4% of lines a human should be looking at. - **The CTO** has a one-page architecture diagram with "scoped OAuth," "idempotent writes," "DRAFT-first," "tokens in our KMS," and "customer-chosen LLM" on it. - **The auditor** has the run log: who asked, which skill ran, which scopes were used, what `DRAFT` came back, what was approved, what `Idempotency-Key` was used, and which model produced any free-text fields. The general ledger is still in Xero. The chart of accounts has not moved. Hubdoc, Stripe, GoCardless, the payroll integration, and the practice's QBO bridge for the few non-Xero clients all work exactly as they did. Nothing was rebuilt. That is what "AI for Xero" looks like when it is taken seriously: not a replacement for the ledger, not a marketplace app that ignores its scopes, but a thin agent layer that finally lets a structured cloud back end be operated through a 2026 interface — and lets the CFO and the PMs get the answers they would build themselves if they had unlimited time. > **Building on Xero and want a closer look at this pattern?** This is the kind of engagement we are explicitly built for: forward-deployed engineers, customer-controlled AI runtime, skills authored against your own Accounting API surface and your own tenants. [Get early access](/#join) and we will set up an architecture call. --- # Kimi K2.6 Lands — and Why Open-Weights Frontier Models Change the Finance AI Calculus > An open-weights model that is competitive with the closed frontier on agentic coding is exactly the event LLM-agnostic, customer-controlled architectures were designed for. Here is what we think it means for finance buyers.
Published: 2026-04-21 Updated: 2026-04-21 Author: Archie Norman (Founder, Tallie AI) Category: Architecture Tags: kimi-k2, open-weights, llm-routing, on-prem, finance-ai, model-market Canonical URL: https://tallie.ai/blog/kimi-k2-6 ## TL;DR - Kimi K2.6 is the first open-weights model competitive with GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro on the agentic and tool-using benchmarks that matter for enterprise work. - The structural change isn't any single benchmark score — it's that an open-weights model now sits at the frontier table on long-horizon, tool-calling tasks. - For finance teams this is a procurement and compliance event, not a developer-aesthetics one: open-weights means self-hostable, auditable, and free of per-token vendor pricing curves. - LLM-agnostic, customer-controlled architectures absorb this kind of release as a routing decision, not a re-platforming project. [Kimi K2.6 was open-sourced today](https://www.kimi.com/blog/kimi-k2-6). On Kimi's own benchmarks it sits in the same neighbourhood as GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro on the agentic and coding suites that actually matter for tool-using AI: Terminal-Bench, SWE-Bench Pro, OSWorld-Verified, BrowseComp. On a few of them it leads. Vendor benchmarks are vendor benchmarks. The number to focus on is not any individual score. It is the *category*: an open-weights model is now sitting at the same table as the closed frontier on the kind of long-horizon, tool-calling work that an enterprise agent actually does. That is a structural change in the market, and it is exactly the change we built Tallie's architecture to absorb. ## Why open-weights matters specifically for finance For most categories of software, "open-weights vs. closed API" is a developer-aesthetics argument. For finance, it is a procurement and compliance argument. A closed frontier model can only be consumed over an API hosted by the model vendor. 
That means your prompts, your tool calls, and the data the agent saw all leave your perimeter and land in someone else's logging pipeline — under whatever zero-retention contract you managed to negotiate. For a lot of finance estates, that is the whole reason "AI for finance" has stayed in the proof-of-concept stage. The CISO does not love it. The auditor likes it even less. An open-weights model that is genuinely competitive on agentic work changes that calculus. It can run **inside the customer's environment** — a VPC, a sovereign-cloud tenancy, an on-prem GPU cluster — with no model-vendor middleman in the data path at all. The model becomes a binary you deploy, not a service you call. For regulated finance, healthcare, public-sector, and defence buyers, that is the difference between "we'd love to" and "we can actually buy this." Until now the trade-off was real: you could have frontier capability, *or* you could have your-environment deployment, but not both. Today's release is one of the first credible shots at having both. ## What this proves about per-task model routing We have been arguing for a while — most recently in [LLM-Agnostic by Design](/blog/llm-agnostic-by-design) — that picking a single model provider for a finance function is a structural mistake. The model market moves faster than any procurement cycle. Last quarter's frontier model is this quarter's commodity. Locking your finance estate to one vendor's API is a bet that the relative ordering of the model market will hold for years. It will not. K2.6 is what that argument looks like in the wild. Six months ago, the only finance-grade options for an autonomous coding/reconciliation agent were closed APIs. This morning, that list grew. A team that built its agent against a single closed model has a re-platforming project to evaluate the new option. A team that built against a routing layer has a configuration change. This is the whole point of an LLM-agnostic architecture. 
Not the abstract elegance of it — the concrete fact that when the market moves, you move with it instead of around it. ## What we'd actually do this week If you run a Tallie deployment, the practical move is straightforward: 1. **Add K2.6 to the eligible-model list** for the workloads where the trade-offs make sense — typically long-horizon agentic tasks, multi-step reconciliations, and code-shaped data manipulation. 2. **Run it as a shadow** behind your existing routing for a week. Compare not just accuracy but tool-call discipline, recovery behaviour on errors, and total cost per workflow. Vendor benchmarks do not capture any of these. 3. **Decide per skill**, not per platform. Some skills will be net-better on K2.6, some will not. Per-task routing exists exactly so that this is a normal Tuesday decision and not a board memo. 4. **If you have a regulated workload that has been blocked on "no closed APIs in the data path,"** this is the model to run in your VPC or on-prem footprint. Talk to your security team about it now rather than after a quarter of evaluation. If you do not run a Tallie deployment, the more important takeaway is the procurement question this raises about whatever you *do* run: can your AI vendor add this model to your routing tonight, in your environment, without a re-platforming project? If the answer is "we'll add it to our managed offering once we've validated it," you are not running a control layer. You are renting a model that someone else picked for you. ## The honest caveats This is news commentary, not a sales pitch, so the caveats matter. - **Weights licence.** Open-weights does not always mean "free for any commercial use." Read the licence before you assume your deployment is covered. Some open-weights models carry use restrictions, redistribution rules, or revenue-threshold clauses that change the procurement story. - **Deployment footprint.** A frontier-class model is a frontier-class infrastructure commitment.
The hardware, networking, and operational maturity needed to actually run one of these in production is non-trivial. "On-prem" is a serious project, not a checkbox. - **Training-data provenance.** For regulated buyers, the question of *what data the model was trained on, by whom, in what jurisdiction* is a real one. Open weights make the model deployable in your environment; they do not, by themselves, answer the provenance question. Expect your CISO and your DPO to ask. Have an answer ready. - **Vendor benchmarks are vendor benchmarks.** Until the open-source evaluation community has had a few weeks with the model, treat the public scores as a ceiling on what to expect, not a floor. None of those caveats kill the thesis. They are the work that makes adopting a new frontier model a serious engineering exercise rather than a Slack message. ## The broader pattern Every six to nine months for the last few years, the model market has produced an event that materially changes what the best available model is for some category of work. Each time, the teams that benefit are the ones whose architecture treats "which model" as a routing decision rather than a foundational one. The teams that pay are the ones who wired their product to a single vendor's API and now have to decide whether to re-platform or fall behind. K2.6 is one of those events. It will not be the last one this year. The question is not whether you like Kimi. The question is whether your AI architecture lets you take advantage of the next one of these — and the one after that, and the one after that — without a project plan each time. For a finance function, the answer to that question should be yes by default. That is what customer-controlled, LLM-agnostic infrastructure is *for*. --- **Update (next day):** the second half of this story landed almost immediately.
See [CubeSandbox Lands — the Other Half of Customer-Controlled AI](/blog/cubesandbox) for the execution-layer counterpart to today's model-layer release. --- # Letting an LLM Write SQL Against Your Warehouse — Safely > If your agent can read the warehouse, the right question is not 'can it answer the question?' but 'what is the worst query it could run, and what stops it?' Published: 2025-11-25 Updated: 2026-04-22 Author: Archie Norman (Founder, Tallie AI) Category: Architecture Tags: warehouse, sql, safety, llm, data-access, finance-ai Canonical URL: https://tallie.ai/blog/warehouse-sql-safety ## TL;DR - The right question is not whether an LLM can answer a question against the warehouse, but what the worst query it could emit looks like and what physically stops it. - Defence-in-depth means a read-only role, a query planner with row and column policies, statement-level limits (timeout, byte scan cap), and an allow-list of schemas — not just a clever prompt. - Treat the agent as an untrusted user of the warehouse and design for that, the same way you would for any other application connecting via JDBC. Most "AI for finance" pitches contain, somewhere in the middle, a slide about the agent reading from your warehouse. The slide is usually a green checkmark next to the word "Snowflake" or "BigQuery." The CFO nods. The CISO does not. (For the broader argument that finance teams should keep the data, the model, and the deployment posture in their hands, see [Customer-Controlled AI](/blog/customer-controlled-ai).) The CISO is right to hesitate. Letting a language model emit SQL against the system of record for your business is the kind of capability that earns a vendor the demo and loses them the procurement cycle — unless the vendor has thought hard about what sits between the model and the query planner. Most have not. This post is a sketch of how we think about it. No vendor pitch — just the model. 
## The wrong question The wrong question is "can the LLM write good SQL?" Frontier models can write good SQL. They can also write a `DELETE` with no `WHERE` clause, a sixteen-way join that locks a production table, an exfiltration query that selects every row of a payroll fact table, or a perfectly valid statement against a schema they were never supposed to see. Capability is not the safety story. The safety story is what happens when the model gets it wrong, gets it right against the wrong table, or gets it right but is being manipulated by the user in front of it. The right question is **"what is the worst query the agent could run, and what would stop it?"** That is the question this post is about. ## The threat model in plain English Three threats matter, in order of how easy they are to ignore: 1. **Mutation by accident.** The agent generates a `DELETE`, `UPDATE`, `TRUNCATE`, `DROP`, `MERGE`, `GRANT`, or `CREATE` against your warehouse — because the user asked it to "clean up duplicate rows," because a model regression made it less cautious, or because the prompt contained an instruction it should have ignored. 2. **Mutation by injection.** A user pastes content from an email, a CRM note, or an uploaded PDF that contains text designed to look like an instruction to the agent. The agent treats it as one and emits a query the user never asked for. 3. **Over-reading.** The agent issues a syntactically valid, read-only query that returns far more than it should — a `SELECT *` against a table containing PII, a join that pulls customer data the requesting user is not entitled to see, or a query that fans out into millions of rows and lands in a model context window. Threats 1 and 2 are about *what kind of statement* the agent can issue. Threat 3 is about *what data* a permitted statement can return. The defences are different.
## Defence layer 1: the connection itself is incapable The first and most important rule: the agent's database connection should be **structurally incapable of mutation**. Not "the prompt tells it not to mutate." Not "the tool description says read-only." Incapable. Concretely, that means: - A dedicated database role used only by the agent, with `SELECT` privileges on the schemas it is allowed to read and **no other privileges anywhere**. No `INSERT`, no `UPDATE`, no `DELETE`, no `CREATE`, no `GRANT`, no schema-modification rights, no `EXECUTE` on functions that mutate. - Sessions opened in **read-only transaction mode**, so even a privilege escalation bug or a misconfiguration cannot be used to write. - No access to system catalogs that expose credentials, no access to extensions that reach the network, no access to file-system functions. This is a database administration job, not an AI engineering job. It is also the layer most vendors hand-wave past. If the vendor cannot describe their agent's database role in one paragraph, they have not done it. The instinct to build mutation safety into the prompt is exactly the wrong instinct. Prompts are advisory. Database privileges are not. ## Defence layer 2: a SQL gate in front of the planner A read-only role stops mutation. It does not stop the other ways an LLM-generated query can go wrong: queries that are syntactically read-only but operationally dangerous, queries against schemas the agent should not see, multi-statement payloads designed to slip past simple checks. So the second layer is a **SQL gate** — a small, deterministic piece of code that sits between the model's output and the database, and refuses anything it does not like. The gate is not a model. It is parser-driven and explicit. A useful gate enforces, at minimum: - **Single-statement only.** A query must parse to exactly one statement. No semicolon-separated payloads, no piggy-backed `DROP` after a `SELECT`.
The classic injection pattern dies at the door. - **Read-only statement types only.** `SELECT` and `WITH ... SELECT` are allowed. Everything else is rejected by AST type, not by string matching. ("Don't use regex to detect SQL keywords" is one of those rules that everyone agrees with and half of vendors break.) - **An allow-list of schemas and tables.** The gate knows which schemas the agent is permitted to read from, and rejects any query that references anything else. Adding a new table to the allow-list is a deliberate, reviewable change — not something the model can do at runtime. - **A row-count ceiling.** Every query is rewritten with a `LIMIT` (and, for warehouses that support it, a query timeout and a max-bytes-scanned cap). The agent does not get to decide how much data to return; the gate does. - **No DDL, no DCL, no procedural extensions.** No `CREATE`, no `GRANT`, no `CALL`, no warehouse-specific procedural blocks that can hide mutation behind a function call. Together, the connection layer and the gate layer mean that even if the model produces something unhinged, the *worst* outcome is a rejected query. Not a missing table. ## Defence layer 3: schema discovery before ad-hoc queries The third layer is about discipline, not enforcement. Even with a perfect gate, an agent that is guessing at table names will produce noisy, expensive queries — and a lot of "is this column called `customer_id` or `cust_id` or `account_no`?" thrash. The fix is to make schema discovery a separate, deterministic step. Before the agent issues any ad-hoc SQL, it has access to a curated description of the allow-listed schemas: tables, columns, types, descriptions, and example values. That description is *authored* — not auto-generated from the warehouse — so it reflects what the team actually wants the agent to use, in the language they actually use to describe it. This pattern has two effects.
First, the agent stops guessing, which kills a long tail of failed queries and accidental joins. Second, the team has a reviewable artifact — "here is what the agent knows about the warehouse" — that can be edited, versioned, and audited. The line between "data the agent can use" and "data that exists" becomes a deliberate decision rather than an emergent one. The same versioning pattern shows up in [Skills, Not Prompts](/blog/skills-not-prompts) — both are runtime artifacts the customer owns and can review. ## Defence layer 4: a query log every CFO can read Every query the agent runs must be logged in a form a human can read in plain English: who asked, what skill ran, what statement was emitted, which tables were touched, how many rows came back, and which model produced the SQL. Stored alongside the agent's overall run log, it becomes the artifact you hand an auditor when they ask "what did your AI do, on what data, last quarter?" The log is not a safety mechanism on its own. It is what makes the other three layers *trustable*. If a query slips through that should not have, the log is how you find it. If a regulator asks how you are governing model access to the warehouse, the log is the evidence. We have seen a lot of AI systems where the auditing story is "we sample some queries to a separate logging system." That is not enough. Every statement, every time, with the context that produced it. ## What this rules out — and why that is the point Stack the four layers and there is a class of agent behaviour you can no longer ship: - An agent that "explores" the warehouse on its own, finding tables nobody added to the allow-list. - An agent that runs an `UPDATE` because a user asked it to "fix a wrong status." - An agent that returns a million rows because the model's `LIMIT` clause was ignored. - An agent whose actions on the warehouse cannot be reconstructed after the fact. That is the point.
The capabilities removed by the four layers are exactly the capabilities a finance function does not want an LLM to have. What remains — a model that can compose well-formed, read-only, bounded queries against a curated set of tables, and whose every statement is logged — is genuinely useful. It is also the only shape of warehouse access that survives a real CISO review. ## What to ask a vendor If you are evaluating any AI tool that claims to read from your warehouse, four questions cut through the demo: 1. **What database role does the agent use, and what privileges does it hold?** "Read-only" is a posture, not a configuration. You want the role definition. 2. **What sits between the model's SQL output and the database — and what would it reject?** If the answer is "the prompt tells it not to do bad things," the answer is "nothing." 3. **Where is the list of tables the agent is allowed to read, and who controls it?** If the answer is "the model figures it out from the warehouse," that is a no. 4. **Show me the query log for the last week.** Per statement, with the context that produced it. If the vendor cannot show one, they do not have one. These are not gotcha questions. They are the minimum a finance function should ask before pointing any LLM at the system of record. The vendors that have done the work will answer them in a paragraph each. The ones that haven't will start talking about SOC 2. The bar for letting an LLM write SQL against a finance warehouse is high — and it should be. The good news is that it is a solvable engineering problem. The bad news is that "we trained the model to be careful" is not the solution. > **Going beyond reads?** The same defence-in-depth model extends to mutation APIs — the place agents start posting journals or updating analysis codes. We wrote up the SunSystems version of this in [Agentic Finance Workflows on SunSystems](/blog/agentic-workflows-on-sunsystems). 
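As a closing, miniature illustration of the principle that rejection happens below the model, here is the statement-type and table allow-list idea enforced with SQLite's authorizer hook. This is a sketch under stated assumptions, not the production design: the table names are hypothetical, and a real deployment would use the warehouse's own roles plus a parser-driven gate rather than SQLite.

```python
import sqlite3

# Hypothetical allow-list: the only tables the agent may read.
ALLOWED_TABLES = {"trial_balance", "ap_ledger"}

def gate(action, arg1, arg2, db_name, trigger):
    """Deterministic policy consulted by SQLite for every operation the
    query planner wants to perform. Advisory prompts never enter into it."""
    if action == sqlite3.SQLITE_SELECT:
        return sqlite3.SQLITE_OK
    if action == sqlite3.SQLITE_READ and arg1 in ALLOWED_TABLES:
        return sqlite3.SQLITE_OK  # column read on an allow-listed table
    return sqlite3.SQLITE_DENY    # everything else is rejected, not "monitored"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trial_balance (account TEXT, balance REAL)")
conn.execute("CREATE TABLE payroll (employee TEXT, salary REAL)")
conn.set_authorizer(gate)

conn.execute("SELECT account, balance FROM trial_balance")  # permitted

for bad in ("DELETE FROM trial_balance",   # mutation: denied by statement type
            "SELECT salary FROM payroll"): # valid SQL, table not allow-listed
    try:
        conn.execute(bad)
    except sqlite3.DatabaseError as exc:
        print("rejected:", bad, "|", exc)

# Note: conn.execute() also refuses semicolon-separated multi-statement
# payloads outright, the same single-statement rule the gate enforces.
```

The rejection happens when the statement is prepared, before a single row is touched, which is the property the post argues for: the worst outcome is a refused query.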
--- # What 'Customer-Controlled AI' Actually Means for Finance > Most AI tools ask finance leaders to give up the data, the model, and the deployment posture in one go. There is another way. Published: 2025-11-23 Updated: 2026-04-22 Author: Archie Norman (Founder, Tallie AI) Category: Strategy Tags: customer-controlled-ai, finance-ai, data-sovereignty, cfo Canonical URL: https://tallie.ai/blog/customer-controlled-ai ## TL;DR - 'Customer-controlled AI' means the customer keeps three things: where the agent runs, which model it routes to, and how it touches their data — not just one of the three. - Most enterprise AI tools collapse all three decisions into a single procurement, which is why CFOs end up with vendor lock-in, data egress they can't audit, and a per-seat bill that escalates with usage. - The right default is read-only access against existing systems, deployment inside the customer's perimeter, and per-task model routing the customer can change without re-platforming. When a CFO asks an AI vendor "where does our data go?" they usually get one of three answers. None of them are very good. The first is "trust us, it's encrypted in transit and at rest." The second is "we have SOC 2." The third — most honestly — is "it goes to whichever model provider we route to that day, and we can't really tell you what they do with it." For a finance estate, none of those clears the bar. The general ledger, payroll, customer contracts, and forecasting models are not just sensitive data — they are the data that determines whether the rest of the business is viable. They do not belong in someone else's training corpus, and they do not belong in a vendor's data lake. So we built Tallie around a different default: **customer-controlled execution**. ## What "control" means in three planes It is easy to use the word "control" loosely. We try to be specific. There are three things a finance team needs control over, and most AI products give you none of them. 1.
**The data plane.** The agent reads from your warehouse, your lake, and your accounting and ops systems *in place*. There is no central AI lake. No nightly sync into a vendor cluster. No prompt-by-prompt copy of your trial balance into a model provider's logging pipeline. 2. **The model plane.** You decide which model handles which task. OpenAI for one workload, Anthropic for another, an open-weights model for the regulated ones. Switch when pricing or capability changes — without re-platforming. We make the architectural case for that in [LLM-Agnostic by Design](/blog/llm-agnostic-by-design). 3. **The execution plane.** The agent worker runs where you say it runs. Managed cloud, your VPC, on-prem. The compute boundary is yours. The phased path for getting there is in [On-Prem AI for Finance](/blog/on-prem-ai-for-finance). A "copilot" gives you exactly zero of these. You get a chat box and a prayer. That is fine for drafting an email. It is not fine for closing the books. ## The CFO's instinct is right When finance leaders push back on AI, they are often dismissed as cautious or behind. We think the instinct is correct. The job of a finance function is to maintain a defensible record of what is true. AI tooling that obscures where the data went, which model produced an answer, and what the model was actually allowed to do is not compatible with that job. Customer-controlled AI is not a productivity sales pitch. It is a way to make the answer to "what did the agent do, on which data, with which model" inspectable — by you, by your auditors, by your regulators. ## What this looks like in practice Concretely, a customer-controlled deployment of Tallie has a few shapes: - The agent worker runs inside your VPC or on-prem environment. Data and compute stay inside your perimeter. - Connectors are scoped. Read-only by default. The agent can read the warehouse, but it cannot create journal entries unless you have explicitly approved that capability. 
- Every run is logged: the prompt, the tool calls, the data the agent saw, and which model produced the output. Finance and IT can replay any decision. - Skills — the encoded, versioned definitions of how your team runs a recurring process — make the agent's behaviour predictable. The same close runs the same way each month. More on that delivery model in [Skills, Not Prompts](/blog/skills-not-prompts). None of this slows the team down. If anything, it is the lack of these guarantees that has been keeping finance teams from using AI seriously. ## Where to start If you are evaluating AI for finance, ask the vendor three questions: 1. Can the worker run inside our environment? 2. Can we choose the model provider per workload? 3. Where do prompts and outputs land, and for how long? If the answers are "no, no, and on our servers," you are not buying a control layer. You are buying a copilot, and copilots are not the right shape for a finance function. The bar is rising. Customer-controlled AI is what finance leaders should expect — and what their auditors and CISOs will start to require. We would rather build to that bar from day one than retrofit it later. --- **Coming soon — engineering deep-dive: *Separating the control plane from the execution plane.*** How we split orchestration, state, and audit (the control plane) from the long-running workers that actually call models and touch your data (the execution plane) — and why that split is what makes a "your VPC" or "on-prem" deployment a configuration choice rather than a re-architecture. --- # Skills, Not Prompts: How Forward-Deployed Engineers Codify Finance Processes > A skill is a versioned, reviewable definition of how your team runs a recurring finance process — authored with you, not handed to you in a docs link. 
Published: 2025-11-22 Updated: 2026-04-22 Author: Archie Norman (Founder, Tallie AI) Category: Implementation Tags: skills, forward-deployed-engineering, finance-ops, implementation Canonical URL: https://tallie.ai/blog/skills-not-prompts ## TL;DR - Prompting is fine for individuals; it's a terrible delivery model for a finance, ops, or sales process that needs to run the same way every time. - A skill is a versioned, reviewable definition of how a recurring process is executed — month-end close, pipeline hygiene, RevOps reporting — authored alongside a forward-deployed engineer, not handed over in a docs link. - The forward-deployed engineer is the lever: they wire up data sources, encode the team's actual process, and hand back something that runs deterministically against the same skill on every cycle. The most common AI failure mode we see in finance teams is not technical. It is operational. A platform shows up, the team is told to "prompt it," and within a quarter the tooling has been quietly abandoned because nobody can get it to do the same thing twice. That is the failure mode we designed Tallie to avoid. Two ideas do most of the work: **skills** and **forward-deployed engineers**. Together, they replace prompting with something a finance function can actually live with. ## The trouble with prompting as a delivery model Prompting is a fine way for an individual to coax an answer out of a model. It is a terrible way to deliver a finance process. Three reasons: 1. **Variance.** The same question, asked twice, can produce two different answers. That is fatal for any output a CFO has to sign. 2. **Tribal knowledge.** Whoever wrote the best prompt holds the institutional memory of how that workflow runs. When they leave, the workflow leaves with them. 3. **No review surface.** A prompt is a paragraph in a chat box. There is no diff, no version, no approval, no rollback. Compare that to how every other artifact your finance team produces is governed. 
If you are running a serious finance function on prompts, you are running it on tribal knowledge in a chat window. That is not a controlled environment. ## What a skill is A skill, in Tallie, is a versioned definition of how a specific finance process runs. Concretely, a skill includes: - **The trigger.** When does this run? On a schedule, on demand, in response to an upstream event? - **The data sources.** Which connectors does it touch? With what scope and what permissions? - **The procedure.** What are the steps, in order, with their dependencies and acceptance criteria? - **The model policy.** Which model providers are eligible to run which steps, and what is the routing logic? See [LLM-Agnostic by Design](/blog/llm-agnostic-by-design) for why we treat that as a per-task decision rather than a platform choice. - **The outputs.** What artifacts does it produce, in what format, with what validation? - **The audit hooks.** What gets logged, who can review it, and how is it stamped to the record? It is a definition, not a prompt. It is reviewable, diff-able, version-controlled, and ownable by the customer. And it runs the same way every time — because that is the whole point of a finance process. A skill is not magic. It is what "how we do month-end close here" looks like when you take it out of a runbook and a few people's heads and put it into a system the agent can execute. ## Why we send forward-deployed engineers Skills are the artifact. The hard part is *authoring* them. Most finance teams know how their close works in the way that any expert knows their domain — well enough to do it, not necessarily well enough to write it down precisely. Asking the team to author skills from scratch, against an empty platform, is a setup for failure. We have watched it happen at other vendors. The platform is good. The team is good. The translation between them is the bottleneck. So we send a forward-deployed engineer. 
They embed with the finance team, often for the first month and intermittently after. Their job is to: - Sit through the actual close, the actual board pack prep, the actual reconciliations. Watch what *really* happens, not what the runbook says. - Wire up the data sources and connectors with the right scopes. Read-only first — see [Letting an LLM Write SQL Against Your Warehouse — Safely](/blog/warehouse-sql-safety) for the safety model we apply at the connector layer. - Author the initial set of skills from our templates — adapted to your chart of accounts, your entities, your terminology, your sign-off chain. - Hand the skills over to your team in a state where they can be reviewed, modified, and owned internally. This is not a managed service. It is a one-time investment in translation. After it, your team owns the skills, can edit them, can version them, can decide which model providers to route to. The agent is yours to run. ## The economics People sometimes expect this model to be expensive. In practice it is the cheapest way to actually land an AI deployment in a finance function — because the alternative is not "cheaper deployment," it is "no deployment that survives a quarter." The right comparison is not "FDE engagement vs. SaaS-only platform." It is "FDE engagement that produces a working, owned set of skills" vs. "SaaS-only platform that gets quietly shelved after the first month-end." We have seen the second outcome enough times to optimise hard against it. ## A worked example A close skill we author with most customers in week three or four typically: - Pulls trial balance and sub-ledger data from the warehouse. - Reconciles bank, AP, AR, and intercompany against defined tolerances. - Flags variances against prior period and forecast. - Produces a draft commentary, in your house style, that goes to the controller for review. - Logs every step, every model call, every data point used. 
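For illustration, a close skill of that shape could be written down as a plain, reviewable data structure. The field names and values below are hypothetical, not Tallie's actual schema; the point is that the definition is diff-able and can be validated before it ever runs.

```python
# Illustrative only: every field name and value here is a hypothetical
# example of the shape a skill definition might take, not Tallie's schema.
month_end_close = {
    "name": "month-end-close",
    "version": "1.4.0",                    # versioned: diff, review, roll back
    "trigger": {"schedule": "0 6 1 * *"},  # first of the month, 06:00
    "sources": [
        {"connector": "warehouse", "scope": "read-only",
         "tables": ["trial_balance", "bank_txns", "ap_ledger", "ar_ledger"]},
    ],
    "procedure": [
        {"step": "pull-trial-balance"},
        {"step": "reconcile", "targets": ["bank", "ap", "ar", "intercompany"],
         "tolerance_pct": 0.5},
        {"step": "flag-variances", "against": ["prior-period", "forecast"]},
        {"step": "draft-commentary", "style": "house", "review": "controller"},
    ],
    "model_policy": {"default": "hosted-frontier",
                     "sensitive_steps": "self-hosted-open-weights"},
    "audit": {"log": "every-step", "retention_days": 2555},
}

# Because the skill is data, governance is a check, not a convention:
# a missing audit block fails review before the agent ever sees it.
REQUIRED = {"name", "version", "trigger", "sources",
            "procedure", "model_policy", "audit"}
assert REQUIRED <= month_end_close.keys()
print("skill ok:", month_end_close["name"], month_end_close["version"])
```

Contrast this with a prompt in a chat box: there is nothing to diff, nothing to validate, and nothing for the controller to edit in month two.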
The first version is authored by the FDE in collaboration with the controller. By month two, the controller is editing the skill themselves. By month three, the team has authored two more skills on their own — typically board pack prep and a recurring lender update. That is the path we want every customer on. Skills, owned by you, authored with help, governed by default. Prompting is fine for individuals. Skills are how a finance function actually adopts AI without giving up control. --- **Coming soon — engineering deep-dive: *Skills as runtime artifacts, not prompts.*** How a skill is actually packaged, shipped with the deploy bundle, and loaded by the worker at runtime — kept structurally separate from application code so it can be reviewed, versioned, and rolled back the same way any other production artifact is. --- # On-Prem AI for Finance: A Practical Path for Regulated Teams > VPC and on-prem are not exotic any more. Here is what a phased deployment actually looks like for a finance function with real residency, regulator, and audit constraints. Published: 2025-11-21 Updated: 2026-04-21 Author: Archie Norman (Founder, Tallie AI) Category: Deployment Tags: on-prem, vpc, regulated-industries, deployment, governance Canonical URL: https://tallie.ai/blog/on-prem-ai-for-finance ## TL;DR - VPC and on-prem AI deployments are not exotic — for regulated finance teams (banks, insurers, healthcare finance, public sector) they are increasingly the only credible deployment posture. - A managed-cloud AI deployment crosses three trust boundaries (yours, the vendor's, the model provider's); a VPC or on-prem deployment collapses that to one. - The phased path is: read-only connectors against existing systems first, then governed write capabilities per skill, then model routing including self-hosted open-weights models — all inside the customer perimeter. 
For a long time, "AI in finance" implicitly meant "data goes to a SaaS vendor, gets routed to a model provider, and we hope the contract holds." For a lot of teams — banks, insurers, healthcare finance, parts of the public sector — that has been a non-starter from the beginning. It does not need to be. **On-prem and VPC deployments of finance AI are practical today**, and for regulated environments they are increasingly the only credible path. This post is a sketch of what that deployment actually looks like, and how to phase it. ## What changes when the worker runs in your environment A managed-cloud AI deployment usually has the platform sitting in the vendor's tenancy, calling out to model providers, with the customer's data flowing in and out of the vendor's perimeter. Even with strong contracts, your data crosses three boundaries: yours, the vendor's, and the model provider's. A customer-controlled deployment changes this. The agent worker — the component that orchestrates skills, calls connectors, and routes to models — runs **inside your VPC or your data centre**. Concretely: - Connectors to your warehouse, lake, and accounting systems are on your network. Source data does not leave the perimeter. - Model calls to self-hosted weights stay inside the perimeter entirely. - Model calls to external providers (OpenAI, Anthropic, etc.) leave the perimeter only when *you* have approved that workload to do so, with the residency, retention, and processing terms negotiated upstream. - Logs, audit trail, and skill definitions live on your storage. Your SIEM, your retention policy, your eDiscovery posture. This is not a bolt-on. It is the deployment shape we built the platform around, because it is the only one that works for the customers we built it for. ## The control surface, in plain terms A regulated-environment deployment of Tallie has four control surfaces that a managed deployment does not need: 1. 
**Network policy.** What egress is the worker allowed to make, to which endpoints, on which ports? Default-deny, allowlisted, logged. 2. **Model policy.** Which model providers are eligible for which workloads? For sensitive workloads, often only self-hosted weights are eligible. The router enforces this; it is not an organisational guideline. The architectural argument for treating this per task is in [LLM-Agnostic by Design](/blog/llm-agnostic-by-design). 3. **Data policy.** Which connectors are read-only, which can write, and which capabilities are simply unavailable in this environment? Capability-scoped from day one — see [Letting an LLM Write SQL Against Your Warehouse — Safely](/blog/warehouse-sql-safety) for how that's enforced at the connector layer. 4. **Audit and observability.** Every run, prompt, tool call, model invocation, and output is recorded into your logging stack — not the vendor's. Your audit team owns the trail. These are the things a CISO and a head of risk will ask about. Having clear answers, with defaults that are conservative and explicit, is what makes the deployment shippable. ## A phased plan Big-bang AI deployments do not work in regulated finance functions. The compliance review alone tends to outlast the enthusiasm. We recommend a four-phase plan, and we typically run it with the customer over four to eight weeks. **Phase 1 — Scope.** A forward-deployed engineer embeds with the finance and IT teams. We map data sources, the recurring processes worth encoding as skills, and — critically — the capabilities the agent should *never* have. The output of this phase is a written deployment shape that security, finance, and IT all sign. **Phase 2 — Deploy.** The agent worker is deployed to the agreed environment (managed cloud, VPC, or on-prem). Read-only connectors are wired in. Network and model policies are applied. No skills run yet. The deployment is reviewed end-to-end against the security checklist. 
**Phase 3 — Author skills.** Read-only finance answers land first — analysis, recurring reporting, variance commentary. Skills are authored from our templates and adapted with the controller. Each skill is reviewed before it is allowed to run; outputs are reviewed in the first cycle before going to the broader team. **Phase 4 — Go live.** Staged rollout to the finance team. Read-only analysis and recurring reporting first. Broader capabilities — write actions, external integrations, anything with operational impact — are added on the customer's terms, one capability at a time, each with its own review. The point of the phasing is not to slow things down. It is to make every step approvable, so the deployment does not stall halfway through compliance review. ## What "regulated" actually constrains A few patterns we see specifically in regulated environments: - **Data residency.** Some workloads must run on infrastructure in a specific jurisdiction. Self-hosted weights and a regional VPC handle this cleanly; cross-border SaaS does not. The release of an open-weights model that competes with the closed frontier is what makes this practical now — see the [Kimi K2.6](/blog/kimi-k2-6) writeup. - **Auditor reproducibility.** "Show me how this number was produced" needs a deterministic-enough answer. Skills, with versioned definitions and per-run logs, provide it. Ad-hoc prompting does not. - **Capability allowlists.** Regulators increasingly want to see that AI cannot take certain actions, full stop — not "is unlikely to," but "is structurally unable to." Capability-scoped connectors are how you make that statement true. - **Vendor portability.** Some teams need a written exit plan: if the vendor goes away, can the customer continue running the agent on their infrastructure with their model providers? Customer-controlled execution is what makes that answer "yes." These constraints are not blockers. They are design inputs. 
Building to them from the start is dramatically cheaper than retrofitting them onto a SaaS platform that was designed without them in mind.

## The takeaway

On-prem and VPC AI for finance is not the bleeding edge any more. It is the credible default for any team with regulator-, residency-, or audit-driven constraints. The work is in the deployment shape and the phasing — not in waiting for the technology to catch up. It already has.

If your finance team has been told that "real" AI tooling means SaaS-only and a single model provider, that advice is out of date. The shape of the deployment is now a choice. Make it on your terms.

---

# LLM-Agnostic by Design: Why Finance AI Shouldn't Be Locked to One Vendor

> Routing per task — not per platform — is how you keep the cost curve, the capability curve, and the procurement story under your control.

Published: 2025-11-20
Updated: 2026-04-22
Author: Archie Norman (Founder, Tallie AI)
Category: Architecture
Tags: llm-routing, llm-agnostic, model-selection, architecture
Canonical URL: https://tallie.ai/blog/llm-agnostic-by-design

## TL;DR

- 'LLM-agnostic' is a property of the system, not a marketing claim — most products that say it have a single integration with optionality on the roadmap.
- True agnosticism means routing per task across hosted, open-weights, and self-hosted models, with the routing rules versioned alongside the rest of the application.
- For finance, the payoff is three-fold: the cost curve stays under your control, the capability curve tracks the market without re-platforming, and procurement keeps a real BATNA against any single vendor.

The model market is moving faster than any procurement cycle. Last quarter's frontier model is this quarter's commodity, and next quarter's regulated workload will probably want something else entirely. Picking a single LLM provider for your finance function and wiring everything to it is a structural mistake.
Tallie is **LLM-agnostic by design** — not as a marketing line, but as a deployment property. Here is what that means in practice and why it matters more for finance than for almost any other function.

## "LLM-agnostic" is a property of the system, not a slide

A lot of products claim to be model-agnostic. What they usually mean is: "we currently call OpenAI, and one day we might call Anthropic." That is not agnosticism. It is a single integration with optionality on the roadmap.

A genuinely LLM-agnostic system has three properties:

1. **Per-task routing.** A long-context summarisation task can go to one provider; a structured-output reconciliation can go to another; a sensitive board-pack draft can stay on a self-hosted open-weights model that never leaves your VPC. The router is part of the platform, not part of a single workflow.
2. **Replaceable provider, stable contract.** When you swap GPT-style provider A for provider B, the skills, audit trail, connectors, and access controls do not change. The model is a runtime detail, not the architecture.
3. **Cost and capability transparency.** You can see, per workload, which model handled the call, what it cost, and what the latency looked like. You can reroute based on those numbers, not on a vendor's promise.

Without all three, you do not have agnosticism — you have a thin abstraction over a single bet.

## Why finance, specifically

Finance workloads are unusually heterogeneous. A single month-end might involve:

- **Long-context analysis** of board materials, contracts, and policy documents.
- **Structured extraction** from invoices, statements, and trial balances.
- **Numerical reasoning** over reconciliation deltas and variance analysis.
- **Drafting** in a tightly controlled tone — board commentary, audit responses, lender updates.
- **Sensitive workloads** — payroll commentary, restructuring analysis, M&A — where the data should not leave the perimeter at all.

No single model is best at all of these.
The frontier model that writes the cleanest commentary may be slow and expensive for batch reconciliations. The cheapest extraction model may not be defensible for audit-touching outputs. The right answer is to route — and to keep routing as a first-class capability of the platform.

## The procurement and risk angle

There is a second reason LLM-agnosticism matters that has nothing to do with capability. If your AI strategy is built on top of a single vendor, you have inherited that vendor's:

- **Pricing curve.** Token costs change. Sometimes a lot.
- **Roadmap risk.** Models get deprecated. Behaviour changes between versions in ways that break tightly-coupled workflows.
- **Compliance posture.** A change in a vendor's data handling, training policy, or regional availability can disqualify them from your stack overnight.
- **Geopolitical exposure.** Some customers cannot route certain workloads through certain jurisdictions, full stop.

A finance function that has hardwired itself to one provider is one provider decision away from a forced re-platforming. Agnosticism is, more than anything, an option-value play.

## What to demand from a vendor

If you are evaluating an AI platform for finance, the bar should be:

1. Show me how a workload is routed today, and how I would route it differently tomorrow.
2. Show me what happens to a skill when I swap the underlying model — does it break, does it behave the same, what changes?
3. Show me self-hosted as a first-class option, not a roadmap item.
4. Show me the per-call audit log: which model, which prompt, which tool calls, which output.

If the answer to any of those is hand-wavy, the platform is more locked-in than it sounds.

## The Tallie default

For Tallie deployments, the default is a small, governed router that maps task type to model provider, with the customer choosing the providers and the policy.
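A minimal sketch of what such a router could look like. Everything here is illustrative, not Tallie's actual implementation: the task types, provider names, and the shape of the audit record are invented. The structural point is that the routing table is data, so rerouting a workload is a config change rather than a re-platforming, and every call leaves a per-call audit record.

```python
from dataclasses import dataclass

# Illustrative routing table. Task types and provider names are
# hypothetical; in practice this is versioned config, not code.
ROUTES = {
    "long_context_summary":  "hosted-provider-a",
    "structured_extraction": "hosted-provider-b",
    "sensitive_draft":       "self-hosted-weights",  # never leaves the VPC
}

@dataclass
class Call:
    task_type: str
    provider: str

def route(task_type: str, audit_log: list) -> Call:
    # Unknown task types fall back to the most restrictive option,
    # not the cheapest one.
    provider = ROUTES.get(task_type, "self-hosted-weights")
    call = Call(task_type, provider)
    # Per-call audit record: which provider handled which task.
    audit_log.append({"task": task_type, "provider": provider})
    return call
```

Swapping provider B for provider C is then a one-line change to the table; the skills, connectors, and audit trail around it are untouched.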
Read-only finance answers, by default, route to a provider the customer has already approved on their data residency and processing terms. Sensitive workloads route to self-hosted weights inside the customer's environment.

This is not exotic. It is what an LLM-agnostic system looks like when you build it for a finance team rather than for a general developer audience. And it is what keeps the model market a tailwind for you, instead of a source of lock-in.

---

**See also:** the architectural argument here was reinforced almost immediately by two consecutive open-source releases — [Kimi K2.6](/blog/kimi-k2-6), an open-weights model competitive with the closed frontier on agentic coding, and [CubeSandbox](/blog/cubesandbox), a Tencent-released open-source MicroVM sandbox compatible with the E2B SDK. Together they make the "open-weights model + open execution sandbox + your-environment deployment" stack buildable end-to-end without a closed vendor in the data path.

---

# Why Your Operational Reporting Is Lying to You

> Ledger reality, pipeline reality, and ops reality each live in their own silo. Here's the disconnect — and how an AI control layer fixes it without ripping out your stack.

Published: 2025-11-18
Updated: 2026-04-23
Author: Archie Norman (Founder, Tallie AI)
Category: Operational Intelligence
Tags: finance, operations, sales, revops, project-accounting, ai-platform
Canonical URL: https://tallie.ai/blog/operational-reporting-disconnect

## TL;DR

- Operational reporting fails because three independent realities — ledger, pipeline, and ops — never reconcile in real time, not because any single team is wrong.
- Stitching them together at the BI layer is the wrong fix; an AI control layer that reads each system in place lets reconciliation happen on demand without a central data lake.
- The win for finance, sales, and operations is the same: questions like 'which accounts are bleeding margin?' or 'which deals look healthy on paper but aren't getting delivered?'
become single queries instead of cross-team archaeology.

Every operator knows the feeling. You look at the numbers — revenue is up, the bank balance looks healthy, the pipeline shows coverage — but you have a nagging suspicion that some accounts are bleeding margin, some deals are stalled in ways the CRM doesn't show, and some delivery teams are quietly underwater. You don't know which.

## The Disconnect

The problem isn't your accountant, your AE, or your project manager. It's the fundamental disconnect between three views that almost never reconcile in real time:

- **Ledger reality** — what's invoiced, paid, and recognised.
- **Pipeline reality** — what's qualified, weighted, and committed.
- **Operational reality** — what work is actually happening, by whom, on what.

In a finance ledger, "Cost of Goods Sold" looks like a single payroll line. The ledger doesn't know that Sarah spent 40 hours on a non-billable pitch and 10 hours on a paid engagement. In a CRM, a deal at 80% probability looks healthy — the CRM doesn't know that the customer hasn't replied to four emails. In a project tool, an "on-track" status looks fine — the project tool doesn't know the client's PO never arrived.

> "You can't manage margin if you can't measure effort. You can't manage pipeline if you can't measure the silence between meetings. And you can't do either if your timesheets, invoices, and CRM each live in their own silo."

## The Three Pillars That Have to Reconcile

To run a business cleanly across finance, ops, and sales, three data points have to triangulate continuously:

- **Booked revenue** — what you sold (CRM, contracts).
- **Actual cost and effort** — time, expenses, delivery state (ops, project tools, time tracking).
- **Recognised revenue and cash position** — what you've earned and collected (finance ledger, bank feeds).

The teams that get this right read all three from the same model.
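A toy illustration, with made-up accounts and numbers, of why the triangulation is mechanically simple once the three views are readable side by side: it is a join keyed on account, not archaeology.

```python
# Hypothetical data; in practice each dict is a read from a live system.
booked     = {"acme": 120_000, "globex": 80_000}   # CRM: booked revenue
effort     = {"acme": 95_000,  "globex": 30_000}   # ops: cost of delivery effort
recognised = {"acme": 60_000,  "globex": 80_000}   # ledger: recognised revenue

def triangulate(booked, effort, recognised):
    """Join the three views by account and surface the two gaps that matter."""
    return [
        {
            "account": account,
            # Bleeding margin: recognised revenue minus delivery effort.
            "margin": recognised[account] - effort[account],
            # Healthy on paper but not getting delivered/recognised:
            "recognition_gap": booked[account] - recognised[account],
        }
        for account in booked
    ]
```

With these toy numbers, "acme" is exactly the account the dashboards miss: booked revenue looks strong while margin is negative and most of the booking is unrecognised.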
Most teams don't, because the integrations don't exist, the integrations exist but break weekly, or the data is technically there but nobody's authored the queries that make it useful.

## What Tallie Does About It

Tallie is an AI control layer that reads from your accounting, CRM, warehouse, and ops systems in place — read-only by default, no central AI lake, nothing copied into a vendor's environment.

A forward-deployed engineer embeds with your team to author the *skills* that encode how you actually reconcile these three views: a finance close skill, a pipeline-hygiene skill, a project-margin skill. Authored once, the agent runs them the same way every time — see [Skills, Not Prompts](/blog/skills-not-prompts) for why this matters more than a chat interface — on the model you've chosen ([LLM-Agnostic by Design](/blog/llm-agnostic-by-design)), on the infrastructure you've chosen.

The finance lens is the easiest one to demonstrate, which is why this post leads with it. But the same disconnect breaks RevOps reporting and ops dashboards — and the same control layer fixes both.

---

**Engineering deep-dive: *[Letting an LLM Write SQL Against Your Warehouse — Safely](/blog/warehouse-sql-safety).*** The reconciliation work above only earns the word "safely" if the agent's access to your warehouse is gated, audited, and provably read-only. We walk through the layers we put between a model and a query, and what to demand from any vendor that claims their agent can read your data "in place."

---