Ten Principles for Production AI Systems

By Binu Mathew, CTO, Tekion — 2026-05-25

TL;DR: Tekion's CTO shares ten hard-won principles for building AI systems that actually work at production scale — not just in demos. The core idea: vibe coding (prompting AI without a formal spec) breaks down fast on complex, multi-file codebases. What works instead is spec-driven development, where humans define precise intent and AI generates the implementation. The ten principles cover the full stack: write specs before code, use AI to generate rule-based logic (then run rules at runtime instead of the model), build three independent safety layers, verify high-stakes decisions across multiple AI model families, place instructions at the point of action not the top of the document, approve agent plans not individual calls, build your eval set before your agent, monitor six quality signals continuously, surface every failure visibly, and treat token budgets as efficiency targets not hard limits. The connecting thread across all ten: every spec, test, and rule the AI writes today is a model call it doesn't make tomorrow — the system gets cheaper and more reliable over time, not more expensive.

If you read one thing. The defining shift of 2026 is that AI now writes the rule-based logic that runs underneath every AI system. Spec-driven development supplies the intent; AI generates the implementation and the rules; what compounds is the artifact corpus — specs, evals, rules, audit trails — under one Golden Thread. The ten principles below are how we operate.

Tekion is an AI-native company. Being cloud-native from the start is what made this possible — it gave us the foundation to embed AI into every layer of the platform, not as a feature bolted on top. At Tekion One this June, T1 is what we are unveiling — the AI-native interface that drives every dealer workflow: sales, finance and insurance, service, parts, accounting, fixed operations, and more. We have built the underlying AI-native platform over the last couple of years, and the ten principles below are the engineering discipline behind what runs through it.

To get here, Tekion went much further than adding AI to parts of the product. We embedded AI into our entire development process and built a platform-wide framework that lets every product team leverage AI for every feature they ship. Along the way, we pushed AI tools far past what the industry currently calls vibe coding. Vibe coding works for a Streamlit dashboard built for a single use case, a one-off script to parse a CSV, a prototype REST endpoint, a scratch tool to convert formats — confined work where every constraint sits in front of the model at once. It breaks for a fifty-file refactor where every module shares conventions, a platform with cross-cutting authentication and observability, a codebase where one wrong abstraction propagates across hundreds of files. We have been operating across that gap at platform scale.

These are the ten principles we have learned operating across it. Each one names a discipline we adopted because building AI-native software at this scale required it; each one points at a specific way AI-native development goes wrong without the discipline in place. Together they describe how we believe production AI systems should be built. The downloadable one-page checklist is at the end — copy it into your team's runbook today.

Act I — The Method

Principle 1 — Spec-Driven Development

The work the AI cannot do — defining what should be built and what counts as built — is now the work that matters most. Specs are the real product; code is what the spec generates.

The most important shift in software engineering in 2026 is the return of the spec to center stage. When a specification is precise enough, AI can generate most of the implementation from it. The spec is no longer a document developers write to give other developers context after the fact. The spec is the intellectual property. Code becomes a derivative artifact — important, but secondary to the spec it was generated from.

Tekion has been operating on this premise for longer than most, and we have built a rigorous variant of the pattern: the Intent Specification Document (ISD), a structured fifteen-section template that captures a system's intent, behavior, quality gates, escalation rules, and testing contract end-to-end. ISDs are evidence-rated at four trust levels (from verified-internal down to LLM-inferred), they separate app-independent intent from Tekion-specific implementation mapping (so the same spec is portable across implementation paths), and every ISD passes through cross-model review before it ships. Every claim in an ISD traces backward to the evidence that supports it and forward to the rule or test that enforces it — we call this trace the Golden Thread, and it is what makes the spec, the implementation, and the audit trail one continuous chain rather than three separate documents. The human supplies the intent and the spec; AI generates the implementation; automated tests confirm the implementation matches the spec.
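
As an illustration of the Golden Thread idea, here is a minimal Python sketch of a claim record that links backward to evidence and forward to enforcement, plus a check that reports where the thread is broken. The class names, field names, and the two middle trust-level names are assumptions made for this example; the ISD format itself is not published here.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Trust(IntEnum):
    """Four trust levels. Only the two endpoints are named in the article."""
    LLM_INFERRED = 1
    LEVEL_2 = 2            # placeholder name, not from the article
    LEVEL_3 = 3            # placeholder name, not from the article
    VERIFIED_INTERNAL = 4


@dataclass
class Claim:
    claim_id: str
    text: str
    trust: Trust
    evidence: list[str] = field(default_factory=list)      # backward links
    enforced_by: list[str] = field(default_factory=list)   # forward links to rules or tests


def broken_threads(claims: list[Claim]) -> dict[str, list[str]]:
    """Return, per claim id, which side of the trace is missing."""
    gaps: dict[str, list[str]] = {}
    for claim in claims:
        missing = []
        if not claim.evidence:
            missing.append("evidence")
        if not claim.enforced_by:
            missing.append("enforcement")
        if missing:
            gaps[claim.claim_id] = missing
    return gaps
```

A spec would pass this check only when `broken_threads` returns an empty dictionary, which is one simple way to make the trace a gate rather than a convention.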

Paper, Tekion's AI-native strategy and document platform, was built spec-driven from day one. One engineer plus Claude grew it from zero to 52,000 lines of code and 123 MCP tools in two months, and the tool count is in the 140s today. Our strategy and product documents now move through the same evidence-anchored pipeline we use to ship code: the same trust ratings, the same cross-model review, the same audit trail. The two-month build is several times the output a single engineer would produce in that window without AI. More important than the speed, the foundations stay sound because the spec stays sound.

This is not a story about replacing engineers. It is a story about what skilled engineers produce when they spend zero time on the implementation an automated test would have caught anyway, and full time on the three roles a model cannot hold: framing intent, defining the spec, and validating that the result delivers on the intent.

Principle 2 — AI Writes the Rule-Based Logic

Use AI for what AI is good at; use rules for everything else; and let AI write the rules.

The design heuristic behind every architectural choice that follows: AI is exceptional at code generation, rule generation, reasoning across unstructured inputs, and producing text and media. For every other class of problem, purpose-built rule-based logic still wins on accuracy and on cost. The shift is not that AI replaces rule-based systems — it is that AI now writes the rule-based systems for us, quickly and cheaply.

A naive AI-native architecture invokes a large model at every decision point in the stack. The model becomes the runtime, the routing, and the orchestration glue all at once. It is expensive, the system's behavior lives inside opaque prompt context, and you cannot inspect why one input produced a particular output. The pattern that scales does the opposite: the model is used where models genuinely outperform — generating code, generating rules, reasoning across messy inputs, producing text and media — and explicit rule-based logic runs everything else. The decision between the two gets made deliberately at every layer of every system.

The reason this is the most consequential shift of 2026: until recently, the engineering bottleneck was building the rule-based logic. Now an AI writes that logic in minutes (readable, inspectable, fast at runtime) and a human reviews and ships. Whole categories of cost, latency, and unpredictability that were considered inherent to AI-native software turn out to be products of putting the model in the wrong layer. Move the model to where it shines; put rules everywhere else. Most AI-native architectures we see today keep the LLM in the runtime path for operations a rule would handle better. The fix is to have the model write the rule once at build time and let the rule run at runtime. The concrete shape of this on Paper: the platform runs 108 automated dependency checks across six classes today, grown from 30 at inception, each one a rule the AI wrote so the model does not have to evaluate it at runtime.
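
The build-time versus runtime split can be sketched as a rule-first dispatcher: rules run first, and the model is called only when no rule applies. This is a minimal sketch under assumptions; `date_format_rule`, `resolve`, and the `model_call` parameter are hypothetical names, not part of Paper or ORBIT.

```python
import re
from typing import Callable, Optional

# A rule returns an answer, or None when it does not apply to the input.
Rule = Callable[[str], Optional[str]]


def date_format_rule(text: str) -> Optional[str]:
    """Hypothetical rule: recognize a YYYY-MM-DD shaped string without a model call."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text.strip()):
        return "date"
    return None


def resolve(text: str, rules: list[Rule], model_call: Callable[[str], str]) -> tuple[str, str]:
    """Try every rule first; fall back to the model only if none applies."""
    for rule in rules:
        result = rule(text)
        if result is not None:
            return result, "rule"
    return model_call(text), "model"
```

Returning the source alongside the answer makes it easy to count how often the model is actually reached, which is the number the quarterly rule-conversion review would look at.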

Act II — The Safety Net

Principle 3 — Three Independent Layers

No single check is enough. We use three independent layers — prevention, detection, and recovery — and each catches what the others miss.

Once you have an AI agent in your system, runtime improvisation is a property of the model, not something you can prompt your way out of. Across our internal AI engineering platform we observe agents deviate from explicit instructions in roughly one in ten cases — observed on our internal AI tooling, not claimed as an industry-stable rate or as a constant across task types and model versions. A single check at any one point in the system is insufficient at any scale you care about. We use three independent layers.

Prevention lives upstream of the call: clear instructions placed at the point of action (Principle 5), plus rule-based code everywhere rule-based code outperforms a model call (Principle 2). Detection lives at the call: automated test patterns that assert the agent's chosen action matches the spec, not its narrated reasoning — agents are very good at narrating an approach they did not actually take. Recovery lives downstream: a rollback or kill-switch path any caller can trigger when detection fires. ORBIT, Tekion's AI-native engineering framework, is what enforces this three-layer discipline across every AI capability we ship, from internal engineering agents to the customer-facing systems in T1.

The three layers fail independently. A prompt change does not break the test patterns; a false positive in the test patterns does not trigger the rollback; the rollback only fires when detection asks. Composability is the entire point. And recovery only works when actions are reversible: in any system where side effects cannot be undone — a sent communication, a committed transaction, a billed charge — prevention and detection absorb proportionally more of the work, because the recovery layer cannot.
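
One possible arrangement of the three layers, sketched in Python. The `Action` type, the allow-list check, and the outcome strings are assumptions for illustration; ORBIT's actual enforcement is not described at this level of detail.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    name: str
    target: str
    reversible: bool


def prevent(action: Action, allowed_targets: set[str]) -> bool:
    """Prevention: a rule-based pre-check that runs before the action."""
    return action.target in allowed_targets


def detect(executed: Action, expected_name: str) -> bool:
    """Detection: compare the action that ran to the spec, not to narrated reasoning."""
    return executed.name == expected_name


def run_guarded(action: Action, expected_name: str, allowed_targets: set[str],
                execute: Callable[[Action], None],
                rollback: Callable[[Action], None]) -> str:
    if not prevent(action, allowed_targets):
        return "blocked"
    execute(action)
    if detect(action, expected_name):
        return "ok"
    if action.reversible:
        rollback(action)          # recovery fires only when detection asks
        return "rolled_back"
    return "escalated"            # irreversible side effect: recovery cannot help
```

The last branch is the case the paragraph above warns about: when an action cannot be undone, the only useful work happened in `prevent` and `detect`.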

Principle 4 — Cross-Model Verification at High Stakes

Get a second opinion. At the decisions that matter, route the input through two different AI systems and reconcile.

At the decisions that matter — security review, architecture choices, strategy documents an external reader will treat as authoritative — we route the same input through two or three different AI model families and reconcile the outputs. Family diversity matters. Two models from the same family tend to agree on substance and disagree only on cosmetics; two models from different families disagree on substance more often, and that disagreement is where missed problems surface.

In practice, the model families divide the work in characteristic ways. Claude Opus and Sonnet are our primary development models — strongest in our pipeline for code generation, spec authoring, and structured reasoning across long context. GPT excels at logic auditing: when we need a different family to challenge a claim chain or detect a circular argument, we route it there. Gemini is our default for fact-checking, and is also our strongest model for media-heavy work and for running large evaluation batches against curated test sets. We match the model family to the kind of verification the high-stakes path needs.

Family diversity is not the same as independent verification — shared training data and shared training methods mean different-family models can converge on the same wrong answer. The principle holds for known classes of risk, not as a universal hedge. So we name the high-stakes path explicitly, and any output that travels that path carries a second-family check before it is published. The trap is treating cross-model verification as something you turn on after a problem appears. The principle is to turn it on for the categories where a problem after the fact is the costly outcome.
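
A minimal sketch of the reconcile step, assuming each model family is wrapped in a plain callable that returns a verdict string. The function name and the verdict values are hypothetical; real reconciliation would compare richer outputs than a single label.

```python
from typing import Callable

# A reviewer takes the input and returns a verdict such as "pass" or "fail".
Reviewer = Callable[[str], str]


def cross_check(document: str, reviewers: dict[str, Reviewer]) -> dict:
    """Route one input through several model families and report whether they agree."""
    if len(reviewers) < 2:
        raise ValueError("the high-stakes path needs at least two model families")
    verdicts = {family: review(document) for family, review in reviewers.items()}
    return {
        "verdicts": verdicts,
        "agreed": len(set(verdicts.values())) == 1,
    }
```

Agreement here is weak evidence, for the reason given above: families can share the same blind spot. Disagreement is the stronger signal and is what should route the output to a human.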

Principle 5 — Directive Position Beats Directive Content

Where you put an instruction matters more than what it says. Put rules right next to the work, not at the top of the document.

The single most non-obvious finding from operating AI agents at scale: an instruction's position matters more than its content. Across our internal AI engineering platform, an instruction placed at the top of a configuration file, system prompt, or README is followed less reliably than the same instruction placed inline — directly above the line of code or workflow step where the agent actually has to act on it. Moving the same instruction from the preamble into the comment immediately above the call site raises compliance sharply. The roughly-one-in-ten deviation rate from Principle 3 is what made this measurable on our internal AI tooling; multiple internal experiments have reproduced it.

This is the inverse of the rule that holds for human readers. For humans, important things go at the top because attention starts there. Our working explanation for the agent version is mechanical: by the time of the action, an instruction at the top has already been buried under everything else in the context, while an instruction inlined at the point of action is one the agent cannot route around without an explicit choice. The mechanism is our best account; the result is what we measured. The concrete shape: a directive on line 1400 of a configuration file is invisible to an agent processing step three at line 170 — the agent never sees the directive at all in the context window it is actually using.

The practical consequence: centralize rules for human governance, inline rules for agent compliance, and run both. The cost is some duplication; the benefit is a sharp reduction in the kind of "the agent ignored a rule we wrote down" surprises that look inexplicable until you understand where the rule was actually sitting.
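
"Centralize and inline, and run both" can be sketched as a small renderer that keeps rules in one registry for humans and copies each rule directly above the step it governs for the agent. The step identifiers and the `# RULE:` prefix are assumptions made for this example.

```python
def inline_directives(steps: list[tuple[str, str]], rules: dict[str, list[str]]) -> str:
    """Render a workflow with each directive placed directly above the step it governs.

    steps: ordered (step_id, instruction) pairs.
    rules: central registry mapping step_id to the directives that apply there.
    """
    lines = []
    for step_id, instruction in steps:
        for directive in rules.get(step_id, []):
            lines.append(f"# RULE: {directive}")
        lines.append(f"{step_id}: {instruction}")
    return "\n".join(lines)
```

The registry stays the single source of truth for governance, and the duplication into the rendered workflow is generated rather than maintained by hand.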

Principle 6 — Review the Plan, Not Just the Call

Approve the agent's plan, not its individual moves. A bad plan with safe-looking individual steps is still a bad plan.

When an agent proposes a sequence of actions, we do not approve them one-by-one as they execute. We approve the plan against the permission model first, then let it run. A concrete example: an agent's plan ends with submitting a deal that violates dealer policy, scheduling a payment outside an approved billing window, or modifying a customer record in a way that triggers an audit-trail requirement. Reviewing each call in isolation against an allow-list approves the safe-looking earlier steps one by one and leaves the agent positioned to execute the forbidden step through whatever path remains. Plan-level review reads the full proposed sequence as the unit of approval.

The permission model treats variations of the same action as separate cases. Writing to staging and writing to production are different actions. Reading a customer record and updating it are different actions. Submitting a deal under one dealer's authority and another's are different actions. Approving a refund below a threshold and above it are different actions. Keeping that level of detail current is patient, repeatable work the agent will quietly encourage you to skip. Skipping it is how unintended actions reach production. We do not skip it.
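
A minimal sketch of plan-level approval, assuming a permission model expressed as a set of allowed (action, scope) pairs so that writing to staging and writing to production are distinct entries. The names `Step`, `review_plan`, and `run_if_approved` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    action: str   # e.g. "read", "write"
    scope: str    # e.g. "staging", "production"


def review_plan(plan: list[Step], allowed: set[tuple[str, str]]) -> list[str]:
    """Return every objection to the plan; an empty list means the plan is approved."""
    return [
        f"step {index}: '{step.action}' on '{step.scope}' is not permitted"
        for index, step in enumerate(plan)
        if (step.action, step.scope) not in allowed
    ]


def run_if_approved(plan: list[Step], allowed: set[tuple[str, str]],
                    execute: Callable[[Step], None]) -> list[str]:
    """Execute the plan only if the whole sequence passes review."""
    objections = review_plan(plan, allowed)
    if not objections:
        for step in plan:
            execute(step)
    return objections
```

The point of the second function is that no step runs, including the safe-looking early ones, when any step in the sequence is forbidden.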

Act III — The Operations

Principle 7 — Build the Evaluations Before You Build the Agent

Decide what "works" looks like before you build the agent. The test set is the contract.

Evaluation is not something you bolt on after the agent works in a demo. Evaluation is the artifact you build first, because the evaluation is what defines "works" before any code is written. Two disciplines work together. Curated test sets are hand-picked, expert-reviewed reference inputs that define the quality bar — the set we measure every change against. Continuous evaluation runs the curated set against the system continuously through development and into operation, with a parallel-mode test for every prompt and every model change: the candidate version runs alongside the current version, both outputs are captured, and the two are reconciled offline before any user-visible behavior changes.
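
The parallel-mode test described above can be sketched as follows: the current version's output is what gets served, while the candidate's output is only captured for offline comparison. This is a sketch under assumptions; both versions are modeled as plain callables and the names are hypothetical.

```python
from typing import Callable

Model = Callable[[str], str]


def run_parallel(inputs: list[str], current: Model,
                 candidate: Model) -> tuple[list[str], list[dict]]:
    """Serve the current version; capture the candidate's output for offline review."""
    served, differences = [], []
    for item in inputs:
        live = current(item)
        shadow = candidate(item)   # captured only, never shown to the user
        served.append(live)
        if shadow != live:
            differences.append({"input": item, "current": live, "candidate": shadow})
    return served, differences
```

Exact string comparison is the simplest possible reconciliation; a real pipeline would likely score both outputs against the curated set instead.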

Two properties matter more than raw size: variants (the number of meaningfully different input shapes a single test exercises — language, intent phrasing, edge case, error path) and coverage (whether the set spans the range of behavior the system's autonomy level can produce). The working minimums we use across our internal AI engineering platform — conventions, not industry standards — scale with autonomy tier: 5 items for an advisory check, 20 for a routine assistant, 50 for a supervised-autonomous agent, 100 for an agent inside guardrails, 200+ for a fully autonomous agent. For T1's customer-facing scale — every dealer in our portfolio, hundreds of distinct workflows, real-world inputs with the variance only production traffic produces — the curated sets and the variants on them are orders of magnitude larger than those internal minimums, and they grow as the platform expands. A skeptical reader will reasonably ask why a particular set size is right for a particular system. The answer is always empirical: the set is large enough when it surfaces regressions that smaller sets miss, and we know because we have run both.
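
The working minimums and the variant requirement can be expressed as a readiness check. The tier keys and the `variant` field are assumptions for this sketch; the item counts are the ones stated above.

```python
# Working minimums from the paragraph above, keyed by hypothetical tier names.
MIN_ITEMS = {
    "advisory_check": 5,
    "routine_assistant": 20,
    "supervised_autonomous": 50,
    "inside_guardrails": 100,
    "fully_autonomous": 200,
}


def eval_set_gaps(tier: str, items: list[dict], required_variants: set[str]) -> list[str]:
    """List the reasons a curated set is not yet ready for the given autonomy tier."""
    gaps = []
    minimum = MIN_ITEMS[tier]
    if len(items) < minimum:
        gaps.append(f"size {len(items)} is below the minimum of {minimum}")
    covered = {item["variant"] for item in items}
    missing = sorted(required_variants - covered)
    if missing:
        gaps.append("missing variants: " + ", ".join(missing))
    return gaps
```

A set can meet its size minimum and still fail this check, which mirrors the claim that variants and coverage matter more than raw size.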

The other half of this principle is the operating model. Pre-release evaluation tells you the change is plausible. Continuous evaluation tells you it survives. The combination — curated sets that define quality, continuous evaluation that watches it — is what catches the regressions a one-shot pre-release test cannot see.

Principle 8 — Continuous Quality Monitoring Across Six Signals

Watch six signals continuously. Quality regression does not announce itself.

Quality issues in AI systems do not announce themselves. Across our internal AI engineering platform, we watch six signals continuously, and the pattern across signals tells us when something has changed before any change reaches an engineering team downstream or ships into a customer-facing system. The six: curated-set pass rate (the reference benchmark moves), output-quality shape (the distribution of output quality across measured dimensions changes), confidence calibration (the agent's stated confidence stops agreeing with actual outcome correctness), escalation rate (the human-in-the-loop rate moves without obvious cause), cost, and latency. Several overlap with standard infrastructure telemetry; the rest are specific to AI systems.

Each signal gets a threshold response across four severity levels — notification, human review, partial rollback, kill-switch — applied per signal, not per agent. The rule we apply: one signal moving is information; two signals moving together is a finding; three is an incident. The whole pipeline wires to the same on-call surface that handles infrastructure incidents — not a Monday dashboard, because the urgency is the same. These six signals cover quality regression. They do not cover bias against user classes, privacy leakage, or low-rate data exfiltration; those need their own monitoring layers. The discipline is to know what each layer catches and what it does not.
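
The one, two, three rule can be sketched as a small classifier over the six signals. Representing each signal as a single number and using one shared relative tolerance are simplifying assumptions; the signal keys and the 10% default are hypothetical.

```python
SIGNALS = (
    "curated_pass_rate", "output_quality_shape", "confidence_calibration",
    "escalation_rate", "cost", "latency",
)


def moved_signals(baseline: dict[str, float], current: dict[str, float],
                  tolerance: float = 0.10) -> set[str]:
    """Signals whose relative change from baseline exceeds the tolerance."""
    moved = set()
    for name in SIGNALS:
        base, now = baseline[name], current[name]
        # Fall back to absolute change when the baseline is zero.
        change = abs(now - base) / abs(base) if base else abs(now)
        if change > tolerance:
            moved.add(name)
    return moved


def classify(moved: set[str]) -> str:
    """One signal is information, two together is a finding, three or more is an incident."""
    if len(moved) >= 3:
        return "incident"
    return {0: "quiet", 1: "information", 2: "finding"}[len(moved)]
```

In practice each signal would carry its own threshold and severity ladder, as the paragraph above describes; the shared tolerance here only keeps the sketch short.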

Principle 9 — No Silent Failures

When the system falls back, the user has to see it. Anything else propagates downstream.

When an AI system falls back to a degraded mode, the user has to see it. Not the log, not the cache, not the trace — the surface they actually read. This sounds obvious; it is the principle most often violated in early AI products. The shape of the problem is consistent: a fallback path emits its degraded output, the downstream consumer treats that output as authoritative because nothing in the surface tells it not to, and three steps later the system has propagated a low-quality answer that looks identical to a high-quality one.

We learned this inside our internal AI engineering platform, when a fallback we built emitted its output only to a log file. The next stage of the workflow read the fallback output, treated it as canonical, and the divergence spread across half-a-dozen downstream stages before we noticed. None of this reached customers — it happened inside our own tooling — but the lesson was unambiguous. The fix is not better logging. The fix is treating the user-visible output surface as the only honest place to report state. The test pattern: for every degraded-mode path in the system, an explicit check that the path emits a marker that travels all the way to the surface the user actually sees — a banner, a confidence badge, a prefix, an inline note. The constraint is only that the marker survives the trip.
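
The test pattern can be sketched as a degraded flag that every stage must carry forward and that the rendering step turns into a visible marker. The `Result` type, the stage functions, and the `[DEGRADED]` prefix are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    text: str
    degraded: bool = False


def answer(query: str, primary: Callable[[str], str],
           fallback: Callable[[str], str]) -> Result:
    """Use the primary path; on failure, fall back and mark the result as degraded."""
    try:
        return Result(primary(query))
    except Exception:
        return Result(fallback(query), degraded=True)


def downstream_stage(result: Result) -> Result:
    """A later stage transforms the text but must carry the flag forward unchanged."""
    return Result(result.text.upper(), degraded=result.degraded)


def render(result: Result) -> str:
    """The user-visible surface: the marker has to survive all the way here."""
    prefix = "[DEGRADED] " if result.degraded else ""
    return prefix + result.text
```

A check for each degraded-mode path would force the fallback, run the full chain of stages, and assert on the rendered string rather than on a log line.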

Principle 10 — Token Budgets Are Targets, Not Gates

Manage AI cost by replacing AI calls with rules where rules win; treat the remaining budget as an efficiency target, not a hard limit.

Everyone running AI in production knows the same thing by now: large models are expensive. The question is how to manage that cost without sacrificing the output. Our answer has two parts.

First — and this is where Principle 2 returns to do the heaviest lifting — the cheapest model call is the one a rule replaces. The fastest way to bring an AI system's cost down by an order of magnitude is to inspect every model call in the hot path and ask whether the same answer would have come from a rule-based path the AI itself could have written in minutes. On Paper, roughly seven of every ten pipeline operations already execute without an LLM call — they are rule-based steps the AI generated as code up front, running at greater than 99% success rate. The remaining three in ten escalate to a model call, and those are the ones that carry the cost. The architectural move is not "use less AI"; it is "have the AI write the rule so the next run does not need the model at all." The discipline is to look, every quarter, for model calls that have become rule-shaped over time and convert them.

Concrete instance: one Paper operation that used to be an LLM call is validating that a new MCP tool conforms to the platform schema. Originally, a Claude call read the tool definition and returned pass/fail. We replaced it with a Python validator the AI wrote in an afternoon: it walks the schema, checks each field, and emits a structured error trace. Every Paper run executes this rule today. Multiply by the hundred-plus rule-based steps in the pipeline and the seven-of-ten ratio is what you get.
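
To make the shape of such a validator concrete, here is a minimal sketch. The platform schema is not published in this article, so the required fields below are an assumption chosen to resemble a typical MCP tool definition; this is not Paper's validator.

```python
# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"name": str, "description": str, "inputSchema": dict}


def validate_tool(tool: dict) -> list[dict]:
    """Walk the required fields and return a structured list of errors (empty means pass)."""
    errors = []
    for field_name, expected in REQUIRED_FIELDS.items():
        if field_name not in tool:
            errors.append({"path": field_name, "error": "missing"})
        elif not isinstance(tool[field_name], expected):
            errors.append({"path": field_name, "error": f"expected {expected.__name__}"})
    schema = tool.get("inputSchema")
    # Assumed convention: the input schema describes a JSON object.
    if isinstance(schema, dict) and schema.get("type") != "object":
        errors.append({"path": "inputSchema.type", "error": "expected 'object'"})
    return errors
```

Unlike a pass/fail model call, the returned list says exactly which field failed and why, and it gives the same answer on every run.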

Second, treat the model budget that remains as an efficiency target, never as a functional gate. A budget tells you when a path has become more expensive than it should be. A gate would tell you to skip the deliverable rather than exceed the budget — and the cost of a skipped deliverable, in our experience, almost always exceeds the cost of the tokens. If a budget keeps getting hit, the right response is to fix the path — convert calls to rules where the rules win — not to truncate the output. The exception worth naming: in systems with hard cost ceilings (embedded inference, regulated per-call billing, edge runtimes), partial output may be preferable to no output, and there budgets do become gates by design.
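
The difference between a target and a gate can be sketched in a few lines. The `TokenBudget` class and its `hard_ceiling` flag are hypothetical names; the flag stands in for the exception cases named above.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBudget:
    target: int
    hard_ceiling: bool = False   # True only for systems with a real cost ceiling
    overruns: list[tuple[str, int]] = field(default_factory=list)

    def check(self, path: str, tokens: int) -> bool:
        """Record any overrun; return whether the deliverable should still be produced."""
        if tokens <= self.target:
            return True
        self.overruns.append((path, tokens))
        return not self.hard_ceiling
```

With the default setting an overrun is recorded and the work still ships, so the `overruns` list becomes the backlog of paths to convert to rules. Only a budget created with `hard_ceiling=True` ever blocks output.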

Why These Ten

These principles are drawn from operating an AI-native engineering organization through 2026: Paper, ORBIT, and our internal AI productivity platform together, with T1 as the customer-facing surface they all feed. Every principle is anchored to behavior we have observed across that infrastructure. The system metrics cited are measured on internal infrastructure as of 2026-05-24: Paper reached 52,000 lines, 123 MCP tools, and 1,914 tests in two months, and today has MCP tools in the 140s and tests in the low 2,000s; agents deviate from explicit instructions in roughly one in ten cases. The qualitative comparisons reflect our experience operating these systems. All the systems referenced are in active production for our own teams today.

What we unveil through T1 at Tekion One is the first wide customer-facing application of this discipline. The assurance we want to give every dealer, every analyst, every investor, and every engineering leader reading this is the one we believe matters most: we are not betting on AI tools to be careful. We are betting on the spec-driven discipline around the tools — and the rule-based logic the AI now writes for us — to do the heavy lifting. The principles above are that discipline, written down so you can apply them too.

Why The Discipline Compounds

The discipline produces artifacts, and the artifacts compound.

Every ISD we write becomes a reusable input the next time AI generates a related implementation. Every curated test set becomes a regression gate the next agent has to clear. Every rule the AI writes is faster, cheaper, and more inspectable than the model call it replaces — and stays that way. A team starting from zero today builds the discipline; a team that has been running it for years has the spec library, the eval corpus, the rule library, and the audit trail of how each was checked. The work is reproducible; the body of work is not.

The investments accrue under one Golden Thread — every claim in a spec linking to the evidence that supports it and the rule or test that enforces it. Tracing any line of code back to the spec it implements, and any spec line back to the evidence that warrants it, is not a compliance add-on; it is how the system stays honest. And it inverts the usual AI-cost trajectory: a system run this way gets cheaper per unit of work over time, not more expensive, because every rule the AI writes today is a model call it does not make tomorrow. More broadly: systems built this way improve with use — every spec, test, and rule the AI writes reduces future cost and increases correctness.

Layer 2 — Structured Reference

| # | Principle | What it protects against |
|---|-----------|--------------------------|
| 1 | Spec-Driven Development | Vibe-coding plateaus around medium complexity; without a precise spec, AI generates fragile output and engineering teams cannot scale |
| 2 | AI Writes the Rule-Based Logic | LLM-as-glue architectures spend model tokens on every decision a rule would have answered cheaper and more accurately |
| 3 | Three Independent Layers | A single check leaves the gap the other layers would have closed; runtime improvisation is a model property at ~10% rate |
| 4 | Cross-Model Verification at High Stakes | Single-model blind spot ships to a downstream consumer unreviewed |
| 5 | Directive Position Beats Directive Content | An instruction at the top is ignored at the point of action; instructions belong inline at the decision |
| 6 | Review the Plan, Not Just the Call | Forbidden action rejected; the plan that produced it left intact, agent re-routes through the next path |
| 7 | Build the Evaluations Before You Build the Agent | Evaluation bolted on after-the-fact has no quality bar; curated sets plus continuous evaluation define and protect quality from day one |
| 8 | Continuous Quality Monitoring Across Six Signals | Slow regression accrues across releases; no single signal alone names it |
| 9 | No Silent Failures | Degraded output reaches the user surface with no marker; downstream treats it as canonical |
| 10 | Token Budgets Are Targets, Not Gates | LLM calls left in the hot path that a rule would have answered cheaper; deliverable dropped to fit a budget |

What links the ten: the Golden Thread — every claim in a spec links backward to evidence and forward to the rule or test that enforces it. The principles are how we operate; the Golden Thread is how each operation is traceable end-to-end.

Glossary

AI-native: software where AI participates in every layer of the development process and runtime, with a platform-wide framework governing how.

Spec-driven development: the engineering pattern where a precise specification defines the system and AI generates most of the implementation.

ISD (Intent Specification Document): Tekion's structured fifteen-section spec format — evidence-rated, cross-model-reviewed, separating app-independent intent from implementation-specific mapping.

Golden Thread: the trace discipline in our pipeline — every claim in a spec links backward to the evidence that supports it and forward to the rule or test that enforces it; the mechanism by which every output we ship is auditable end-to-end.

ORBIT: Tekion's AI-native engineering framework — the rubric, lifecycle, and architecture taxonomy every AI capability passes through.

T1: Tekion's AI-native dealership platform, launching at Tekion One.

Paper: Tekion's AI-native strategy and document platform, with MCP tools numbering in the 140s today; the same pipeline produces strategy documents and ships code.

Vibe coding: prompting AI to write code without an explicit spec — works well for confined work, breaks at larger scale.

Rule-based logic: purpose-built code an LLM can write and does not need to execute at runtime — faster, cheaper, and more accurate than the equivalent LLM call.

Three independent layers: prevention + detection + recovery, each with independent failure modes.

Cross-model verification: routing the same input through 2–3 different AI model families and reconciling at high-stakes decisions.

Directive position: the physical location of an instruction in an agent's context; point-of-action beats preamble.

Plan review: approving the agent's full proposed sequence against the permission model, not each individual call.

Curated test set: hand-picked, expert-reviewed reference inputs used as the quality bar.

Continuous evaluation: running curated sets against the system continuously through development and into operation, with parallel-mode testing for every change.

Quality monitoring: continuous watch over six output-quality signals with threshold-based response.

All numeric claims accurate as of 2026-05-24.
