Engineering standards

How we evaluate AI before it goes near your business

Most AI agencies open with a demo. We'd rather show you the bar a system has to clear before it touches your customers, your team, or your data — and what we do when it doesn't clear that bar yet. This is the standard every engagement is held to, from the audit through delivery and the months after.

Measured evals, not vibes

Before any agent, assistant, or automation goes live, we write down what "working" means for that specific job — a set of real example inputs with the outcomes we'd expect — and test against it before and after every meaningful change. "It looked right when I tried it" isn't a result we ship on. If we can't point to a measurement, we don't claim it.
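
As an illustration only, the "write it down, then measure before and after" loop can be sketched in a few lines of Python. Everything named here (EvalCase, pass_rate, safe_to_ship, the two toy routers, the 0.9 bar, and exact-match scoring) is a hypothetical stand-in chosen for the example, not a description of a specific production harness. Real jobs usually need a richer scoring rule than string equality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str    # a real example input
    expected: str  # the outcome we'd expect for it


def pass_rate(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases where the system's output matches the expected outcome."""
    if not cases:
        raise ValueError("an eval set needs at least one case")
    passed = sum(
        1 for c in cases
        if system(c.prompt).strip().lower() == c.expected.strip().lower()
    )
    return passed / len(cases)


def safe_to_ship(before: float, after: float, minimum: float = 0.9) -> bool:
    """Assumed rule: a change ships only if it meets the bar and does not regress."""
    return after >= minimum and after >= before


# Hypothetical job: route an inbound message to the right team.
CASES = [
    EvalCase("Where is my refund?", "billing"),
    EvalCase("I can't log in", "support"),
    EvalCase("Do you offer annual plans?", "sales"),
    EvalCase("My invoice is wrong", "billing"),
]


def old_router(text: str) -> str:
    return "billing" if "refund" in text.lower() else "support"


def new_router(text: str) -> str:
    t = text.lower()
    if "refund" in t or "invoice" in t:
        return "billing"
    if "plan" in t:
        return "sales"
    return "support"


before = pass_rate(old_router, CASES)
after = pass_rate(new_router, CASES)
print(f"before={before:.2f} after={after:.2f} ship={safe_to_ship(before, after)}")
```

The point of the sketch is the shape, not the numbers: the eval set is fixed in advance, both versions are scored against the same cases, and the ship decision is read off a measurement rather than an impression.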

The RAG triad — for anything that answers from your documents

A "knowledge assistant" that answers from your SOPs, policies, contracts, or product docs is scored on three things before it's trusted with real questions: context relevance (did it pull the right material for this question), groundedness (is every claim in the answer actually traceable to that material, with a citation), and answer relevance (does the response actually address what was asked). A confident-sounding answer that fails any one of these is worse than no answer at all — so all three have to hold up.
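
A minimal sketch of the "all three have to hold up" rule, assuming each dimension has already been scored between 0 and 1 by a human reviewer or an automated grader. The TriadScores type, the function names, and the 0.8 threshold are hypothetical choices made for the example.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TriadScores:
    context_relevance: float  # did retrieval pull the right material?
    groundedness: float       # is every claim traceable to that material?
    answer_relevance: float   # does the response address the question?


def failing_dimensions(scores: TriadScores, threshold: float = 0.8) -> list[str]:
    """Names of the triad dimensions that fall below the (assumed) threshold."""
    return [name for name, value in asdict(scores).items() if value < threshold]


def passes_triad(scores: TriadScores, threshold: float = 0.8) -> bool:
    """An answer passes only if no dimension fails; there is no averaging."""
    return not failing_dimensions(scores, threshold)


# A fluent, on-topic answer that is not supported by the retrieved documents.
confident_but_ungrounded = TriadScores(
    context_relevance=0.95, groundedness=0.55, answer_relevance=0.90
)
print(failing_dimensions(confident_but_ungrounded))
print(passes_triad(confident_but_ungrounded))
```

Checking each dimension separately, rather than averaging them, is what keeps a strong score on two dimensions from hiding a failure on the third.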

OWASP LLM Top 10 awareness

We design against the known failure modes of LLM-powered applications — prompt injection, insecure output handling, sensitive-data leakage, excessive agency, supply-chain and model-misuse risks, and the rest of the OWASP Top 10 for LLM Applications — before launch, rather than discovering them from a user afterward.
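
To make one of these failure modes concrete, here is a small sketch of guarding against insecure output handling: model output is treated as untrusted text and escaped before it is placed into a web page. The render_reply function is a hypothetical helper for illustration; html.escape is from the Python standard library. This covers one narrow case and is not a complete defense on its own.

```python
import html


def render_reply(model_output: str) -> str:
    """Escape model output so any markup in it is displayed, not executed."""
    return f"<p>{html.escape(model_output)}</p>"


# Output that a prompt-injected model might be steered into producing.
malicious = "<script>alert(1)</script>"
rendered = render_reply(malicious)
print(rendered)
```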

Human-in-the-loop by default

AI drafts, qualifies, summarizes, and flags. People decide on anything that touches money, legal commitments, or a relationship that matters. Every system we build has a defined point where confidence drops and a human takes over, and that threshold is something you can see and adjust, not a black box.
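
A handoff rule of this kind can be sketched as follows. This is an assumption-laden illustration: the Draft type, the topic labels, the Route names, and the 0.85 default are all hypothetical, and how confidence is estimated is left outside the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    AUTO = "auto"    # low-stakes output proceeds without escalation
    HUMAN = "human"  # a person reviews and decides


# Assumed labels for the categories where people always decide.
SENSITIVE_TOPICS = frozenset({"money", "legal", "key_relationship"})


@dataclass(frozen=True)
class Draft:
    text: str
    confidence: float       # 0..1, however the system estimates it
    topics: frozenset[str]  # labels attached upstream


def route(draft: Draft, threshold: float = 0.85) -> Route:
    """Sensitive topics always go to a human; otherwise the threshold decides."""
    if draft.topics & SENSITIVE_TOPICS:
        return Route.HUMAN
    if draft.confidence < threshold:
        return Route.HUMAN
    return Route.AUTO


summary = Draft("Call summary for the ops team", 0.93, frozenset({"summary"}))
refund = Draft("Approve a refund of the full order", 0.99, frozenset({"money"}))
unsure = Draft("Possible answer to a warranty question", 0.60, frozenset({"faq"}))

for d in (summary, refund, unsure):
    print(d.text, "->", route(d).value)
```

Keeping the threshold as an explicit parameter, rather than burying it in a prompt, is what makes it visible and adjustable: raising it sends more work to people, and lowering it sends less.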

Per-tenant isolation

If a system holds data for more than one client, department, or workspace, that data is scoped and separated at the source — one tenant's documents, conversations, or history are never reachable from another tenant's queries. This is something we verify end-to-end, not something we assume because the underlying platform "probably handles it."
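
One way to picture "scoped at the source" is a store that partitions documents by tenant before any search runs, paired with a probe that tries to surface another tenant's data. The sketch below uses an in-memory dictionary and substring matching purely for illustration. TenantScopedStore, Doc, leaks, and the tenant names are hypothetical, and a real system would enforce the same scoping in its database or vector index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    tenant_id: str
    text: str


class TenantScopedStore:
    """Documents are partitioned by tenant; search never crosses partitions."""

    def __init__(self, docs: list[Doc]) -> None:
        self._by_tenant: dict[str, list[Doc]] = {}
        for doc in docs:
            self._by_tenant.setdefault(doc.tenant_id, []).append(doc)

    def search(self, tenant_id: str, query: str) -> list[Doc]:
        q = query.lower()
        # Only the caller's own partition is ever scanned.
        return [d for d in self._by_tenant.get(tenant_id, []) if q in d.text.lower()]


def leaks(store: TenantScopedStore, tenant_id: str, probes: list[str]) -> list[Doc]:
    """End-to-end check: any result owned by a different tenant is a leak."""
    return [
        d
        for probe in probes
        for d in store.search(tenant_id, probe)
        if d.tenant_id != tenant_id
    ]


store = TenantScopedStore([
    Doc("acme", "Acme pricing policy for 2024"),
    Doc("globex", "Globex pricing policy and refund terms"),
])

acme_hits = store.search("acme", "pricing")
print([d.text for d in acme_hits])
print(leaks(store, "acme", ["pricing", "refund", "globex"]))
```

The leaks probe is the part that corresponds to verifying rather than assuming: it queries as one tenant using terms that only appear in another tenant's documents and expects nothing to come back.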

Honest limitations

AI agents make mistakes, hallucinate when given insufficient context, and occasionally behave in ways nobody fully predicted. We'd rather tell you up front where a system is weakest, where review stays mandatory, and what we're genuinely not confident about than have your team find out the hard way. If a use case isn't ready for AI yet, we'll say so, even if that means a smaller engagement.

What this means for Knowledge Assistants (RAG)

Retrieval-augmented "ask your documents" assistants are some of the most useful systems we build, and some of the easiest to get subtly wrong: fluent, confident answers sourced from the wrong document, or from no document at all. Before we propose or bill for a knowledge assistant, it has to clear the RAG triad above on your content (measured context relevance, groundedness with working citations, and answer relevance), plus verified isolation if it serves more than one team or client. If a proposed system isn't there yet for your data, we'll say so directly and scope it as a roadmap item your audit can plan toward, not something we sell you before it's ready.

Where this leaves you

We don't have published case studies yet — the first engagements under this standard are still landing, and we'll publish real outcomes (with permission) once they exist, not before. Until then, this page is the bar we hold ourselves to, and you're welcome to probe any part of it during your audit. None of the above is a guarantee of a specific result for your business — what we build, and how it performs, still depends on your data, your processes, and the scope we agree on together.

Curious whether — and where — these standards apply to your operation? Start with a free AI Systems Audit and we'll walk through it together.