AI Evaluation & Decision-Making
Winning AI teams are not just better prompt writers. They are better quality-system designers. They turn traces into insight, insight into datasets, datasets into evals, evals into product decisions, and product decisions into compounding customer trust.
Core thesis
Winning AI teams are not just better prompt writers. They are better quality-system designers. They define what good work looks like, instrument the work, study traces, build datasets, run evals, monitor production, and use failures to improve faster than competitors. The product is not the code; the product is the quality system.
Measure work, not usage
Agent Success Rate is the new north star
Agent Success Rate (ASR) is a composite metric that combines task completion, user actions (accept, edit, retry), thumbs up/down ratings, semantic session signals, complaints, and eval scores. Usage alone tells you about adoption. ASR tells you whether the product works.
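One way to make the composite concrete is a weighted average over per-session signals. The sketch below is illustrative only: the weights, the `Session` fields, and the partial credit for an edit are assumptions, since the text names the signals but not how to combine them.

```python
from dataclasses import dataclass

# Assumed weights; the text lists the signals but not how to weight them.
WEIGHTS = {
    "completion": 0.30,
    "action": 0.20,
    "thumbs": 0.10,
    "semantic": 0.10,
    "complaint": 0.10,
    "eval": 0.20,
}

# Assumed credit per user action: an edit counts as a partial success.
ACTION_CREDIT = {"accept": 1.0, "edit": 0.5, "retry": 0.0}


@dataclass
class Session:
    task_completed: bool
    user_action: str        # "accept", "edit", or "retry"
    thumbs: int             # +1 (up), 0 (no rating), -1 (down)
    semantic_signal: float  # 0.0 to 1.0, e.g. from a session classifier
    complaint: bool
    eval_score: float       # share of evals passed, 0.0 to 1.0


def session_score(s: Session) -> float:
    """Weighted score in [0, 1] for one session."""
    parts = {
        "completion": 1.0 if s.task_completed else 0.0,
        "action": ACTION_CREDIT[s.user_action],
        "thumbs": (s.thumbs + 1) / 2,  # maps -1/0/+1 to 0.0/0.5/1.0
        "semantic": s.semantic_signal,
        "complaint": 0.0 if s.complaint else 1.0,
        "eval": s.eval_score,
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


def agent_success_rate(sessions: list[Session]) -> float:
    """Mean session score; returns 0.0 when there are no sessions."""
    if not sessions:
        return 0.0
    return sum(session_score(s) for s in sessions) / len(sessions)
```

Whatever weights a team picks, keeping them in one place makes the metric auditable and lets the weighting be revisited as trace analysis shows which signals actually predict good work.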
The AI Flywheel
- Agent Success Rate: composite north star for AI work quality
- Trace Analysis: personally inspect real and synthetic sessions to find user intents and failure modes
- Reference Dataset: golden outputs, edge cases, production failures, labels, notes, metadata
- Offline Evals: pre-release regression suite for prompt, model, tool, or architecture changes
- Online Monitoring: production scoring, drift detection, support themes, feedback back into the dataset
AI opportunity evaluation rubric
Score every serious AI opportunity on 10 dimensions:
- User intent clarity: recurring input families vs endless one-off prompts
- Observable success: acceptance, completion, resolution, edits, escalation vs "seems helpful"
- Golden-output feasibility: can a domain expert create ideal examples?
- Trace availability: does a realistic data source exist?
- Deterministic checks: schema, citations, tool calls, required fields
- LLM-judge suitability: binary single-criterion judges are possible
- Failure visibility: failures can be named and counted
- Feedback loop: usage produces labels and signals
- Risk and fallback clarity: the refusal or escalation threshold is clear
- Compound advantage: eval data becomes a moat
Prefer opportunities where success can become traces, labels, datasets, evals, experiments, and monitoring. Avoid ideas where "good" remains vague after prototyping.
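The ten dimensions above can be recorded as a simple scorecard. This is a minimal sketch: the 0 to 3 scale, the `score_opportunity` name, and the "weakest dimensions" output are assumptions, since the rubric does not specify a scoring scale.

```python
DIMENSIONS = (
    "user_intent_clarity",
    "observable_success",
    "golden_output_feasibility",
    "trace_availability",
    "deterministic_checks",
    "llm_judge_suitability",
    "failure_visibility",
    "feedback_loop",
    "risk_and_fallback_clarity",
    "compound_advantage",
)

MAX_SCORE = 3  # assumed scale: 0 (absent) to 3 (strong)


def score_opportunity(scores: dict[str, int]) -> dict:
    """Validate a scorecard and return its total plus the weakest dimensions."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= scores[d] <= MAX_SCORE:
            raise ValueError(f"{d} must be between 0 and {MAX_SCORE}")
    lowest = min(scores[d] for d in DIMENSIONS)
    return {
        "total": sum(scores[d] for d in DIMENSIONS),
        "max": MAX_SCORE * len(DIMENSIONS),
        "weakest": [d for d in DIMENSIONS if scores[d] == lowest],
    }
```

Surfacing the weakest dimensions alongside the total keeps one strong area from hiding the case the rubric warns about, where "good" is still vague.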
Eval automation decision tree
For every failure mode
Classify first, automate second.
- Specification gap: was the desired behavior never clearly specified? Fix the prompt or spec.
- Architecture gap: does the system lack retrieval, tools, memory, state, permissions, or data? Fix the architecture.
- Generalization gap: does it work inconsistently across inputs? Build or strengthen evals and examples.
Only automate generalization failures. Do not build elaborate evals for missing instructions or missing capabilities.
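The classification above can be sketched as a small function. The names (`Gap`, `classify_failure`, `should_automate_eval`) and the two boolean inputs are hypothetical, and asking the questions in this order is an assumption based on the order in which the text lists the gaps.

```python
from enum import Enum


class Gap(Enum):
    SPECIFICATION = "specification"
    ARCHITECTURE = "architecture"
    GENERALIZATION = "generalization"


# Recommended fix per gap type, taken from the text above.
NEXT_ACTION = {
    Gap.SPECIFICATION: "fix the prompt or spec",
    Gap.ARCHITECTURE: "fix the architecture",
    Gap.GENERALIZATION: "build or strengthen evals and examples",
}


def classify_failure(behavior_was_specified: bool, capability_exists: bool) -> Gap:
    """Classify one failure mode by asking the questions in order."""
    if not behavior_was_specified:
        return Gap.SPECIFICATION
    if not capability_exists:
        return Gap.ARCHITECTURE
    # Specified and possible, yet inconsistent across inputs.
    return Gap.GENERALIZATION


def should_automate_eval(gap: Gap) -> bool:
    """Only generalization failures justify an automated eval."""
    return gap is Gap.GENERALIZATION
```

For example, a failure where the desired behavior was specified but the system had no retrieval step classifies as `Gap.ARCHITECTURE`, so `should_automate_eval` returns `False`.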
Eval suite rules
- Code evals first: schema, required fields, citations, no invented IDs, enum labels, tool order, parameters, latency, cost, token usage, safety rules
- LLM judges only for narrow semantic or taste criteria
- Binary pass/fail beats arbitrary 1-10 scoring
- Calibrate judges against human labels; track true positive rate (TPR) and true negative rate (TNR)
- Add near-miss examples to reduce false passes
- Human review required when judges do not align, stakes are high, or experts disagree
- Dataset rows need: input, output, tool calls or intermediate steps, reference, labels, trace code, notes, segment, intent, model and version, timestamp
- Overrepresent hard cases, not only frequent easy cases
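Two of the rules above lend themselves to a short illustration: deterministic code evals and judge calibration. The sketch below assumes a JSON output with hypothetical required fields and labels, and it treats a human "pass" label as the positive class when computing TPR and TNR. None of these specifics come from the source.

```python
import json

# Hypothetical output spec, for illustration only.
REQUIRED_FIELDS = ("label", "summary", "citations")
ALLOWED_LABELS = {"billing", "bug", "feature_request"}


def code_eval(raw_output: str, known_source_ids: set[str]) -> list[str]:
    """Return the names of failed checks; an empty list means pass."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(data, dict):
        return ["invalid_schema"]

    failures = [f"missing_field:{f}" for f in REQUIRED_FIELDS if f not in data]

    label = data.get("label")
    if "label" in data and not (isinstance(label, str) and label in ALLOWED_LABELS):
        failures.append("invalid_label")

    citations = data.get("citations", [])
    if not isinstance(citations, list):
        failures.append("invalid_citations")
    else:
        for cid in citations:
            # A citation must be a known source ID, never an invented one.
            if not isinstance(cid, str) or cid not in known_source_ids:
                failures.append(f"invented_citation:{cid}")
    return failures


def judge_calibration(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """TPR and TNR of a binary judge, with human labels as ground truth."""
    if len(human) != len(judge):
        raise ValueError("human and judge labels must have the same length")
    positives = sum(human)
    negatives = len(human) - positives
    true_pos = sum(h and j for h, j in zip(human, judge))
    true_neg = sum((not h) and (not j) for h, j in zip(human, judge))
    return {
        "tpr": true_pos / positives if positives else float("nan"),
        "tnr": true_neg / negatives if negatives else float("nan"),
    }
```

A low TNR means the judge passes outputs that humans fail, which is the false-pass problem that near-miss examples are meant to reduce.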
Production monitoring rules
Code evals on 100% of traffic when feasible. LLM judges on a 1-10% sample plus flagged sessions. Monitor offline vs online pass rate, code failures, judge pass rate, latency, cost, retries, edits, accepts, thumbs, support themes, abandonment, and "I do not know" rate. If offline metrics improve while complaints rise, the eval suite is measuring the wrong thing.
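The sampling rule and the offline-vs-online warning above can be sketched as follows. The 5% rate is an assumed value inside the stated 1-10% range, hash-based sampling is one possible choice rather than something the source prescribes, and the function names are hypothetical.

```python
import hashlib

JUDGE_SAMPLE_RATE = 0.05  # assumed value within the 1-10% range


def in_judge_sample(session_id: str, rate: float = JUDGE_SAMPLE_RATE) -> bool:
    """Deterministic sampling: the same session is always in or out."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big")  # integer in [0, 2**32)
    return bucket < rate * 2**32


def evals_to_run(session_id: str, flagged: bool) -> list[str]:
    """Code evals on every session; LLM judge on the sample plus flagged sessions."""
    evals = ["code"]
    if flagged or in_judge_sample(session_id):
        evals.append("llm_judge")
    return evals


def suite_is_misaligned(
    offline_pass_prev: float,
    offline_pass_now: float,
    complaints_prev: int,
    complaints_now: int,
) -> bool:
    """True when offline metrics improve while complaints rise."""
    return offline_pass_now > offline_pass_prev and complaints_now > complaints_prev
```

Hashing the session ID instead of drawing a random number keeps the sample stable across reruns, so a judged session can be re-scored after a judge prompt changes.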
AI-specific failure modes
- Missed ambiguity: the model answered confidently when the input was unclear
- Unsupported claim: the model made a factual or quantitative claim without evidence
- Wrong tool: the model used a tool when it should not have, or failed to use one when it should have
- Missing required field: the output omitted a field the spec requires
- Invented citation: the model cited a source that does not exist
- Weak synthesis: the model summarized without adding useful structure or insight
- Dropped multi-intent: the model addressed one user intent and ignored the others
- Failed escalation: the model should have refused or escalated but did not
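Because these failure modes are meant to be named and counted, a fixed taxonomy helps keep trace labels consistent. A minimal sketch, assuming each reviewed trace is labeled with zero or more modes; the `FailureMode` enum and `failure_counts` helper are hypothetical names.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    MISSED_AMBIGUITY = "missed_ambiguity"
    UNSUPPORTED_CLAIM = "unsupported_claim"
    WRONG_TOOL = "wrong_tool"
    MISSING_REQUIRED_FIELD = "missing_required_field"
    INVENTED_CITATION = "invented_citation"
    WEAK_SYNTHESIS = "weak_synthesis"
    DROPPED_MULTI_INTENT = "dropped_multi_intent"
    FAILED_ESCALATION = "failed_escalation"


def failure_counts(trace_labels: list[list[FailureMode]]) -> Counter:
    """Count how many reviewed traces exhibit each failure mode."""
    counts: Counter = Counter()
    for labels in trace_labels:
        counts.update(set(labels))  # count each mode once per trace
    return counts
```

Sorting the result with `most_common()` gives a ranked list of which failure modes to classify and address first.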
Next Step
See the customer-growth gaps before competitors close them.
Start with the free opportunity audit or go straight to a working session with Jake.
Email Jake directly at jake@northsignal.studio