AI Evaluation & Decision-Making

Winning AI teams are not just better prompt writers. They are better quality-system designers. They turn traces into insight, insight into datasets, datasets into evals, evals into product decisions, and product decisions into compounding customer trust.

Core thesis

Winning AI teams define what good work looks like, instrument the work, study traces, build datasets, run evals, monitor production, and use failures to improve faster than competitors. The product is not the code; the product is the quality system.

Measure work, not usage

Agent Success Rate is the new north star

Agent Success Rate (ASR) is a composite metric that combines task completion, user actions (accept, edit, retry), thumbs up or down, semantic session signals, complaints, and eval scores. Usage alone tells you about adoption; ASR tells you whether the product works.
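The source names the signals that feed Agent Success Rate but not how they are combined. The following is a minimal sketch under stated assumptions: a subset of the signals is reduced to booleans per session, the weights and the success threshold are hypothetical, and `Session`, `session_score`, and `agent_success_rate` are illustrative names rather than an established API.

```python
from dataclasses import dataclass

# Hypothetical weights in percentage points (they sum to 100).
# The source lists the signals but does not say how to weight them.
WEIGHTS = {
    "task_completed": 30,
    "accepted": 20,
    "thumbs_up": 10,
    "eval_pass": 25,
    "no_complaint": 15,
}

@dataclass
class Session:
    task_completed: bool  # the agent finished the task
    accepted: bool        # the user accepted the output instead of retrying
    thumbs_up: bool       # explicit positive feedback
    eval_pass: bool       # offline or online evals passed for this session
    no_complaint: bool    # no complaint or support ticket was linked

def session_score(session: Session) -> int:
    """Sum the weights of every signal that is true for this session."""
    return sum(w for name, w in WEIGHTS.items() if getattr(session, name))

def agent_success_rate(sessions, threshold: int = 70) -> float:
    """Share of sessions whose weighted score reaches the (assumed) threshold."""
    if not sessions:
        return 0.0
    successes = sum(1 for s in sessions if session_score(s) >= threshold)
    return successes / len(sessions)
```

The weights matter less than the habit: ASR is computed from outcomes per session, so a rise in usage with a flat ASR shows up as adoption without improvement.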

The AI Flywheel

Agent Success Rate: composite north star for AI work quality
Trace Analysis: personally inspect real and synthetic sessions to find user intents and failure modes
Reference Dataset: golden outputs, edge cases, production failures, labels, notes, metadata
Offline Evals: pre-release regression suite for prompt, model, tool, or architecture changes
Online Monitoring: production scoring, drift detection, support themes, feedback back into the dataset

AI opportunity evaluation rubric

Score every serious AI opportunity on 10 dimensions:

User intent clarity: recurring input families rather than endless one-off prompts
Observable success: acceptance, completion, resolution, edits, or escalation rather than "seems helpful"
Golden-output feasibility: a domain expert can create ideal examples
Trace availability: a realistic data source exists
Deterministic checks: schema, citations, tool calls, required fields
LLM-judge suitability: binary, single-criterion judges are possible
Failure visibility: failures can be named and counted
Feedback loop: usage produces labels and signals
Risk and fallback clarity: the refusal or escalation threshold is clear
Compound advantage: eval data becomes a moat

Prefer opportunities where success can become traces, labels, datasets, evals, experiments, and monitoring. Avoid ideas where "good" remains vague after prototyping.
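The rubric above can be turned into a simple scorecard. This is a sketch only: the source does not specify a scale, so the 0 to 5 range per dimension, the unweighted total, and the function name `score_opportunity` are assumptions.

```python
# The ten rubric dimensions, in the order the rubric lists them.
DIMENSIONS = [
    "user_intent_clarity",
    "observable_success",
    "golden_output_feasibility",
    "trace_availability",
    "deterministic_checks",
    "llm_judge_suitability",
    "failure_visibility",
    "feedback_loop",
    "risk_and_fallback_clarity",
    "compound_advantage",
]

def score_opportunity(scores: dict, max_per_dimension: int = 5) -> dict:
    """Total an opportunity's scores and surface its three weakest dimensions.

    The 0..max_per_dimension scale is an assumption, not part of the rubric.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= scores[d] <= max_per_dimension:
            raise ValueError(f"{d} must be between 0 and {max_per_dimension}")
    total = sum(scores[d] for d in DIMENSIONS)
    # sorted() is stable, so ties keep the rubric's original order.
    weakest = sorted(DIMENSIONS, key=lambda d: scores[d])[:3]
    return {
        "total": total,
        "max": max_per_dimension * len(DIMENSIONS),
        "weakest": weakest,
    }
```

Reporting the weakest dimensions alongside the total keeps a high aggregate score from hiding a missing prerequisite such as trace availability.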

Eval automation decision tree

For every failure mode

Classify first, automate second.

Specification gap: was the desired behavior never clearly specified? Fix the prompt or spec.
Architecture gap: does the system lack retrieval, a tool, memory, state, permissions, or data? Fix the architecture.
Generalization gap: does it work inconsistently across inputs? Build or strengthen evals and examples.

Automate evals only for generalization failures. Do not build elaborate evals for missing instructions or missing capabilities.
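The triage above can be sketched as a small decision function. The two boolean inputs are an assumed simplification of the questions in the text, and `Gap`, `classify_failure`, and `should_automate_eval` are hypothetical names.

```python
from enum import Enum

class Gap(Enum):
    SPECIFICATION = "specification"
    ARCHITECTURE = "architecture"
    GENERALIZATION = "generalization"

# Remedy for each gap type, taken from the decision tree in the text.
REMEDY = {
    Gap.SPECIFICATION: "fix the prompt or spec",
    Gap.ARCHITECTURE: "fix the architecture",
    Gap.GENERALIZATION: "build or strengthen evals and examples",
}

def classify_failure(behavior_was_specified: bool, capability_exists: bool) -> Gap:
    """Ask the questions in order: spec first, then capability."""
    if not behavior_was_specified:
        return Gap.SPECIFICATION
    if not capability_exists:
        return Gap.ARCHITECTURE
    # Specified and possible, yet inconsistent across inputs.
    return Gap.GENERALIZATION

def should_automate_eval(gap: Gap) -> bool:
    """Only generalization failures justify an automated eval."""
    return gap is Gap.GENERALIZATION
```

For example, a failure where the prompt never stated the required output format classifies as a specification gap, so the remedy is a spec change and no new eval is built.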

Eval suite rules

Code evals first: schema, required fields, citations, no invented IDs, enum labels, tool order, parameters, latency, cost, token usage, safety rules
LLM judges only for narrow semantic or taste criteria
Binary pass/fail beats arbitrary 1-10 scoring
Calibrate judges against human labels; track true positive rate (TPR) and true negative rate (TNR)
Add near-miss examples to reduce false passes
Human review required when judges do not align, stakes are high, or experts disagree
Dataset rows need: input, output, tool calls or intermediate steps, reference, labels, trace code, notes, segment, intent, model and version, timestamp
Overrepresent hard cases, not only frequent easy cases
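Two of the rules above lend themselves to a short illustration: deterministic code evals and judge calibration. The sketch below assumes a hypothetical output shape (`label`, `summary`, `citations`) and a hypothetical label enum; neither comes from the source. In the calibration helper, `True` means "pass", so TPR is how often the judge passes what humans passed and TNR is how often it fails what humans failed.

```python
# Hypothetical spec for illustration only.
REQUIRED_FIELDS = ("label", "summary", "citations")
ALLOWED_LABELS = {"bug", "feature_request", "question"}

def code_eval(output: dict, known_source_ids: set) -> list:
    """Run deterministic checks; return failed check names (empty means pass)."""
    failures = []
    for field in REQUIRED_FIELDS:
        if field not in output:
            failures.append(f"missing_field:{field}")
    if "label" in output and output["label"] not in ALLOWED_LABELS:
        failures.append("invalid_label")
    for cid in output.get("citations", []):
        if cid not in known_source_ids:
            failures.append(f"invented_citation:{cid}")
    return failures

def judge_agreement(human: list, judge: list) -> dict:
    """Compare binary judge verdicts with human labels (True means pass)."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)
    fn = sum(1 for h, j in pairs if h and not j)
    tn = sum(1 for h, j in pairs if not h and not j)
    fp = sum(1 for h, j in pairs if not h and j)
    return {
        # None when there are no human-labelled examples of that class.
        "tpr": tp / (tp + fn) if (tp + fn) else None,
        "tnr": tn / (tn + fp) if (tn + fp) else None,
    }
```

A low TNR means the judge is letting failures through, which is the situation the near-miss examples are meant to correct.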

Production monitoring rules

Code evals on 100% of traffic when feasible. LLM judges on a 1-10% sample plus flagged sessions. Monitor offline vs online pass rate, code failures, judge pass rate, latency, cost, retries, edits, accepts, thumbs, support themes, abandonment, and "I do not know" rate. If offline metrics improve while complaints rise, the eval suite is measuring the wrong thing.
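The sampling rule and the divergence warning can be sketched as follows. This assumes each session has a stable string ID; the 5% default rate is one arbitrary point in the 1-10% range from the text, and both function names are hypothetical. Hashing the ID keeps the sampling decision reproducible for a given session.

```python
import hashlib

def should_run_judge(session_id: str, flagged: bool, sample_rate: float = 0.05) -> bool:
    """Run the LLM judge on every flagged session plus a stable random sample."""
    if flagged:
        return True
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

def eval_suite_is_suspect(offline_pass_rate_delta: float, complaint_rate_delta: float) -> bool:
    """True when offline metrics improve while complaints rise.

    Per the rule above, that combination suggests the eval suite
    is measuring the wrong thing.
    """
    return offline_pass_rate_delta > 0 and complaint_rate_delta > 0
```

Code evals are not sampled here because the rule is to run them on all traffic when feasible; only the more expensive judge calls go through `should_run_judge`.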

AI-specific failure modes

Missed ambiguity: the model answered confidently when the input was unclear
Unsupported claim: the model made a factual or quantitative claim without evidence
Wrong tool: the model used a tool when it should not have, or failed to use one when it should have
Missing required field: the output omitted a field the spec requires
Invented citation: the model cited a source that does not exist
Weak synthesis: the model summarized without adding useful structure or insight
Dropped multi-intent: the model addressed one user intent and ignored the others
Failed escalation: the model should have refused or escalated but did not
