Overview
AI Session Score is a per-PR confidence signal for AI-assisted and AI-driven code changes. It reads what the AI did, what the user asked for, and what the diff contains, and turns them into a single reviewer-facing artifact: a score, a flag, and a list of items worth looking into.
AI Session Score is powered by session data captured through chainloop trace, the source code changes in the PR, and other contextual signals such as AI code review bot comments. When developers use AI coding agents with chainloop trace enabled, every session — models used, tool calls, code changes, and the full conversation — is automatically recorded and attested. AI Session Score evaluates all of this data together on each PR to produce its verdict.
To start generating AI Session Scores, enable chainloop trace in your repositories. No additional configuration is needed — once session data flows in, scores are computed automatically on PRs.
- PR authors — look for sections titled “Improving the score” for what to address before merge.
- Reviewers — look for “How to read it” and the items list discussion for where to focus attention.
What You’ll See
Every AI Session Score result has:
- Summary — one sentence with the headline judgment and the most important reason.
- Flag — Red, Yellow, or Green, indicating confidence in the AI part of the PR.
- Score — a 0-100 indication of how well the PR met the criteria for a good change done with or by AI.
- Items list — the actionable output: specific things a reviewer should check, with a link to the relevant moment in the session or file in the diff.
- Sub-flags — per-criterion Red / Yellow / Green flags showing which axes raised the concern.
How to Read a Result
Start at the summary line. It tells you the headline and the dominant reason in one sentence. If you only have a minute, this is what to read.
Use the flag for triage, not the score. The flag is derived from the distribution of sub-flags, not from the number. A change with one Red sub-flag will be Yellow even if everything else is green — that’s intentional. The score is a band, not a precise measurement; the same input can return 65 on one run and 72 on the next.
Score and flag are independent. There is no fixed score band that maps to Red, Yellow, or Green. A score of 75 might be Green on one PR and Yellow on another, depending on which sub-flags fired.
Per-criterion sub-flags tell you what kind of concern this is. A Yellow on Verification means something different from a Yellow on User trust signal. Each criterion section in the Score References below explains what its flag means and what to check when it fires.
Treat the items list as the product. A green score does not mean “skip review.” It means “use the items list to focus where you look.” The list is curated, de-duplicated, and ordered by severity — start at the top.
Each entry in the items list has:
| Field | Description |
|---|---|
| Severity | high, medium, or low — useful to understand impact on the change |
| Criterion | Which axis flagged this (Verification, Alignment, etc.) — tells you what kind of concern it is |
| Summary | One sentence describing what was observed |
| Reference | A pointer to files or AI sessions (may not always be present) |
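To make the shape concrete, here is a hypothetical result sketched as a Python dict. Every field name and value below is an illustrative assumption, not the actual output format:

```python
# Hypothetical AI Session Score result -- field names and values are
# illustrative assumptions, not the real schema.
result = {
    "summary": "Yellow: the change works, but tests were never run in this session.",
    "flag": "yellow",   # red / yellow / green
    "score": 68,        # 0-100; treat as a band, not a measurement
    "items": [
        {
            "severity": "high",           # high / medium / low
            "criterion": "verification",  # which axis flagged it
            "summary": "No test runner was invoked at any point in the session.",
            "reference": "session turn 14",  # may be absent
        },
    ],
    "sub_flags": {  # per-criterion sub-flags
        "context_planning": "green",
        "alignment": "green",
        "scope_discipline": "green",
        "solution_quality": "green",
        "verification": "yellow",
        "user_trust_signal": "green",
    },
}
```

Note how the single Yellow sub-flag on Verification drives both the headline flag and the top item.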
The Six Criteria
AI Session Score is built from six independent criterion judges plus a final aggregator. Each criterion scores one axis of confidence:
| Score | What it asks |
|---|---|
| Context & Planning Score | Was the AI set up to succeed, or set up to wing it? |
| Alignment Score | Did the AI stay on the task that was actually asked? |
| Scope Discipline Score | Did changes stay within scope, or did the AI feature-creep? |
| Solution Quality Score | Is the change a real fix, or a workaround that masks the problem? |
| Verification Score | Was the change actually validated? |
| User Trust Signal Score | What does the user’s behavior across the session tell us? |
Five of the six criteria can be addressed by changing how you work. User trust signal is the exception — it’s interpretive. A flagged result there scores how the session went and is best used as input for the next session, not as something to fix on the current PR.
A criterion can abstain when there isn’t enough evidence to judge. Abstention is not the same as green — the items list will surface a “this wasn’t checked” item explicitly, so you can decide whether to look harder yourself.
What AI Session Score Is Not
- Not a merge gate. AI Session Score does not block PRs.
- Not a substitute for review. A green score still requires review; a red score does not mean the code is wrong, only that a reviewer should look harder.
- Not a developer score. It scores a change, not a person.
When the Score Updates
- A PR is opened. The score computes on the initial content.
- A new commit lands on the PR. The score recomputes, with a small delay so session data finishes syncing.
- A recognized AI code review bot comments. The score takes that feedback into account and recomputes.
Other events do not trigger a recompute. If you’ve addressed an item and don’t see the score change, push a commit — that’s the trigger.
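The trigger rules above can be sketched as a small predicate. This is a minimal illustration, assuming hypothetical event names — it is not a real API:

```python
# Minimal sketch of the recompute triggers described above.
# Event names are illustrative assumptions, not a real API.
RECOMPUTE_EVENTS = {
    "pr_opened",              # initial score on the PR's content
    "commit_pushed",          # recompute, after session data syncs
    "ai_review_bot_comment",  # recognized AI review bot feedback
}

def should_recompute(event: str) -> bool:
    """Only these events trigger an AI Session Score computation."""
    return event in RECOMPUTE_EVENTS
```

Session updates alone, description edits, and elapsed time all fall outside this set — which is why pushing a commit is the reliable way to force a recompute.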
AI Session Score only runs on PRs that have session data captured by chainloop trace. A PR authored without AI assistance is not scored at all — the absence of an AI Session Score means “this flow does not apply,” not “green.”
Score References
Context & Planning Score
This criterion asks whether the AI was set up to succeed — the framing, planning, and guidance available before substantive work began. It does not check whether the code is good (that’s Solution Quality Score) or whether the user reacted well later (that’s User Trust Signal Score).
What it checks
Setup quality — what guidance and framing the AI had before substantive editing began. This includes the user’s initial framing, any plan or spec produced before code was written, and any project-level guidance loaded at session start.
What counts as a plan: The judge is format-agnostic. A structured planning artifact, a spec committed before code was written, or freeform prose that lays out the approach all count. What matters is the content: naming the files to touch, the chosen approach, and acknowledged unknowns. A plan that is just “I’ll fix it” doesn’t count, regardless of where it appears.
What raises the flag
- Yellow: vague initial prompt that the user steered actively as work progressed; AI asked some clarifying questions; a partial plan appeared.
- Red: one-line prompt with no constraints; AI dives straight into code; consequential decisions made silently; no acknowledgment of unknowns.
A small task with a thin prompt is not automatically Yellow — context proportionate to the task is fine. The problem is a thin prompt that should have been thicker for the task that followed.
Why this isn’t green
Three patterns commonly trigger Yellow or Red on Context & Planning:
- Vague one-liner, no plan, scope drift. The first user turn was a thin instruction with no context, no constraints, and no acceptance criteria. The AI did not produce a plan or surface assumptions before editing.
- Silent decisions on consequential choices. The AI made decisions that materially shape the change — choosing one library over another, picking an architectural pattern — without naming the decision or surfacing a tradeoff.
- Plan landed too late to frame the work. A plan appeared in the session, but only after substantial editing was already done. The plan documented work that had happened rather than guiding work that was about to happen.
Improving the score
Most of the leverage on this criterion is on the next session, not the current PR:
- State intent, then constraints. Lead with what you want and what you do not want. “Add X to file Y, but don’t refactor the surrounding code” gives the AI more to work with than “add X.”
- Ask for a plan before editing. A short plan covering files to touch, the approach, and unknowns is cheap to produce and lets you redirect before any code is written.
- Surface decisions explicitly. When you notice the AI reaching a fork (library choice, schema design, naming), ask which options it considered and why it picked one.
- For an in-flight session, write the plan late if needed. Even a late plan is more useful than no plan.
Alignment Score
This criterion asks whether the AI stayed on the task that was actually asked, and whether what the AI said it did matches what the diff actually does. This is only about whether work matches what was asked. The quality of the work is judged by Solution Quality Score. Unsolicited extras are judged by Scope Discipline Score.
What it checks
Two questions, evaluated together:
- Intent vs. behavior. The user stated an intent; did the AI do that, or something else?
- Claim vs. reality. Along the way and at the end, the AI summarized what it did. Do those summaries match the diff?
What raises the flag
- Yellow: mostly aligned with some drift the user caught and corrected; AI summaries that overstate what was done; one volunteered side-action that was rolled back.
- Red: AI did something materially different from what was asked; claimed coverage or behavior does not match the diff; user had to redirect repeatedly across the session.
Recovery does not save the score. A session in which the AI eventually delivered correctly, but burned the user’s time on multiple confirmed misalignments along the way, is still flagged.
Why this isn’t green
- Claim doesn’t match the diff. The AI summarized the work in a way that sounds right but is subtly wrong.
- Premature “done”. The AI declared the change finished one or more times before it actually was.
- Volunteered action the user didn’t ask for. The AI took a step that wasn’t part of the request — removing something extra, bypassing a check, “while-I’m-here” cleanup.
Improving the score
- Read each item against both ends of the reference. Items on this criterion typically point at both a session turn (the AI’s claim) and a file:line (the actual code). Open them side by side.
- Audit other claims, not just the flagged ones. A claim-vs-reality miss is rarely isolated.
- Revert anything you didn’t ask for. If the AI did something extra and you don’t actively want it, take it out.
- For “premature done” patterns, add a verification step. A small test that pins the actual behavior makes future “done” claims auditable.
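One way to pin behavior is a test that exercises exactly what the “done” claim described. The function here is hypothetical, purely to illustrate the shape:

```python
def normalize_username(raw: str) -> str:
    """Hypothetical function the session claimed was finished."""
    return raw.strip().lower()

def test_normalize_username_pins_behavior():
    # Pins the exact behavior the "done" claim described: trimming
    # whitespace and lowercasing. If a later change quietly alters
    # either, this fails instead of going unnoticed.
    assert normalize_username("  Alice ") == "alice"
    assert normalize_username("BOB") == "bob"
```

A future premature “done” now has something concrete to be checked against.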
Scope Discipline Score
This criterion asks whether the changes stayed within the requested scope, or whether the AI feature-crept beyond the asked-for work. User-originated scope expansion counts as in-scope: if the user asks for additional changes mid-session, those become part of the ask and don’t trigger this criterion. The signal it looks for is AI-volunteered scope.
What it checks
The change in the diff vs. what the user actually asked for. This is the only criterion that fundamentally requires the diff — scope is a comparison between what was asked and what got changed.
What raises the flag
- Yellow: small unsolicited cleanups — a renamed nearby variable, a fixed typo, a drive-by formatting change.
- Red: unsolicited refactors of unrelated code; “while I was here” moments; touching files unrelated to the stated task; opportunistic abstractions (“I made this more reusable”).
Why this isn’t green
- Drive-by fix bundled with the feature. A small unrelated change shipped in the same commit or PR — a typo correction in a different module, a path fix in a build file.
- Opportunistic refactor of pre-existing code. The AI was reading or modifying a file for the feature and decided to also rename, restructure, or “improve” something it noticed along the way.
- Auto-regenerated files outside the feature surface. A tool or generator was run and produced changes in modules unrelated to the feature.
Improving the score
- Identify the unrelated changes. Look at the file list. Which files map to the feature, and which don’t?
- Split or revert. Move drive-by fixes into their own commit or their own PR.
- For opportunistic refactors, judge case-by-case. A rename that genuinely makes the feature clearer can be worth keeping — but it should be acknowledged in the PR description.
- For auto-regenerated noise, check whether the regeneration was actually needed. Limit the regeneration to what the change requires.
Solution Quality Score
This criterion asks whether the change is a real fix, or a workaround that masks the underlying problem. It is not about whether the work matches the ask (that’s Alignment Score) and not about whether the change is verified (that’s Verification Score).
What it checks
How the work was done — root-cause fix vs. shortcut, hack, or test-disabled-to-make-CI-green. Most shortcuts get announced in the session — the AI will say “let me just disable this” or “let me wrap this in a try/except” — so the transcript is the primary signal. The diff catches the shortcuts that were taken silently.
What raises the flag
- Yellow: working solution with code smells the user accepted; a TODO left behind; a one-off hack that’s labeled and justified; a check-bypass attempt that was caught and reverted before shipping.
- Red: errors silently swallowed; tests modified or disabled to make them pass; a check-bypass that landed in the diff (pre-commit hook bypassed, signing disabled, type-check suppressed); commented-out failing assertions.
A bypass attempt that was rejected before it shipped is still a signal — attempt is Yellow, shipped is Red.
Why this isn’t green
- Speculative fix shipped alongside the real fix. A genuine root-cause change landed, but a guess-fix shipped in the same commit — a sleep added “in case there’s a race,” a retry added “in case the request fails sometimes.”
- Silent error-swallow. A specific error or status code is treated as having a known meaning without verifying that’s actually what the underlying call returns in that case.
- Volunteered bypass of a check. The AI ran into a check it couldn’t satisfy — a pre-commit hook, a signing requirement, a type error — and bypassed it on its own initiative rather than fixing the underlying cause.
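The silent error-swallow pattern is easiest to see side by side. A minimal sketch, with hypothetical function names:

```python
import json

# Anti-pattern (silent error-swallow): every failure -- corrupt file,
# permission error, bad encoding -- is treated as "no settings yet".
def load_settings_swallowed(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except Exception:
        return {}

# Narrower version: only the one error whose meaning was actually
# verified (a missing file means no settings yet) maps to the default;
# everything else still surfaces to the caller.
def load_settings(path: str) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```

The first version masks real bugs; the second maps exactly one verified error to a domain meaning.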
Improving the score
- Look for changes to existing tests. If a previously-passing test was modified to make a new change pass, the test was usually telling you something the change broke.
- Find the speculative pieces. Anything that looks like “let’s also do X just in case” without a cited reason. If you remove that piece, does the real fix still work?
- Verify any error-as-meaning assumptions. For each place where a specific error or status is being mapped to a domain meaning, confirm that mapping is correct.
- Revert any bypasses. If a check was skipped, the change should re-engage it.
Verification Score
This criterion asks whether the change was actually validated — whether anyone, AI or human, observed the new behavior working.
What it checks
Two questions:
- Did tests run, and did they pass? Sessions tell you this directly via tool-call traces.
- Do the tests assert real behavior? A test can run and pass without actually exercising the change. A test that mocks the function being tested and asserts against the mock doesn’t count — real tests pass real input to the real function and assert on real output.
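The mock-asserting anti-pattern is easiest to see next to a real test. The function under test is hypothetical:

```python
from unittest.mock import Mock

def apply_discount(price: float, pct: float) -> float:
    """Hypothetical function under test."""
    return round(price * (1 - pct / 100), 2)

# Doesn't count: the mock stands in for the function under test, so the
# assertion only checks the mock's canned return value. It passes even
# if apply_discount is completely broken.
def test_discount_mocked():
    mocked = Mock(return_value=90.0)
    assert mocked(100.0, 10) == 90.0

# Counts: real input through the real function, asserted on real output.
def test_discount_real():
    assert apply_discount(100.0, 10) == 90.0
```

Both tests pass today, but only the second one would fail if `apply_discount` regressed.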
What raises the flag
- Yellow: tests added but only happy path; ran locally but not in CI; user said “looks good” without explicit testing; compile/lint/format only on a small change.
- Red: no tests added for new behavior; tests assert against mocks instead of real behavior; AI claimed to test something it never ran; user never verified.
Why this isn’t green
- Build / lint / type-check only. Compile-time validation passed, but no test runner was invoked at any point in the session.
- Test suite exists but was never run. The codebase has a test framework set up. The AI may have grepped for an existing test file, found none for the specific function, and proceeded without running the broader suite.
- Compile-checked, never exercised end-to-end. The change builds, but no one started a dev server, replayed a request, opened the UI, or hit the endpoint to confirm the new behavior.
Improving the score
- Check that existing tests actually exercise the new code path. Tests can pass while testing the old code path — if a test only loaded a happy-path fixture, the new branch may never have run.
- Run the existing test suite. If the project has one, invoke it. Lint and type-check are not substitutes.
- Add at least one test that pins the new behavior. A single test exercising the change is more verification than any amount of compile-time checking.
- Exercise the change end-to-end at least once. For UI: open it. For endpoint: hit it. For configuration: apply it in a sandbox or staging.
User Trust Signal Score
This criterion asks what the user’s behavior across the session tells us. The other criteria score what the AI produced; this one scores how the user reacted along the way.
Unlike the other five criteria, this one scores something you cannot retroactively change. A session is a historical record. The value of this criterion is as an interpretive lens for the other flags, and as input for the next session you run.
How to use this criterion
The most concrete reviewer move: ask whether the user was steering or being assisted. If the session reads like the user manually directing the AI step by step rather than delegating, the change may need a closer human read.
Read the result alongside the other flags:
- Yellow here + Yellow on Alignment or Solution Quality is the strongest version of this signal. The user’s friction was tied to something that actually went wrong.
- Yellow here + everything else Green is informational. The user worked harder than usual, but the result looks fine.
- Green here + a hacky session in the other criteria is also information — the user appears comfortable with the shortcuts.
What it checks
The user’s reaction arc across the session — corrections, interrupts, frustration markers, abandoned approaches, restarts.
What raises the flag
- Yellow: several mid-session corrections; user re-explained intent once or twice; mild friction with no abandonment.
- Red: sharp repeated corrections; explicit frustration markers; user restarted from scratch; user abandoned the AI’s approach and finished manually.
Why this isn’t green
- Substantive correction after the work was declared complete. The AI declared the change finished and then the user came back with a real correction.
- Repeated mid-task interrupts. The user had to interrupt the AI multiple times during the session to redirect.
- Steady low-grade friction without escalation. No “stop”, no caps — but the user re-explained intent more than once, asked the AI to redo a piece of work, or quietly took over part of the task.
Improving the score on the next session
- A clearer initial framing (state intent, constraints, and acceptance criteria upfront).
- Asking for a plan before editing.
- Earlier checkpoints — “show me what you’re about to do before doing it.”
- Choosing a different agent or workflow if the same friction recurs.
Final Scoring
The final stage rolls up the six per-criterion verdicts into the headline summary, score, flag, and items list a reviewer sees.
Reading a result step by step
- Read the summary line. One sentence with the headline judgment and the dominant reason.
- Glance at the per-criterion sub-flags. They tell you where the concern is.
- Work the items list top to bottom. It’s severity-ordered.
- For any criterion that is not green, open its section above. The criterion docs explain what triggered the flag.
- Treat abstention items with the same weight as flagged items. “This wasn’t checked” is not “no concern.”
How the rollup works
The rollup is itself a judgment, not a deterministic function. Two runs on the same input may produce slightly different scores or shuffle items of similar severity. The flag and the broad shape of the items list are more stable than the exact numbers.
Score: Each criterion’s score, flag, and contributed items are taken into account. Verification and Solution Quality carry the most weight. Red sub-flags lower the ceiling so green elsewhere cannot wash out a serious concern. Abstained criteria do not penalize the score; their weight is removed and the abstention is surfaced as an item.
Flag: A single Red sub-flag forces the headline to at least Yellow, even when everything else is Green. Abstained criteria are no-signal — they neither push toward green nor toward red.
Items list: De-duplicated when judges flag the same underlying problem from different angles, severity-ordered, and capped at a small number to stay skimmable. Abstentions are surfaced as explicit “this wasn’t checked” items.
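The documented floor rules for the headline flag can be sketched as a function — purely illustrative, since the real rollup is an LLM judgment, not deterministic code:

```python
# Illustrative sketch of the documented headline-flag floor rules only.
# The real rollup is an LLM judgment, not a deterministic function.
def headline_flag(sub_flags: list) -> str:
    active = [f for f in sub_flags if f != "abstain"]  # abstentions carry no signal
    if "red" in active:
        return "yellow"  # a single Red forces at least Yellow; a severe mix may go Red
    if "yellow" in active:
        return "yellow"  # a non-green sub-flag usually keeps the headline non-green
    return "green"
```

This is why a headline Yellow with mostly green sub-flags is expected behavior, and why abstentions neither raise nor lower the flag.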
Patterns that surprise people
| What you see | Why it happens |
|---|---|
| Headline Yellow with mostly green sub-flags | A single Red sub-flag forces at least Yellow. Two or more Yellow sub-flags also typically produce Yellow. |
| Headline Yellow on a config-only or docs-only change | Some changes are hard to verify in the usual ways. The rollup keeps the flag at Yellow rather than Red, but the verification gap is still surfaced. |
| Headline Green with an abstention item visible | A criterion abstained — it had no signal, not “looks fine.” |
| Headline biased toward Yellow when only a few criteria ran | A result built on two active judges is not the same as one built on six, and the summary line will say so. |
| Score number drifted on a recompute | The score is a band. A few points of drift between runs is expected. The flag is more stable. |
Special cases
- When the AI worked autonomously without user follow-up, the rollup does not bias the flag toward Yellow because of it, but does surface the absence of human verification as an item.
- When the only concern is verification on a non-testable change (configuration, documentation), the rollup may keep the flag at Yellow rather than Red. The gap is still surfaced.
- When few criteria had enough signal to run, the rollup biases one step toward Yellow and the summary line says so.
Troubleshooting
The score didn’t update after I changed something
The score recomputes when a new commit lands on the PR (and when a recognized AI code review bot leaves a comment). It does not recompute on edits-without-commit, on session updates alone, or on time. Push a commit; the recompute follows.
I addressed an item but it’s still on the list
A few possibilities:
- The recompute hasn’t happened yet. There is a small delay after a commit so session data finishes syncing.
- The fix didn’t address the underlying pattern. Re-read the item against the current diff or session.
- The item references a session turn from an earlier session in the PR. Some criteria evaluate sessions individually, and a session’s verdict is computed once when the session arrives and reused for all subsequent commits. The diff-level fix doesn’t change what already happened in the session record.
I disagree with a flag or item
The score is a triage hint, not a verdict. If a Yellow item points at something you’ve decided is fine, that’s a legitimate reviewer call. Disagreeing with the AI Session Score does not require “fixing” anything.
A criterion abstained — what do I do?
Abstention means “I had no signal here,” not “looks fine.” Read the abstention item to see what specifically wasn’t checked. Decide whether you can check it yourself.
The score drifted between runs even though nothing changed
LLM-based evaluation has run-to-run drift on the number. The flag is more stable than the number; treat the score as a band, not a measurement.
A PR has no AI Session Score at all
AI Session Score only runs on PRs with session data from chainloop trace. A PR authored without AI assistance is not scored. Absence of an AI Session Score is not Green — it means the flow does not apply.
A session I expected to see isn’t represented
Sessions are picked up at commit time. A session that arrived after the latest commit will be picked up when the next commit lands.
The headline went Yellow even though most sub-flags are Green
Designed behavior. A single non-green sub-flag is usually enough to keep the headline at Yellow. See Patterns that surprise people.
I want to lower a Yellow without changing the code
The score follows the work, not the other way around. If a Yellow is informational rather than actionable, it’s reasonable to merge anyway — the score does not block merges.
Glossary
| Term | Definition |
|---|---|
| AI Session Score | The per-PR confidence signal. Produces a summary, a flag, a 0-100 score, and an items list. |
| Criterion | One of the six axes AI Session Score evaluates. |
| Judge | The LLM-driven evaluator that produces a verdict for one criterion. |
| Rollup | The LLM-driven evaluator that combines the criterion verdicts into the headline result. |
| Verdict | A criterion judge’s output: a sub-score, a sub-flag (or abstention), a summary, and evidence. |
| Session | The transcript of one AI coding session captured by chainloop trace: user prompts, AI responses, tool calls, and results. |
| Session turn | A single message within a session. |
| Sub-flag | The Red / Yellow / Green flag a single criterion produces. |
| Headline flag | The overall Red / Yellow / Green at the top of an AI Session Score result. |
| Items list | The curated, severity-ordered list of things a reviewer should check. |
| Item | A single entry in the items list, with severity, criterion, summary, and usually a reference. |
| Reference | The pointer attached to an item: a session turn, a file:line, a CI/PR reference. |
| Evidence | The per-criterion observations a judge produces. The items list is built from this. |
| Abstention | A criterion’s “I had no signal here” output. Surfaced as an explicit item; not the same as Green. |
| Recompute | Re-running AI Session Score for a PR. Triggered by a new commit or recognized AI code review bot comment. |
Common pattern names
| Pattern | Criterion | Meaning |
|---|---|---|
| Drive-by fix | Scope Discipline | A small unrelated change shipped alongside the feature. |
| Premature done | Alignment | The AI declares the change finished before it actually is. |
| Claim-vs-reality miss | Alignment | The AI’s summary does not match the diff. |
| Speculative fix | Solution Quality | A guess-fix shipped alongside or instead of a real fix. |
| Silent error-swallow | Solution Quality | An error is treated as having a known meaning without verification. |
| Volunteered bypass | Solution Quality | The AI bypasses a check on its own initiative. |
| Abstention item | Any criterion | An explicit “this wasn’t checked” entry. |