Rating
How a score is calculated
Every clause produces a verdict and a 0–4 ordinal score. Per-clause ordinals average into per-regulation scores, and the overall score is the mean across all in-scope numeric verdicts.
1 — Per-clause rule scoring
Each clause in the regulation YAML pack defines a list of named deterministic rules with weights:
```yaml
checker:
  deterministic:
    - rule: structured_logging_imported
      weight: 0.3
    - rule: logging_at_tool_call_boundaries
      weight: 0.5
    - rule: logging_persistent_sink
      weight: 0.2
```

Each named rule (implemented in src/pipeline/checkers/rules.ts) takes the RECON signals + file inventory + worktree and returns a score in [0, 1] plus evidence records.
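For orientation, here is a minimal sketch of that rule contract in TypeScript. All names and shapes below are illustrative assumptions, not the actual definitions in rules.ts:

```ts
// Illustrative shapes only; the real types live in src/pipeline/checkers/rules.ts.
type ReconSignals = Record<string, unknown>; // RECON extraction output (assumed shape)
type FileInventory = string[];               // repo file listing (assumed shape)
type Worktree = { root: string };            // checked-out source tree (assumed shape)

interface EvidenceRecord {
  rule: string;   // which rule produced the finding
  file?: string;  // location, when the finding is file-scoped
  detail: string; // human-readable description
}

interface RuleResult {
  score: number;              // in [0, 1]
  evidence: EvidenceRecord[];
}

// Each named rule maps the shared inputs to a score plus evidence.
type DeterministicRule = (
  signals: ReconSignals,
  inventory: FileInventory,
  worktree: Worktree,
) => RuleResult;
```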
The clause’s raw score is the weight-normalised sum:
rawScore = Σ(rule.weight × rule.score) / Σ(rule.weight)
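A direct transcription of the formula, assuming rule results arrive as plain weight/score pairs:

```ts
interface WeightedRuleScore {
  weight: number; // from the clause's YAML pack
  score: number;  // rule output in [0, 1]
}

// rawScore = Σ(weight × score) / Σ(weight)
function rawScore(rules: WeightedRuleScore[]): number {
  const weighted = rules.reduce((sum, r) => sum + r.weight * r.score, 0);
  const total = rules.reduce((sum, r) => sum + r.weight, 0);
  return total === 0 ? 0 : weighted / total;
}
```

With the example pack above (weights 0.3 / 0.5 / 0.2), rule scores of 1.0, 0.5 and 0.0 yield (0.3·1.0 + 0.5·0.5 + 0.2·0.0) / 1.0 = 0.55, which lands in the partial band.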
2 — Raw → ordinal mapping
Raw scores map onto a 0–4 ordinal scale. The same scale applies across every regulation, so per-clause results are directly comparable.
| Raw range | Ordinal | Verdict | Meaning |
|---|---|---|---|
| ≥ 0.85 | 4 | pass | Strong evidence the clause is met |
| 0.65 ≤ raw < 0.85 | 3 | pass | Adequate evidence; minor gaps |
| 0.40 ≤ raw < 0.65 | 2 | partial | Some controls present, incomplete |
| 0.15 ≤ raw < 0.40 | 1 | fail | Inadequate; significant gaps |
| < 0.15 | 0 | fail | Absent; no evidence found |
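In code, the table is a cascade of threshold checks; a sketch, with band edges as fixed above:

```ts
type Verdict = 'pass' | 'partial' | 'fail';
type Ordinal = 0 | 1 | 2 | 3 | 4;

// Thresholds follow the table; each band is half-open at its upper bound.
function toOrdinal(raw: number): { ordinal: Ordinal; verdict: Verdict } {
  if (raw >= 0.85) return { ordinal: 4, verdict: 'pass' };
  if (raw >= 0.65) return { ordinal: 3, verdict: 'pass' };
  if (raw >= 0.40) return { ordinal: 2, verdict: 'partial' };
  if (raw >= 0.15) return { ordinal: 1, verdict: 'fail' };
  return { ordinal: 0, verdict: 'fail' };
}
```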
3 — Polarity (the Article 5 trick)
Most clauses are positive: more evidence of the control = better. But prohibition clauses like Article 5 (subliminal techniques, predictive policing solely from profiling, untargeted facial scraping, …) are negative: any evidence of the prohibited practice = violation. The pipeline detects these via the score_mapping.pass_default field on the clause and inverts the ordinal mapping:
- Raw 0.00 (no signal at all) → ordinal 4 · pass (no prohibited practice detected).
- Raw 1.00 (signal everywhere) → ordinal 0 · fail (clear violation evidence).
This is why a stock LangChain app correctly passes Art 5(1)(a) “subliminal techniques” — the prompt-pattern detector finds no dark-pattern phrasings, so raw 0.00 maps to ordinal 4 and the clause passes.
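One way to realise the inversion, sketched here on top of the toOrdinal helper above (the pipeline may implement it differently): flip the raw score before the ordinal mapping, which reproduces both endpoints.

```ts
// `passDefault` mirrors the clause's score_mapping.pass_default flag.
function clauseOrdinal(raw: number, passDefault: boolean) {
  return toOrdinal(passDefault ? 1 - raw : raw);
}

// clauseOrdinal(0.0, true) → { ordinal: 4, verdict: 'pass' }
// clauseOrdinal(1.0, true) → { ordinal: 0, verdict: 'fail' }
```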
4 — Per-regulation + overall aggregation
- Per-regulation score = arithmetic mean of ordinal scores across that regulation’s in-scope clauses (excludes n/a + external).
- Overall score = arithmetic mean across all in-scope numeric verdicts (treats every clause equally; no inter-regulation weighting in V0). Both roll-ups are sketched below.
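A sketch of both roll-ups, assuming each clause result carries its regulation id and either an ordinal or a special verdict (shapes assumed):

```ts
type ClauseResult =
  | { kind: 'scored'; regulation: string; ordinal: number }
  | { kind: 'n/a' | 'external'; regulation: string };

const mean = (xs: number[]) =>
  xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : NaN;

function aggregate(results: ClauseResult[]) {
  // Only in-scope numeric verdicts count; n/a and external are excluded.
  const scored = results.filter(
    (r): r is Extract<ClauseResult, { kind: 'scored' }> => r.kind === 'scored',
  );
  const byRegulation = new Map<string, number[]>();
  for (const r of scored) {
    byRegulation.set(r.regulation, [
      ...(byRegulation.get(r.regulation) ?? []),
      r.ordinal,
    ]);
  }
  return {
    overall: mean(scored.map((r) => r.ordinal)), // every clause weighted equally
    perRegulation: new Map(
      [...byRegulation].map(([reg, xs]) => [reg, mean(xs)]),
    ),
  };
}
```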
5 — Special verdicts
- n/a — clause exists but isn’t in scope for this risk classification (e.g. Art 12 logging requires high-risk; on a minimal-risk agent, it’s n/a).
- external — clause is classified as not auditable from code alone (e.g. real-time biometric ID in public spaces: deployment context required). Surfaced to the human reviewer separately, as in the sketch below.
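A sketch of how results might be partitioned across the full verdict space; field names are assumptions:

```ts
type ClauseVerdict = 'pass' | 'partial' | 'fail' | 'n/a' | 'external';

function partition(results: { clause: string; verdict: ClauseVerdict }[]) {
  return {
    // numeric verdicts feed the step-4 averages
    scored: results.filter(
      (r) => r.verdict === 'pass' || r.verdict === 'partial' || r.verdict === 'fail',
    ),
    // surfaced separately for human review
    external: results.filter((r) => r.verdict === 'external'),
    // out of scope for this risk classification
    notApplicable: results.filter((r) => r.verdict === 'n/a'),
  };
}
```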
6 — Future: LLM-judge fallback
For clauses whose raw score lands in the ambiguous band [0.3, 0.7], V1 will invoke an LLM judge with the clause text plus the relevant code chunks. The judge produces an independent verdict; disagreement with the deterministic result downgrades the verdict to partial with a note. The hook is structurally present in the pipeline; the call is stubbed in V0.