Rating
How a score is calculated
Every clause produces a verdict and a 0–4 ordinal score. Per-clause ordinals average into per-regulation scores, and the overall score is the mean across all in-scope numeric verdicts.
1 — Per-clause rule scoring
Each clause in the regulation YAML pack defines a list of named deterministic rules with weights:
```yaml
checker:
  deterministic:
    - rule: structured_logging_imported
      weight: 0.3
    - rule: logging_at_tool_call_boundaries
      weight: 0.5
    - rule: logging_persistent_sink
      weight: 0.2
```

Each named rule (implemented in src/pipeline/checkers/rules.ts) takes the RECON signals + file inventory + worktree and returns a score in [0, 1] plus evidence records.
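For orientation, here is a minimal sketch of that rule contract in TypeScript. All names and shapes below are illustrative assumptions, not the actual definitions in rules.ts:

```ts
// Illustrative shapes only; the real types live in src/pipeline/checkers/rules.ts.
type ReconSignals = Record<string, unknown>; // RECON extraction output (assumed shape)
type FileInventory = string[];               // repo file listing (assumed shape)
type Worktree = { root: string };            // checked-out source tree (assumed shape)

interface EvidenceRecord {
  rule: string;   // which rule produced the finding
  file?: string;  // location, when the finding is file-scoped
  detail: string; // human-readable description
}

interface RuleResult {
  score: number;              // in [0, 1]
  evidence: EvidenceRecord[];
}

// Each named rule maps the shared inputs to a score plus evidence.
type DeterministicRule = (
  signals: ReconSignals,
  inventory: FileInventory,
  worktree: Worktree,
) => RuleResult;
```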
The clause’s raw score is the weight-normalised sum:
rawScore = Σ(rule.weight × rule.score) / Σ(rule.weight)
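A direct transcription of the formula, assuming rule results arrive as plain weight/score pairs:

```ts
interface WeightedRuleScore {
  weight: number; // from the clause's YAML pack
  score: number;  // rule output in [0, 1]
}

// rawScore = Σ(weight × score) / Σ(weight)
function rawScore(rules: WeightedRuleScore[]): number {
  const weighted = rules.reduce((sum, r) => sum + r.weight * r.score, 0);
  const total = rules.reduce((sum, r) => sum + r.weight, 0);
  return total === 0 ? 0 : weighted / total;
}
```

With the example pack above (weights 0.3 / 0.5 / 0.2), rule scores of 1.0, 0.5 and 0.0 yield (0.3·1.0 + 0.5·0.5 + 0.2·0.0) / 1.0 = 0.55, which lands in the partial band.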
2 — Raw → ordinal mapping
Raw scores map onto a 0–4 ordinal scale. The same scale applies across every regulation, so per-clause results are directly comparable.
| Raw range | Ordinal | Verdict | Meaning |
|---|---|---|---|
| ≥ 0.85 | 4 | pass | Strong evidence the clause is met |
| 0.65 ≤ raw < 0.85 | 3 | pass | Adequate evidence; minor gaps |
| 0.40 ≤ raw < 0.65 | 2 | partial | Some controls present, incomplete |
| 0.15 ≤ raw < 0.40 | 1 | fail | Inadequate; significant gaps |
| < 0.15 | 0 | fail | Absent; no evidence found |
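In code, the table is a cascade of threshold checks; a sketch, with band edges as fixed above:

```ts
type Verdict = 'pass' | 'partial' | 'fail';
type Ordinal = 0 | 1 | 2 | 3 | 4;

// Thresholds follow the table; each band is half-open at its upper bound.
function toOrdinal(raw: number): { ordinal: Ordinal; verdict: Verdict } {
  if (raw >= 0.85) return { ordinal: 4, verdict: 'pass' };
  if (raw >= 0.65) return { ordinal: 3, verdict: 'pass' };
  if (raw >= 0.40) return { ordinal: 2, verdict: 'partial' };
  if (raw >= 0.15) return { ordinal: 1, verdict: 'fail' };
  return { ordinal: 0, verdict: 'fail' };
}
```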
3 — Polarity (the Article 5 trick)
Most clauses are positive: more evidence of the control = better. But prohibition clauses like Article 5 (subliminal techniques, predictive policing solely from profiling, untargeted facial scraping, …) are negative: any evidence of the prohibited practice = violation. The pipeline detects these via the score_mapping.pass_default field on the clause and inverts the ordinal mapping:
- Raw 0.00 (no signal at all) → ordinal 4 · pass (no prohibited practice detected).
- Raw 1.00 (signal everywhere) → ordinal 0 · fail (clear violation evidence).
This is why a stock LangChain app correctly passes Art 5(1)(a) “subliminal techniques” — the prompt-pattern detector finds no dark-pattern phrasings, so raw 0.00 maps to ordinal 4 and the clause passes.
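One way to realise the inversion, sketched here on top of the toOrdinal helper above (the pipeline may implement it differently): flip the raw score before the ordinal mapping, which reproduces both endpoints.

```ts
// `passDefault` mirrors the clause's score_mapping.pass_default flag.
function clauseOrdinal(raw: number, passDefault: boolean) {
  return toOrdinal(passDefault ? 1 - raw : raw);
}

// clauseOrdinal(0.0, true) → { ordinal: 4, verdict: 'pass' }
// clauseOrdinal(1.0, true) → { ordinal: 0, verdict: 'fail' }
```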
4 — Per-regulation + overall aggregation
- Per-regulation score = arithmetic mean of ordinal scores across that regulation’s in-scope clauses (excludes n/a + external).
- Overall score = arithmetic mean across all in-scope numeric verdicts (treats every clause equally; no inter-regulation weighting in V0). Both roll-ups are sketched below.
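A sketch of both roll-ups, assuming each clause result carries its regulation id and either an ordinal or a special verdict (shapes assumed):

```ts
type ClauseResult =
  | { kind: 'scored'; regulation: string; ordinal: number }
  | { kind: 'n/a' | 'external'; regulation: string };

const mean = (xs: number[]) =>
  xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : NaN;

function aggregate(results: ClauseResult[]) {
  // Only in-scope numeric verdicts count; n/a and external are excluded.
  const scored = results.filter(
    (r): r is Extract<ClauseResult, { kind: 'scored' }> => r.kind === 'scored',
  );
  const byRegulation = new Map<string, number[]>();
  for (const r of scored) {
    byRegulation.set(r.regulation, [
      ...(byRegulation.get(r.regulation) ?? []),
      r.ordinal,
    ]);
  }
  return {
    overall: mean(scored.map((r) => r.ordinal)), // every clause weighted equally
    perRegulation: new Map(
      [...byRegulation].map(([reg, xs]) => [reg, mean(xs)]),
    ),
  };
}
```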
5 — Special verdicts
- n/a — clause exists but isn’t in scope for this risk classification (e.g. Art 12 logging requires high-risk; on a minimal-risk agent, it’s n/a).
- external — clause is classified as not auditable from code alone (e.g. real-time biometric ID in public spaces: deployment context required). Surfaced to the human reviewer separately, as in the sketch below.
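A sketch of how results might be partitioned across the full verdict space; field names are assumptions:

```ts
type ClauseVerdict = 'pass' | 'partial' | 'fail' | 'n/a' | 'external';

function partition(results: { clause: string; verdict: ClauseVerdict }[]) {
  return {
    // numeric verdicts feed the step-4 averages
    scored: results.filter(
      (r) => r.verdict === 'pass' || r.verdict === 'partial' || r.verdict === 'fail',
    ),
    // surfaced separately for human review
    external: results.filter((r) => r.verdict === 'external'),
    // out of scope for this risk classification
    notApplicable: results.filter((r) => r.verdict === 'n/a'),
  };
}
```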
6 — Future: LLM-judge fallback
For clauses whose raw score lands in the ambiguous band [0.3, 0.7], V1 will invoke an LLM judge with the clause text plus the relevant code chunks. The judge produces an independent verdict; disagreement with the deterministic result downgrades the verdict to partial with a note. The hook is structurally present in the pipeline; the call is stubbed in V0.