โšก AgentSkillHub
Style Theme:
GitHub Login
โ† Back to Home
eval-packCommunity

Measure Mirror

measure-mirror is an open-source toolkit for catching when AI agents "game" experiments. Before an experiment starts, agents commit to a preregistered plan. Afterward, 23 statistical probes inspect the results for signs of cherry-picking, metric hacking, selective reporting, and other forms of evaluation manipulation. The goal is simple: make AI research results more trustworthy and reproducible.

Overview

measure-mirror is an open-source toolkit for catching when AI agents "game" experiments. Before an experiment starts, agents commit to a preregistered plan. Afterward, 23 statistical probes inspect the results for signs of cherry-picking, metric hacking, selective reporting, and other forms of evaluation manipulation. The goal is simple: make AI research results more trustworthy and reproducible.

When to use it

Catch AI evaluation illusions - false positives and false negatives - automatically.

โš™๏ธ Installation Command

Select your coding agent to generate the install script:

npx @agent-skill-hub/cli install measure-mirror

Risk Grade

CScore Grade C

Publisher

๐Ÿ‘ค JHLEE142

Last Updated

๐Ÿ“… 6/16/2026

License

๐Ÿ“„ N/A

Compatibility

github-repository

Repository

๐Ÿ”— GitHub Link
---
name: measure-mirror
description: "Catch AI evaluation illusions - false positives and false negatives - automatically."
compatibility:
  - github-repository
metadata:
  generated: true
  source: "github-repository"
---

# measure-mirror

Generated from repository metadata because this GitHub repository does not include a SKILL.md file.

# ๐Ÿชž Measurement Mirror

<p align="center">
  <img src="docs/measure_mirror_og.png" alt="Measurement Mirror" width="500">
</p>

[![CI](https://github.com/bhyi4/measure-mirror/actions/workflows/ci.yml/badge.svg)](https://github.com/bhyi4/measure-mirror/actions/workflows/ci.yml)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![deps: zero](https://img.shields.io/badge/deps-zero-brightgreen.svg)](pyproject.toml)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/)

**Catch AI evaluation illusions โ€” false positives and false negatives โ€” automatically.**  
Zero training ยท Deterministic ยท Zero-dep core (Python 3.10+ stdlib; `judge` module optional).

> Built while honestly killing our own project.  
> The makers ran it on themselves first. โ†’ [๐Ÿฆ‹ Origin Story](docs/CHRONICLE.md)

**[๐Ÿ“– Full Probe Guide โ†’](docs/GUIDE.md)** โ€” detailed explanations, worked examples, and workflows for all 23 probes

> **๐Ÿชž๐Ÿ”Ž๐Ÿชช New โ€” Mirror Stack** ([`stack/`](stack/)): measure-mirror is the *claims* layer of a
> three-mirror integrity stack for autonomous research agents (claims ยท actions ยท provenance,
> joined by five conventions + one `verify-all` command). Includes a real case study:
> **[an agent that retracted its own experiment before spending a single token](stack/CASE_STUDY_compute_governor.md)**
> โ€” preregistration โ†’ power-check design fix โ†’ adversarial amendments โ†’ prior-art retraction,
> with the actual chain-sealed ledger bundled so you can verify it yourself.
> measure-mirror itself is unchanged (feature-frozen core; the stack adds conventions, not probes).

| Tool | Audits | Question |
|---|---|---|
| ๐Ÿชž **measure-mirror** (you are here) | AI evaluation claims | **Is the claim honest?** |
| ๐Ÿชช [action-mirror](https://github.com/bhyi4/action-mirror) | Agent behaviour | Who did what, **provably**? |
| ๐Ÿ”Ž [provenance-mirror](https://github.com/bhyi4/provenance-mirror) | Content authenticity | Is the **origin** proven? |
| ๐Ÿ‘ [mirror-witness](https://github.com/bhyi4/mirror-witness) | Cross-operator witness board | Who else **witnessed** it? |

๐Ÿ’ฌ **[Discussions](https://github.com/orgs/mirror-stack/discussions)** โ€” questions ยท ideas ยท independent reproductions welcome.

---

## The Problem

AI/ML papers routinely overclaim. The most common failure modes:

| Illusion | How it happens |
|---|---|
| Small-sample mirage | n=9, acc=55.6% reported as breakthrough |
| Post-hoc metric swap | Register accuracy, report the F1 that happened to look better |
| Crippled baseline | Compare against a deliberately weak competitor |
| Data leakage | Train/test overlap inflating every number |
| Scope overreach | "Works on task A" claimed as "general reasoning" |

Measurement Mirror catches these **structurally**, not by opinion.

> **Precisely what it does (and does not).** It runs deterministic checks on the inputs *you
> provide* โ€” it is **input-driven**, not an autonomous flaw-hunter. Arithmetic/statistical probes
> (small-sample CI, GRIM, power, multiple-comparisons) are fully deterministic. But design-flaw
> probes (crippled baseline, gaming, scope) only fire on the baseline / reward terms / scopes you
> *declare* โ€” the tool cannot discover a flaw you hide, and many real catches are still **your
> judgment**, guided by the [discipline](https://github.com/bhyi4/measure-mirror/tree/main/stack/DISCIPLINE.md).
> It makes honesty *provable*; it does not *force* it.

---

## Install

```bash
pip install -e .                        # core (zero deps)
pip install -e ".[mcp]"                 # + MCP server for AI agents
pip install -e ".[judge]"               # + LLM-as-a-Judge runner (openai / anthropic)
pip install -e ".[test]"                # + pytest plugin
pip install -e ".[mcp,judge,test]"      # everything
```

CLI entry point: `mm`  
MCP entry point: `mm-mcp`

---

## Quick Start

### CLI

```bash
# Step 1 โ€” BEFORE running your experiment: seal the criteria
mm register my_model --metric acc --min-n 200 --baseline 0.5 --pass 0.60

# Step 2 โ€” AFTER evaluation: one-command audit
mm my_model                            # auto-loads my_model.json
mm audit my_model --acc 0.72 --n 500
mm audit --file results.json
```

`results.json` format: `{"claim_id": "my_model", "metric": "acc", "acc": 0.72, "n": 500}`

### Python API

```python
from measure_mirror import mm

LEDGER = "mm_ledger.jsonl"

# โ‘  Before experiment โ€” seal criteria (tamper-evident hash)
mm.preregister(LEDGER, "my_model",
               metric="acc", min_n=200, baseline=0.5, pass_threshold=0.60)

# โ‘ก After evaluation โ€” full 7-probe audit at once
findings = mm.full_audit(
    LEDGER, "my_model",
    reported_metric="acc", reported_acc=0.72, n=500,
    baseline=0.5,
    competing_name="strong_baseline", competing_acc=0.68,   # โ‘ก fairness
    reward_terms=["cross_entropy"],                          # โ‘ข gaming check
    train_items=train_set, test_items=test_set,              # โ‘ฃ leakage
    seed_results=[0.70, 0.72, 0.74],                         # โ‘ค multi-seed
    claimed_scope=["reasoning"], tested_scope=["task_a"],    # โ‘ฅ scope
)
mm.report("my_model", findings)

# Individual probes
mm.report("fairness", [mm.baseline_fairness("vs GRU", 0.72, 0.68)])
mm.report("leakage",  [mm.leakage_check(train_items, test_items)])
mm.report("seeds",    [mm.multiseed_check([0.70, 0.72, 0.74], baseline=0.5)])
```

### Regression / Continuous Metrics

For MSE, Pearson r, RMSE, and other non-binary metrics:

```python
findings = mm.continuous_audit(
    LEDGER, "my_regressor",
    reported_metric="mse", reported_value=0.10,
    baseline_value=0.15, n=500,
    higher_better=False,   # lower MSE is better
    std=0.02,              # optional: enables effect-size check
)
mm.report("regression", findings)
```

### pytest Integration (CI Gate)

```python
# conftest.py
pytest_plugins = ["measure_mirror.pytest_plugin"]

# test_eval.py
from measure_mirror import mm
from measure_mirror.pytest_plugin import assert_clean

def test_my_model_is_real():
    findings = mm.audit("ledger.jsonl", "my_model",
                        reported_metric="acc", reported_acc=0.78, n=1000)
    assert_clean(findings)   # FAIL findings โ†’ test fails โ†’ CI goes red
```

---

## Three Verification Tiers

You don't need to memorize 23 probes โ€” there are exactly three ways to use the mirror:

```bash
# FULL โ€” one shot, everything applicable runs automatically
mm verify --file results.json

# GROUP โ€” restrict to verification groups
mm verify --file results.json --groups stats judge
mm verify --list-groups

# INDIVIDUAL โ€” any probe directly (Python)
mm.grim_check(reported_acc=0.33, n=10)
```

```python
from measure_mirror import mm

# FULL: probes activate based on which keys exist in data
findings = mm.verify("ledger.jsonl", {
    "claim_id": "my_model", "acc": 0.72, "n": 500,        # โ†’ ledger + stats
    "seed_results": [0.70, 0.72, 0.74],                    # โ†’ โ‘ค
    "scores": [3, 7, 5, 8, 4],                             # โ†’ judge โ‘ฐ
})

# GROUP: same data, judge group only
findings = mm.verify("ledger.jsonl", data, groups=["judge"])
```

`verify()` runs only the probes whose inputs are present โ€” an empty key fires nothing,
extra keys fire extra probes. The `data` keys are identical to the JSON file format.

## Probes by Verification Group

### `ledger` โ€” pre-registration & ledger integrity

| Probe | # | Catches |
|---|---|---|
| `preregister` / `audit` | โ‘  | Post-hoc metric swap ยท sample underrun ยท ledger tampering |
| `verify_chain` | โ‘  | Deleted/inserted entries ยท ledger tampering |
| `cascade_check` | โ‘ซ | Claim or transitive dependency retracted โ†’ FAIL/WARN stale |

### `stats` โ€” statistical validity

| Probe | # | Catches |
|---|---|---|
| `audit` โ€” Wilson CI | โ‘ฃa | Results indistinguishable from chance (small sample) |
| `audit` โ€” direction | โ‘ฃa | Performance worse than baseline (anti-signal) |
| `multiseed_check` | โ‘ค | Unstable signal / lucky seed |
| `too_good_check` | โ‘ฆ | Suspiciously large ฮ” over baseline |
| `power_check` | โ‘ง | n too small to detect minimum effect (false-negative guard) |
| `multiple_comparisons_check` | โ‘จ | k>1 experiments in ledger โ€” Bonferroni correction alarm |
| `grim_check` | โ‘ฉ | Reported acc ร— n is arithmetically impossible (fabricated value) |

### `design` โ€” experiment-design fairness

| Probe | # | Catches |
|---|---|---|
| `baseline_fairness` | โ‘ก | Crippled / tied / reversed baseline |
| `gaming_check` | โ‘ข | Metric directly in reward/loss (self-fulfilling) |
| `leakage_check` | โ‘ฃa | Trainโˆฉtest data contamination |
| `scope_check` | โ‘ฅ | Claimed scope wider than tested scope |
| `falsifiability_check` | โ‘ช | No kill-condition โ†’ unfalsifiable; kill_threshold triggered โ†’ claim is dead |

### `negative` โ€” Resolved-Negative closure gate

| Probe | # | Catches |
|---|---|---|
| `negative_audit` | โ‘ฌ | Negative conclusion has too few independent angles, unregistered angles, or scope overshoot |

### `judge` โ€” LLM-judge reliability

| Probe | # | Catches |
|---|---|---|
| `judge_consistency_check` | โ‘ญ | LLM judge flip-rate too high โ€” unreliable judge detector |
| `judge_bias_check` | โ‘ฎ | Judge systematically favors position A or B โ€” position bias detector |
| `inter_rater_agreement` | โ‘ฏ | Cohen's ฮบ between two *different* judges below threshold (standalone-only) |
| `judge_score_sanity` | โ‘ฐ | Judge assigns identical/near-identical scores โ€” degenerate distribution |
| `judge_swap_check` | โ‘ฑ | Verdict stays with the slot after ABโ†’BA swap โ€” judge reads position, not content |

### `ranking` โ€” leaderboard integrity

| Probe | # | Catches |
|---|---|---|
| `judge_transitivity_check` | โ‘ฒ | A>B>C>A preference cycles โ€” judge has no consistent quality scale |
| `ranking_stability_check` | โ‘ณ | "A beats B" flips under bootstrap resampling โ€” ranking mirage |

| Utility | Purpose |
|---|---|
| `anchor` | Print tamper-evident ledger snapshot (hash + head seal) to stdout for external archival |
| `calibrate` | Self-test: 5 synthetic known-good/bad cases; confirms tool health |
| `witness` | Execute a command, capture output, seal tamper-evident run record |
| `retract` | Append a chain-linked retraction entry; cascades to dependents via cascade_check |
| `certificate` | Issue a sealed verification certificate: CERTIFIED / WITH-WARNINGS / UNVERIFIED / REJECTED |
| `badge` | Render a certificate as an embeddable markdown / SVG badge (verdict-colored) |

### Chain hash ledger (โ‘  extended)

Every `preregister()` call now embeds the previous entry's seal into the new one
before computing the SHA-256. This makes the ledger tamper-evident end-to-end:

```python
# verify the entire ledger chain at any time
findings = mm.verify_chain("mm_ledger.jsonl")
mm.report("ledger integrity", findings)
```

Catches: entry deletion, entry insertion, content modification.  
**Documented limitation**: complete file deletion + fresh re-registration is not caught โ€”
commit the ledger file to git for that guarantee.

### Power check โ‘ง

```python
# warn when n is too small to detect a real effect
f = mm.power_check(n=50, baseline=0.5, min_detectable_effect=0.05)
# โš ๏ธ  n=50 insufficient to detect ฮ”=+0.05 at 80% power (need nโ‰ฅ388)

# or activate via full_audit
findings = mm.full_audit(LEDGER, "my_model", ..., min_detectable_effect=0.05)
```

### Multiple comparisons โ‘จ

```python
# warn when k>1 experiments share a ledger (Bonferroni)
f = mm.multiple_comparisons_check("mm_ledger.jsonl")
# โš ๏ธ  k=3 experiments โ†’ Bonferroni ฮฑ=0.0167 (not 0.05)

# or activate via full_audit
findings = mm.full_audit(LEDGER, "my_model", ..., check_multiplicity=True)
```

### Falsifiability โ‘ช โ€” the Popper gate

```python
# Register BEFORE the experiment โ€” seal what would kill the claim
mm.preregister("mm_ledger.jsonl", "my_model",
               metric="acc", min_n=200, baseline=0.5, pass_threshold=0.60,
               # human-readable (optional)
               kill_condition="accuracy on held-out test drops below 0.55",
               # structured: auto-evaluated at audit time
               kill_threshold={"metric": "acc", "threshold": 0.55, "direction": "below"})

# After the experiment โ€” audit checks โ‘ช automatically
findings = mm.audit("mm_ledger.jsonl", "my_model",
                    reported_metric="acc", reported_acc=0.50, n=500)
# ๐Ÿ”ด [โ‘ช falsifiability] Kill condition triggered: acc=0.5 < 0.55.
#    Claim 'my_model' is falsified by its own pre-registered criterion.

# Or check standalone
f = mm.falsifiability_check("mm_ledger.jsonl", "my_model", reported_acc=0.50)
```

```bash
# CLI
mm register my_model --metric acc --min-n 200 --baseline 0.5 --pass 0.60 \
  --kill "accuracy < 0.55 on held-out" \
  --kill-threshold 0.55 --kill-direction below
```

Claims registered without any `kill_condition` or `kill_threshold` receive a
`WARN: Unfalsifiable` at audit time โ€” operationalizing Popper's criterion as a
code contract. OSF pre-registration accepts hypotheses; this is the first tool
that also seals the kill condition.

### Retraction cascade โ‘ซ

```python
# Register a claim that depends on prior work
mm.preregister("mm_ledger.jsonl", "model_v2",
               metric="acc", min_n=200, baseline=0.5, pass_threshold=0.60,
               depends_on=["dataset_v1", "baseline_eval"])

# Later: dataset_v1 turns out to have contamination โ€” retract it
mm.retract("mm_ledger.jsonl", "dataset_v1", "train/test overlap discovered")

# cascade_check flags model_v2 as STALE automatically
f = mm.cascade_check("mm_ledger.jsonl", "model_v2")
# โš ๏ธ  [โ‘ซ retraction-cascade] Claim 'model_v2' is STALE: depends (transitively) on
#     retracted claim(s): 'dataset_v1'

# audit() runs cascade_check automatically โ€” only WARN/FAIL are appended
findings = mm.audit("mm_ledger.jsonl", "model_v2",
                    reported_metric="acc", reported_acc=0.72, n=500)
```

```bash
# CLI
mm register model_v2 --metric acc --min-n 200 --baseline 0.5 --pass 0.60 \
  --depends-on dataset_v1 baseline_eval

mm retract dataset_v1 --reason "train/test overlap discovered"
```

The retraction entry is **chain-linked** โ€” deleting it from the ledger breaks the
chain and is detected by `verify_chain()`. Retraction propagates regardless of
publication order: a claim built on a retracted foundation is automatically stale.

### Negative-claim audit โ‘ฌ โ€” the premature-closure gate

A single negative result may reflect a frame flaw, not a universal wall.
`negative_audit` gates "Resolved-Negative" conclusions: you must present
โ‰ฅ `min_angles` independent pre-registered experiments before closing.

```python
# Register each independent angle BEFORE running the experiment
for angle_id in ["oee_test_wave", "oee_test_bilinear", "oee_test_alife"]:
    mm.preregister("mm_ledger.jsonl", angle_id,
                   metric="oee_score", min_n=100, baseline=0.5, pass_threshold=0.0)

# After all angles converge negative โ€” gate the closure
f = mm.negative_audit("mm_ledger.jsonl",
                      angles=["oee_test_wave", "oee_test_bilinear", "oee_test_alife"],
                      min_angles=3)
# โœ… [โ‘ฌ negative-audit] 3/3 independent pre-registered angle(s) verified โ€”
#    negative conclusion is supported.

# Optional scope check: negative conclusion must not overgeneralize
f = mm.negative_audit("mm_ledger.jsonl",
                      angles=["oee_test_wave", "oee_test_bilinear", "oee_test_alife"],
                      conclusion_scope=["all_substrates"],   # claimed scope
                      tested_scope=["in_silico"])            # actually tested
# ๐Ÿ”ด [โ‘ฌ negative-audit] conclusion scope includes untested domain(s): ['all_substrates'].
```

```bash
# CLI
mm negative --angles oee_test_wave oee_test_bilinear oee_test_alife --min-angles 3

# Or activate inside full_audit
findings = mm.full_audit(LEDGER, "main_claim", ..., angles=["a1", "a2", "a3"])
```

| Condition | Level |
|---|---|
| `len(angles) < min_angles` | FAIL โ€” premature closure |
| Any angle not pre-registered | FAIL โ€” unverifiable evidence |
| `conclusion_scope โŠ„ tested_scope` | FAIL โ€” scope overshoot |
| A registered angle is retracted | WARN โ€” weakened case |
| All checks pass | OK |

### LLM-as-a-Judge probes โ‘ญโ‘ฎโ‘ฏโ‘ฐ

LLM judges introduce their own failure modes that numeric metrics don't catch.
The four judge probes audit the **judge itself**, not just the model being evaluated.

```bash
pip install "measure-mirror[judge]"   # adds openai and anthropic
```

```python
from measure_mirror.judge import anthropic_judge, openai_judge, judge_run

# Build a judge callable (pairwise A-vs-B mode)
judge_fn = anthropic_judge(model="claude-opus-4-8")
# or: judge_fn = openai_judge(model="gpt-4o")

# Each item: {"prompt": str, "a": str, "b": str}  (pairwise)
#            {"prompt": str, "response": str}       (rating, pairwise=False)
pairs = [
    {"prompt": "Summarize quantum entanglement",
     "a": "candidate_A output", "b": "candidate_B output"},
    ...
]

# judge_run calls judge_fn runsร—len(items) times,
# fires โ‘ญโ‘ฎโ‘ฏโ‘ฐ(โ‘ฑ) automatically, and seals a chain-linked entry.
result = judge_run("mm_ledger.jsonl", "my_llm_eval",
                   judge_fn=judge_fn,
                   items=pairs,
                   runs=2,               # run each item twice โ†’ โ‘ญ consistency
                   pairwise=True,        # A-vs-B โ†’ โ‘ฎ bias check
                   swap_positions=True)  # extra ABโ†’BA pass โ†’ โ‘ฑ swap test

for f in result["findings"]:
    print(f"  {f.level}  [{f.probe}]  {f.msg}")
```

**The checks triggered by `judge_run`:**

| Probe | Catches |
|---|---|
| `judge_consistency_check` โ‘ญ | Judge gives different verdict on re-run (stochastic / unreliable) |
| `judge_bias_check` โ‘ฎ | Judge systematically favors position A or B regardless of content |
| `judge_score_sanity` โ‘ฐ | Judge assigns identical / near-identical scores to everything |
| `judge_swap_check` โ‘ฑ | Verdict stays with the slot after ABโ†’BA swap (content-blind judge) |

โ‘ฏ `inter_rater_agreement` is **standalone-only**: it compares two genuinely
different judges; re-running the same judge is already covered by โ‘ญ.

Unparseable judge responses score -1 and are **excluded from all probes**; a
`judge-parse` WARN fires when the failure rate exceeds 10% (FAIL when nothing parsed).

**Why โ‘ฑ matters** โ€” a deterministic judge that never reads the responses passes
โ‘ญ (perfectly consistent), โ‘ฎ (balanced win-rate), โ‘ฏ (ฮบ=1.0), and โ‘ฐ (varied scores).
Only swapping A and B exposes it: a content-driven judge must invert its verdict,
a content-blind judge keeps choosing the same slot. Run the demo:

```bash
python examples/demo_judge.py   # no API key needed โ€” mock judges
```

**Standalone usage (bring-your-own scores):**

```python
from measure_mirror import mm

# โ‘ญ consistency โ€” did the judge flip its verdict?
score_pairs = [(1, 1), (0, 0), (1, 0), (0, 0), (1, 1)]  # (run1, run2) per item
f = mm.judge_consistency_check(score_pairs, flip_threshold=0.20)
# โœ…  flip rate 20.0% โ‰ค 20.0% (1/5 flips). Consistent.

# โ‘ฎ position bias โ€” does A always win?
results = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # 0=A won, 1=B won
f = mm.judge_bias_check(results, bias_threshold=0.60)
# ๐Ÿ”ด  Position A win rate 90.0% > 60.0%. Strong position bias detected.

# โ‘ฏ inter-rater โ€” Cohen's ฮบ between two raters / runs
matrix = [(1, 1), (0, 0), (1, 1), (0, 1), (1, 0), (0, 0)]
f = mm.inter_rater_agreement(matrix, min_kappa=0.40)
# โš ๏ธ  Cohen's ฮบ=0.333 < 0.40 โ€” fair agreement only.

# โ‘ฐ score sanity โ€” degenerate distribution?
scores = [8, 8, 8, 8, 8, 8, 8, 7, 8, 8]  # 90% are 8
f = mm.judge_score_sanity(scores)
# โš ๏ธ  90% of scores are '8' โ€” near-degenerate distribution.

# โ‘ฑ position-swap โ€” does the verdict follow content or slot?
forward = [0, 1, 0, 1, 0]   # winners in (A, B) order
swapped = [0, 1, 0, 1, 0]   # winners after ABโ†’BA swap โ€” identical = locked!
f = mm.judge_swap_check(forward, swapped)
# ๐Ÿ”ด  Position-lock rate 100.0% > 65.0%. Judge is reading position, not content.

# โ‘ฒ transitivity โ€” does the judge have a consistent quality scale?
matches = [("gpt", "claude", 0), ("claude", "llama", 0), ("llama", "gpt", 0)]
f = mm.judge_transitivity_check(matches)
# ๐Ÿ”ด  Preference cycle detected: gpt > claude > llama > gpt.
#     Any leaderboard from these verdicts is order-dependent.

# โ‘ณ ranking stability โ€” does "A beats B" survive resampling?
f = mm.ranking_stability_check(scores_model_a, scores_model_b)
# ๐Ÿ”ด  Ranking 'A > B' survives only 64.2% of 1000 bootstrap resamples (n=7).
#     The ranking is noise โ€” indistinguishable from a tie.
```

```bash
# CLI: audit pre-collected judge scores from a JSON file
mm judge --file judge_scores.json
# keys: score_pairs / pairwise_results / ratings_matrix / scores /
#       forward_results + swapped_results / matches / scores_a + scores_b
```

### Certificate ๐Ÿ“œ โ€” one sealed verdict per claim

`certificate()` collapses the full integrity state of a claim into a single
verifiable artifact you can embed in a paper, README, or release notes:

```python
# Structural certificate (prereg seal + chain + retraction status)
cert = mm.certificate("mm_ledger.jsonl", "my_model")

# Full certificate โ€” fold audit findings in
findings = mm.audit("mm_ledger.jsonl", "my_model",
                    reported_metric="acc", reported_acc=0.72, n=500)
cert = mm.certificate("mm_ledger.jsonl", "my_model", findings=findings)
# {"verdict": "CERTIFIED", "prereg_seal": "6c802655ab095e8b",
#  "anchor_hash": "sha256...", "findings": {"ok": 4, "warn": 0, "fail": 0},
#  "seal": "9d1e83a4b72f0c5e", ...}
```

```bash
# CLI
mm certify my_model --pretty                  # structural only
mm certify my_model --acc 0.72 --n 500        # + audit findings folded in
mm certify my_model | gh gist create -        # publish externally
```

| Verdict | Meaning |
|---|---|
| `CERTIFIED` | Pre-registered, chain intact, not retracted, no FAIL/WARN findings |
| `CERTIFIED-WITH-WARNINGS` | Valid but has stale dependencies or WARN findings |
| `UNVERIFIED` | No pre-registration exists โ€” nothing to certify against |
| `REJECTED` | Chain broken, seal tampered, retracted, or FAIL findings |

The certificate embeds the ledger's `anchor_hash`, so it attests to **one specific
ledger state** โ€” and the certificate itself is sealed, so any field edit is detectable.

**Badge ๐Ÿท โ€” embed the verdict in your README:**

```bash
mm certify my_model --badge markdown >> README.md   # shields.io badge
mm certify my_model --badge svg > badge.svg          # offline self-contained SVG
```

```python
cert = mm.certificate("mm_ledger.jsonl", "my_model")
print(mm.badge(cert))                 # markdown (default)
print(mm.badge(cert, fmt="svg"))      # SVG with cert seal in the tooltip
# ![๐Ÿชž my_model: CERTIFIED](https://img.shields.io/badge/๐Ÿชž_my__model-CERTIFIED-brightgreen)
```

Badge color follows the verdict: CERTIFIED = green ยท WITH-WARNINGS = yellow ยท
UNVERIFIED = grey ยท REJECTED = red. The SVG variant embeds the certificate seal
and anchor-hash prefix in its tooltip, making the badge traceable back to the
exact sealed certificate it renders.

### Anchor โŽˆ

```bash
# Print tamper-evident ledger snapshot to stdout โ€” pipe wherever you trust
mm anchor                              # compact JSON (default)
mm anchor --pretty                     # human-readable

# External archival examples (zero new dependencies)
mm anchor >> ~/Dropbox/mm_anchors.jsonl          # local backup
mm anchor | gh gist create -                     # GitHub Gist
mm anchor | aws s3 cp - s3://bucket/anchor.json  # S3
```

```python
a = mm.anchor("mm_ledger.jsonl")
# {"_type": "anchor", "ts": "...", "entry_count": 3,
#  "head_seal": "a3b9f2c1", "anchor_hash": "sha256hex...", "chain_ok": true}
```

The `anchor_hash` (full SHA-256 of the ledger file) detects even **complete file replacement** โ€” the one attack that chain hashes alone cannot catch. Save it externally before publishing results.

### Calibrate + Witness run

```bash
# Verify the mirror itself is working correctly
mm calibrate
# โœ… [โš™ calibrate] 5/5 synthetic cases correct โ€” mirror is calibrated.

# Witness-execute a command: calibrate first, then run and seal the record
mm run my_model -- python evaluate.py --model my_model
# โœ… [โš™ calibrate] 5/5 synthetic cases correct โ€” mirror is calibrated.
#
# ๐ŸŽฌ Witnessed: my_model
#    Command:     python evaluate.py --model my_model
#    Started:     2026-06-11T12:00:00Z
#    Ended:       2026-06-11T12:00:03Z
#    Exit code:   0  (ok)
#    Output hash: a3b9f2c1e8d74f6a
#    Prev seal:   6c802655ab095e8b
#    Seal:        9d1e83a4b72f0c5e
```

```python
# Python API
findings = mm.calibrate()
mm.report("Mirror calibration", findings)

entry = mm.witness("mm_ledger.jsonl", "my_model",
                   ["python", "evaluate.py", "--model", "my_model"])
# entry["output_hash"] changes if the script output ever changes
```

### GRIM โ‘ฉ

```python
# catch arithmetically impossible accuracy values
f = mm.grim_check(reported_acc=0.33, n=10)
# ๐Ÿ”ด  acc=0.33 is arithmetically impossible for n=10.
#     No integer k satisfies round(k/10, 2) = 0.33.
#     (candidates: k=3 โ†’ 0.3, k=4 โ†’ 0.4). Fabricated value or mis-reported n.

# GRIM runs automatically inside audit() โ€” FAIL is appended to findings
findings = mm.audit(LEDGER, "my_model", reported_metric="acc", reported_acc=0.33, n=10)
```

---

## MCP Server โ€” AI Agent Integration

Any MCP-compatible AI (Claude Code, Cursor, Windsurf, โ€ฆ) can call Measurement Mirror directly mid-conversation.

### Setup

```bash
pip install "measure-mirror[mcp]"
```

**Claude Code** โ€” add to `.mcp.json` in your project root:

```json
{
  "mcpServers": {
    "measure-mirror": {
      "command": "python",
      "args": ["-m", "measure_mirror.mcp_server"],
      "cwd": "/path/to/measure-mirror"
    }
  }
}
```

**Other MCP clients** โ€” run `mm-mcp` as the stdio server command.

All 23 probes + 6 utilities + the `mm_verify` umbrella are exposed as MCP tools:  
`mm_verify` (full / group-filtered) ยท  
`mm_register` ยท `mm_verify_chain` ยท `mm_audit` ยท `mm_continuous_audit` ยท `mm_full_audit` ยท  
`mm_baseline_fairness` ยท `mm_gaming_check` ยท `mm_multiseed_check` ยท `mm_scope_check` ยท  
`mm_too_good_check` ยท `mm_power_check` ยท `mm_multiple_comparisons_check` ยท `mm_grim_check` ยท  
`mm_falsifiability_check` ยท `mm_cascade_check` ยท `mm_negative_audit` ยท  
`mm_judge_consistency_check` ยท `mm_judge_bias_check` ยท `mm_inter_rater_agreement` ยท  
`mm_judge_score_sanity` ยท `mm_judge_swap_check` ยท `mm_judge_transitivity_check` ยท  
`mm_ranking_stability_check` ยท  
`mm_anchor` ยท `mm_calibrate` ยท `mm_witness` ยท `mm_retract` ยท `mm_certificate` ยท `mm_badge`

---

## Real Catches โ€” Dog-Fooding

We ran Measurement Mirror on our own AI research before publishing. It caught:

```
๐Ÿชž Audit: ZERO "55.6% Best" claim
   ๐Ÿ”ด FAIL
   ๐Ÿ”ด [โ‘ฃa small-sample CI] n=9, acc=0.556 โ†’ 95%CI [0.267, 0.811] โŠƒ baseline(0.5)
       Statistically indistinguishable from chance.
   ๐Ÿ”ด [โ‘  pre-registration(min_n)] reported n=9 < registered min_n=200. Undersized.
   ๐Ÿ”ด [โ‘  pre-registration(metric swap)] reported 'best_of_9' โ‰  registered 'acc_full_balanced'
       Post-hoc metric swap detected. (seal=6c802655ab095e8b)

๐Ÿชž Audit: Field candidate 5 (control sim)
   ๐Ÿ”ด FAIL
   ๐Ÿ”ด [โ‘ก fair baseline] Field 0.996 โ‰ˆ GRU-ODE 0.998 (ฮ”+0.002 < 0.01). Tied โ€” no genuine advantage.
```

Run the demos yourself:

```bash
python examples/quickstart.py    # happy path: honest researcher
python examples/demo_zero.py     # ZERO 55.6% mirage (our own project killed)
python examples/demo_field.py    # Field candidate false positives
```

---

## Project Structure

```
measure-mirror/
โ”œโ”€โ”€ measure_mirror/
โ”‚   โ”œโ”€โ”€ mm.py              # verify() + 20 probes + CLI + DB lookup (zero deps)
โ”‚   โ”œโ”€โ”€ mcp_server.py      # MCP server โ€” 30 tools (pip install .[mcp])
โ”‚   โ”œโ”€โ”€ judge.py           # LLM-as-a-Judge runner (pip install .[judge])
โ”‚   โ””โ”€โ”€ pytest_plugin.py   # assert_clean() for CI gates
โ”œโ”€โ”€ docs/
โ”‚   โ”œโ”€โ”€ GUIDE.md           # full per-probe reference (English)
โ”‚   โ””โ”€โ”€ GUIDE_KO.md        # full per-probe reference (Korean)
โ”œโ”€โ”€ examples/
โ”‚   โ”œโ”€โ”€ quickstart.py      # happy path demo
โ”‚   โ”œโ”€โ”€ demo_zero.py       # ZERO false-positive (dog-food)
โ”‚   โ”œโ”€โ”€ demo_field.py      # Field false-positive (dog-food)
โ”‚   โ”œโ”€โ”€ demo_judge.py      # LLM-judge failure modes (no API key needed)
โ”‚   โ””โ”€โ”€ mcp_example.py     # MCP tool usage reference
โ”œโ”€โ”€ db/                    # local memory, split by who produced the record
โ”‚   โ”œโ”€โ”€ README.md              the measured/ vs curated/ distinction
โ”‚   โ”œโ”€โ”€ measured/             โ† measure-mirror's own output (quantitative)
โ”‚   โ”‚   โ”œโ”€โ”€ baselines.json         task-level fair baselines
โ”‚   โ”‚   โ””โ”€โ”€ reproductions.jsonl    reproductions (verdict auto-judged)
โ”‚   โ””โ”€โ”€ curated/              โ† human-curated (qualitative)
โ”‚       โ”œโ”€โ”€ self_catches.jsonl          false positives on ourselves
โ”‚       โ”œโ”€โ”€ false_negative_guards.jsonl false negatives re-checked
โ”‚       โ”œโ”€โ”€ gaming_patterns.json        gaming signatures
โ”‚       โ”œโ”€โ”€ contamination.jsonl         data leakage found
โ”‚       โ””โ”€โ”€ research_closures.jsonl     qualitative negative conclusions
โ””โ”€โ”€ tests/
    โ”œโ”€โ”€ test_mm.py         # 145 tests for core probes, CI-enforced
    โ”œโ”€โ”€ test_judge.py      # 17 tests for judge.py module
    โ””โ”€โ”€ test_sync.py       # sync gate: probe โ†” MCP โ†” tests โ†” README โ†” exports โ†” version
```

---

## Local Memory (`db/`)

`db/` is your **local memory of past audits** โ€” not a shared/crowd database.
We tried the "CVE / shared signature" framing and **for us it didn't earn its
keep** (this is a scoped observation, not a universal law): contributing would
mean publishing *your own research that was wrong* (`self_catches`) or *a peer's
failed reproduction* (`reproductions`), which runs into the trust โŠฅ reputation
dilemma. A team with the right incentives might sustain a shared DB; we didn't,
and the value below needs no sharing anyway.

The value that *does* hold needs no sharing at all: **warn future-you about a
pattern past-you already got burned by.** It works regardless of how private the
data is โ€” it never leaves your machine.

`db/` is split by **who produced the record**, so the two kinds are never
confused (see [`db/README.md`](db/README.md) for the full structure):

### `db/measured/` โ€” what measure-mirror produces (quantitative)

Verdicts computed by the tool itself; re-running on the same numbers reproduces
the same verdict exactly. These are wired into the audit loop and grow only via
`record_reproduction()`.

| File | How it's used |
|---|---|
| `measured/baselines.json` | `audit(task="musr")` auto-fetches the fair baseline |
| `measured/reproductions.jsonl` | `audit(task=...)` warns on a prior reproduction failure; `record_reproduction(...)` appends new ones (verdict auto-judged from Wilson CI) |

```python
# memory grows: you reproduce a claim and record the result
mm.record_reproduction("musr", claim="ZERO 55.6%", acc_claimed=0.556,
                       n_claimed=9, acc=0.385, n=1050, note="collapsed at scale")
# โ†’ verdict auto-judged FAIL, appended to db/measured/reproductions.jsonl

# later: any audit on the same task surfaces it automatically
mm.audit("ledger.jsonl", "new_claim", reported_metric="acc",
         reported_acc=0.62, n=120, task="musr")
# โš ๏ธ [โš™ prior-reproduction] task 'musr' has a prior reproduction failure:
#    'ZERO 55.6%' claimed 0.556 (n=9) โ†’ reproduced 0.385 (n=1050). collapsed at scale
```

### `db/curated/` โ€” what we wrote by hand (qualitative)

Our **catch log** and research closures โ€” human-curated, *not* the tool's
automatic output. Searchable via `catch_history()`, but **not** auto-wired into
`audit()` (matching is fuzzy text, not a clean `task` key, so auto-warning would
mean false positives).

| File | `kind` | Catch history of |
|---|---|---|
| `curated/self_catches.jsonl` | `self_catch` | false *positives* you flagged on yourself |
| `curated/false_negative_guards.jsonl` | `false_negative` | false *negatives* you re-checked |
| `curated/gaming_patterns.json` | `gaming` | gaming / mirage signatures you've seen |
| `curated/contamination.jsonl` | `contamination` | data leakage you found |
| `curated/research_closures.jsonl` | `closure` | qualitative negative conclusions (no `acc`/`n` โ€” **not** a tool verdict) |

```python
mm.catch_history(db_dir="db")                 # all curated records
mm.catch_history(kind="gaming", db_dir="db")  # the gaming-signature catalog
mm.catch_history(source="fm_cde_pixel_feasibility")  # records from one arc
```

**Why the split is honest**: calling `db/` as a whole "measure-mirror history"
would over-claim โ€” only `measured/` is that. Every `measured/` record was
cross-checked: feeding its `(acc, n)` back through the tool's own Wilson-CI logic
reproduces the recorded verdict with **0 mismatches**.

---

## Design Principles

- **Zero-dep core** โ€” pure Python stdlib. The optional `judge` module adds openai/anthropic for LLM-as-a-Judge; nothing else, nothing in the core.
- **Bidirectional** โ€” catches false *positives* **and** false *negatives*. Premature negative closures are also illusions.
- **Tamper-evident pre-registration** โ€” SHA-256 seal on first write. Re-registration is silently ignored. Ledger tampering is detected on every audit.
- **Independent probes** โ€” each check is a standalone function. Add new ones without touching existing code.
- **Adversarial by default** โ€” "too good to be true" is flagged before you believe it.

---

## Contributing

New probes and baseline contributions are welcome.

1. Fork โ†’ branch โ†’ PR
2. **`db/baselines.json`**: shareable task baselines (not your private failures)
3. **New probes**: add the function to `mm.py` + tests in `tests/test_mm.py`
4. CI must stay green: `pytest tests/`

---

## License

[Apache 2.0](LICENSE)

---

**[ํ•œ๊ตญ์–ด README โ†’](README_KO.md)**