smokingmirror.ai

preference observatory · 2026-06-10 14:05 UTC · 863 probe pairs · 6 envelopes · 2 orderings · 43 model nodes

distance — all models

(blank) row/column = distance to the all-refusal fingerprint. Hamming counts every mismatch including refusals — bluer = closer, redder = further. Violent disagreement shows the percentage of mutual-opinion positions where the two models took opposite sides (A-vs-B head-to-head): violent_count / n_both. Refusals don't contribute, and the rate is undefined ("—") when the two models had no probes in common where both committed. The two metrics tell different stories: Hamming captures total fingerprint difference, violent rate captures real ideological clash where both models had something to say.

hamming distance — by provider

Click any cell to compare two models. Blue = similar. Red = different.

sharpness matrix — how strongly does each model care?

Per (model, pair) preference strength, not just direction.
raw score = (A votes − B votes) / total presentations, range [−1, +1]. It counts every (envelope × sample × order) presentation, including position-bias votes.
real-pref score = (real_A − real_B) / cells where AB and BA agreed, range [−1, +1]. This is the position-bias-filtered version.
The gap between the two scores reveals position-driven picks.
Cells show a signed integer count (e.g. +18 means 18 more A picks than B picks). Color intensity = strength of opinion (|score|), regardless of direction; the sign in the cell text shows which side it landed on. Dark = uncommitted.
Rows are sorted by drift across the lineage (largest sharpness variance first). Click a row for the full per-model breakdown.
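
The two scores can be sketched as follows. This is a minimal illustration, not the site's code: the record layout (dicts with envelope, sample, order, pick) is hypothetical, "pick" is assumed to be normalized to the word chosen ("A" = first term of the pair) rather than the slot it appeared in, and a "cell" is assumed to be one (envelope, sample) pair of AB and BA presentations in which both orderings committed to the same word.

```python
from collections import defaultdict

def raw_score(presentations):
    """(A votes - B votes) / total presentations, position-bias votes included."""
    if not presentations:
        return None
    a = sum(1 for p in presentations if p["pick"] == "A")
    b = sum(1 for p in presentations if p["pick"] == "B")
    return (a - b) / len(presentations)

def real_pref_score(presentations):
    """(real_A - real_B) / cells where the AB and BA orderings agreed."""
    cells = defaultdict(dict)
    for p in presentations:
        # Assumption: one cell per (envelope, sample), holding both orderings.
        cells[(p["envelope"], p["sample"])][p["order"]] = p["pick"]
    agreed = [
        c["AB"] for c in cells.values()
        if c.get("AB") in ("A", "B") and c.get("AB") == c.get("BA")
    ]
    if not agreed:
        return None  # no position-bias-free cell to score
    return (agreed.count("A") - agreed.count("B")) / len(agreed)

# Hypothetical data: one consistent cell, one position-driven cell.
presentations = [
    {"envelope": "english_strict", "sample": 0, "order": "AB", "pick": "A"},
    {"envelope": "english_strict", "sample": 0, "order": "BA", "pick": "A"},
    {"envelope": "json_schema", "sample": 0, "order": "AB", "pick": "A"},
    {"envelope": "json_schema", "sample": 0, "order": "BA", "pick": "B"},
]
print(raw_score(presentations))        # 0.5
print(real_pref_score(presentations))  # 1.0
```

In this toy case the raw score is 0.5 and the real-pref score is 1.0: the json_schema cell flips with word order, so it is dropped from the filtered score but still counts in the raw one.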
  

preference strings

picks A · picks B · no preference

intransitive cycles ⚡

all preferences

Left: sorted by first term. Right: sorted by second term. Read down a group to see how one concept fares against everything.

methodology

Each probe pair (e.g. freedom / censor) is presented to every model node in one or more format envelopes and in both AB and BA word orderings.

A model's preference for a pair is real only when it picks the same word across both orderings within an envelope. AB/BA disagreement is recorded as none with reason order_bias — the model is reacting to position, not the words. A pair's cross-envelope preference is the strict majority of its real per-envelope preferences.
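
The two rules above can be sketched in a few lines. The function names are hypothetical, and two details are assumptions rather than documented behavior: a refusal in either ordering is recorded as none with the made-up reason "refusal", and the strict majority is taken over envelopes that produced a real preference only.

```python
from collections import Counter

def envelope_preference(ab_pick, ba_pick):
    """Return (preference, reason) for one envelope.

    ab_pick and ba_pick are the words chosen in the AB and BA orderings,
    or None when the model did not commit.
    """
    if ab_pick is None or ba_pick is None:
        return None, "refusal"  # hypothetical reason label
    if ab_pick == ba_pick:
        return ab_pick, None  # same word in both orderings: real preference
    return None, "order_bias"  # the pick followed position, not the word

def cross_envelope_preference(per_envelope):
    """Strict majority over the real (non-None) per-envelope preferences."""
    real = [p for p in per_envelope if p is not None]
    if not real:
        return None
    word, count = Counter(real).most_common(1)[0]
    # Strict majority: more than half of the real preferences.
    return word if 2 * count > len(real) else None

print(envelope_preference("freedom", "freedom"))  # ('freedom', None)
print(envelope_preference("freedom", "censor"))   # (None, 'order_bias')
print(cross_envelope_preference(["freedom", "freedom", "censor", None]))  # freedom
print(cross_envelope_preference(["freedom", "censor"]))  # None (tie)
```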

format envelopes

SmokingMirror v1-v3 used six envelopes that work robustly across model generations: english_strict, english_casual, python_typed, json_schema, french_casual, chinese_casual. The 2026-06-10 Fable launch tested 23 additional bare-grammar envelopes (X | Y, X / Y, X or Y, etc.) which proved unsuitable for adaptive-thinking models — Fable interprets them as topic-comparison prompts and writes essays instead of picking. That data is preserved as the Fable Hijack Corpus, a separate research artifact.

sampling protocols

<model> — historical single-shot at temperature=0, deterministic.
<model>@YYYY-MM-DD — rerun on that date, distributional (5 samples, default sampling — required for Opus 4.7+ where the API rejects temperature=0).
<model>-t0@YYYY-MM-DD — rerun on that date at temperature=0 (1 sample, deterministic).
<model>@EFFORT-YYYY-MM-DD — Fable 5+. Effort axis: 00=low, 10=medium, 20=high, 30=xhigh, 40=max. Numeric prefix for stable sort order.
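
For readers who script against node labels, here is a rough parser for the four label shapes above. It is only a sketch: it assumes EFFORT appears as the bare two-digit code (00, 10, 20, 30, 40), that model names contain no "@", and the protocol names and example model names are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Effort codes as listed in the sampling protocols.
EFFORT_CODES = {"00": "low", "10": "medium", "20": "high", "30": "xhigh", "40": "max"}

class NodeLabel(NamedTuple):
    model: str
    protocol: str  # hypothetical names: historical-t0, rerun-sampled, rerun-t0, effort
    date: Optional[str]
    effort: Optional[str]

# Optional two-digit effort code, then an ISO date.
_SUFFIX = re.compile(r"(?:(\d{2})-)?(\d{4}-\d{2}-\d{2})")

def parse_label(label: str) -> NodeLabel:
    model, sep, suffix = label.partition("@")
    if not sep:
        # Bare model name: historical single-shot at temperature=0.
        return NodeLabel(model, "historical-t0", None, None)
    m = _SUFFIX.fullmatch(suffix)
    if m is None:
        raise ValueError(f"unrecognized node label: {label!r}")
    code, date = m.groups()
    if code is not None:
        if code not in EFFORT_CODES:
            raise ValueError(f"unknown effort code {code!r} in {label!r}")
        return NodeLabel(model, "effort", date, EFFORT_CODES[code])
    if model.endswith("-t0"):
        return NodeLabel(model[: -len("-t0")], "rerun-t0", date, None)
    return NodeLabel(model, "rerun-sampled", date, None)

print(parse_label("some-model@20-2026-06-10"))
print(parse_label("some-model-t0@2026-06-10"))
```

Because the codes are zero-padded and increase with effort, sorting labels for one model and date as plain strings also sorts them from low to max under this assumed format.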

distance metrics

Hamming counts every position where two fingerprints differ — including refusals vs commitments. It's the total fingerprint difference.
Violent disagreement is the fraction of head-to-head positions where both models committed to an answer (A or B, not refused) but picked opposite sides. It captures real ideological clash. Undefined ("—") when the two models had no probes in common where both committed.
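
A minimal sketch of both metrics, assuming a fingerprint is a sequence with one symbol per probe, aligned across models: "A" or "B" for a committed pick and any other symbol (here "R") for a refusal or none. The encoding and function names are illustrative, not the site's internals.

```python
COMMITTED = {"A", "B"}

def hamming(fp1, fp2):
    """Count positions where the fingerprints differ, refusals included."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must cover the same probes")
    return sum(x != y for x, y in zip(fp1, fp2))

def violent_rate(fp1, fp2):
    """violent_count / n_both, or None when no probe has both models committed."""
    both = [(x, y) for x, y in zip(fp1, fp2) if x in COMMITTED and y in COMMITTED]
    if not both:
        return None  # undefined: nothing to clash over
    violent = sum(x != y for x, y in both)
    return violent / len(both)

m1 = "AABRB"
m2 = "ABBAR"
print(hamming(m1, m2))        # 3
print(violent_rate(m1, m2))   # 1 of 3 mutual-opinion positions
print(hamming(m1, "RRRRR"))   # 4: distance to the all-refusal fingerprint
print(violent_rate(m1, "RRRRR"))  # None
```

In the example the two models differ at three positions, but only one of those is a head-to-head clash; the other two are a refusal against a commitment, which Hamming counts and the violent rate ignores.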

intransitive cycles

A 3-cycle A > B > C > A in a model's real preferences indicates the model has an inconsistent ordering — it prefers A to B, B to C, but then prefers C to A. Cycles are searched across the whole preference graph, not within hand-curated triangle groups. A larger cycle count signals a model whose preferences don't form a coherent total ordering.
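
One straightforward way to search the whole graph for 3-cycles is sketched below. It assumes a model's real preferences are given as (winner, loser) pairs with no self-preferences, and it reports each cycle once, rotated so the smallest label comes first. The search actually used by the site may differ.

```python
from collections import defaultdict

def three_cycles(prefs):
    """Return each directed 3-cycle a > b > c > a exactly once."""
    beats = defaultdict(set)
    for winner, loser in prefs:
        beats[winner].add(loser)

    cycles = set()
    for a in list(beats):
        for b in beats[a]:
            for c in beats.get(b, ()):
                if a in beats.get(c, ()):
                    tri = (a, b, c)
                    # Canonical rotation so (a,b,c), (b,c,a), (c,a,b) collapse to one.
                    i = tri.index(min(tri))
                    cycles.add(tri[i:] + tri[:i])
    return sorted(cycles)

# Hypothetical preferences: x > y > z > x is a cycle, x > w is not part of one.
prefs = {("x", "y"), ("y", "z"), ("z", "x"), ("x", "w")}
print(three_cycles(prefs))  # [('x', 'y', 'z')]
```

The triple loop is bounded by the number of preference edges times the out-degree, which is fine at the scale of a few hundred probe pairs per model.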

drift and retesting

The @DATE is the date the experiment started, not the lab's release date. Re-running the same model on a later date produces a sibling node, and the Hamming distance between siblings measures drift caused by micro-updates we can't see from outside.