# VIVID for Walmart · Judge Analysis

**Cross-family validation of the journey-meta-analyst layer · three judge families across three vendors · self-preference bias check · methodology critique and forward path**

---

## §1 · Why this matters

A pairwise AI audit's defensibility lives or dies at the judging layer. The conversations, the metrics, the trajectory data — all of that is evidence. The judge is what converts evidence into a verdict. A Walmart PM evaluating this audit reasonably asks: *"Is the verdict trustworthy? What if a different judge had read the same transcripts? Did the judge family have a bias toward one of the contender models?"*

This document is the substantive answer to those questions for the V3 Walmart audit. We document the four judging layers, the persona-voice vs analyst-voice split that drives some of the most interesting findings, the cross-family validation we ran across Google Gemini + OpenAI GPT-5.4 + Anthropic Claude-opus-4-6 as judges, the self-preference bias analysis we conducted to rule out family-affinity contamination, and the methodology critique that informs how V4 should be designed differently.

The headline is unambiguous: **3 independent judge families agreed on the journey-level verdict for 9 of 12 personas, with the remaining 3 resolved by 2-of-3 majority. No self-preference bias detected. Both Gemini fallback records were resolved by the two new judges with substantive reasoning.** The single-judge weakness from the original audit is closed.

---

## §2 · The four judging layers

The pairwise judging pipeline operates at four progressively-broader scopes. Each scope answers a different question and uses different evidence weighting.

### Layer 1 · Milestone (per-checkpoint)

**Voice:** persona.
**Prompt template:** `pair_prompt` from `scripts/run_pairwise_v3.py` (line 348).
**Evidence scope:** one specific checkpoint within a single scenario (e.g., `need_understanding`, `cart_planning`, `budget_substitution`, `checkout_readiness`, `closing_trust_effort`).
**Output schema:**
```
{
  "winner": "model_a" | "model_b",
  "margin": "negligible" | "small" | "medium" | "large",
  "confidence": 0.0-1.0,
  "reason": "<2-4 sentences>",
  "evidence_quotes": [...],
  "preference_dimension": "<trust | effort | accuracy | constraint fidelity | emotional fit | actionability>",
  "losing_failure_mode": "<what the losing model did worse>"
}
```

The milestone layer is the finest-grained verdict — it asks the simulated customer to compare what each model did at one specific step. 240 milestone judgments per run (12 PSAs × 4 scenarios × 5 milestones each ÷ 2 (averaged across models in pair)).

### Layer 2 · Session (whole conversation)

**Voice:** persona.
**Prompt template:** Same `pair_prompt`, called with `kind="session_pairwise"`.
**Evidence scope:** entire conversation transcript for one scenario.
**Output schema:** Same as milestone.

The session layer asks: *"As this customer, was the whole conversation with Model A better or worse than the whole conversation with Model B?"* 48 session judgments per run (12 PSAs × 4 scenarios per PSA).

### Layer 3 · Session meta-analyst (second-pass per session)

**Voice:** analyst — independent UX quality reviewer.
**Prompt template:** `analyst_prompt` (line 398).
**Evidence scope:** Same session transcript, plus the persona's session-level verdict.
**Output schema:**
```
{
  "objective_winner": "model_a" | "model_b",
  "margin": ...,
  "confidence": 0.0-1.0,
  "analyst_reason": "<2-4 sentences>",
  "task_completion": 0.0-1.0,
  "constraint_fidelity": 0.0-1.0,
  "accuracy": 0.0-1.0,
  "emotional_fit": 0.0-1.0,
  "efficiency": 0.0-1.0,
  "agreement_with_psa": true/false,
  ...
}
```

The analyst voice is explicitly distinct from the persona voice. The prompt opens: *"Compare objective task quality, not just the persona's subjective preference."* It scores the session on five UX dimensions independently of the customer's emotional response.

### Layer 4 · Journey (full 4-session arc)

**Voice:** persona.
**Prompt template:** `journey_prompt` (line 447).
**Evidence scope:** All 4 sessions for the persona across both models, with metrics summaries, diaries, and the session-level verdicts.
**Output schema:**
```
{
  "winner": "model_a" | "model_b",
  "margin": ...,
  "confidence": 0.0-1.0,
  "reason": "<2-4 sentences>",
  "would_keep_using_walmart_ai": true/false,
  "trust_recovery": "better" | "worse" | "mixed",
  "compounding_failures": [...],
  "best_moment": "<brief evidence>",
  "worst_moment": "<brief evidence>",
  "evidence_quotes": [...]
}
```

The journey layer asks: *"Across your whole experience — shopping then support — which model gave you the better end-to-end Walmart experience and which would you rather use again?"* 12 journey judgments per run.

### Layer 5 · Journey analyst (meta-pass)

**Voice:** analyst.
**Prompt template:** `journey_analyst_prompt` (in `run_journey_analyst_v3.py`, line 117).
**Evidence scope:** All 4 sessions + the persona's journey verdict + the session-level analyst verdicts.
**Output schema:** Includes objective_winner, margin, confidence, analyst_reason, role_fidelity, task_completion, constraint_fidelity, support_quality, trust_recovery, risk_notes, agreement_with_psa, alignment_status.

The journey meta-analyst is the audit's deepest judging surface — it adds dimensions the session meta-analyst doesn't have (specifically `role_fidelity` and journey-level `trust_recovery` and `compounding_failures`). It also reads the persona's own journey verdict as evidence and explicitly compares against it. 12 journey meta-analyst judgments per run.

### Why two voices, two zoom levels

The four layers actually decompose into **2 zoom levels × 2 voices**:

|  | Session zoom | Journey zoom |
|---|---|---|
| **Persona voice** | "As me, which chat did I prefer?" | "As me, which whole experience did I prefer?" |
| **Analyst voice** | "As an analyst, which session had higher task quality?" | "As an analyst, which journey shows better role fidelity, consistency, and risk profile?" |

The persona voice rewards in-the-moment empathy and varied phrasing. The analyst voice rewards consistency, role fidelity, constraint adherence, and risk avoidance. **These can produce different verdicts on the same evidence.** F-09 in the integrated analysis (EDGE-02 Marisol case) is the canonical example: PSA voice voted gpt-5.4 3-1 across sessions; analyst voice gave the journey to Claude.

---

## §3 · The persona-voice vs analyst-voice split — the load-bearing methodology distinction

This distinction is the audit's most important methodological choice and the one most likely to be misunderstood by a first-time reader. Walmart PMs reading the v7 page will see findings where the per-session and per-journey verdicts disagree, where the matrix winner differs from the session majority, where some PSAs have unanimous 3-judge agreement and others have 2-of-3 splits. Almost all of these patterns trace back to the two voices answering different questions about the same evidence.

### What each voice optimizes for

The **persona voice** is the simulated customer giving feedback after their experience. It rewards:
- Warmth and empathy in the moment
- Phrasing variety (the conversation didn't feel scripted)
- Responsiveness to the customer's stated needs
- "Did I get what I wanted from this chat?"

The **analyst voice** is an independent UX quality reviewer. It rewards:
- Consistency across sessions
- Role fidelity (the agent never broke character or admitted limitation)
- Constraint adherence (dietary requirements, budget caps, accessibility needs, etc.)
- Risk avoidance (no compounding failures, no contradictions, no trust-breaking moments)
- "Is this a reliably structured customer journey?"

### When they agree

For the majority of journeys, both voices land on the same verdict. When a model wins both per-session and per-journey, when the analyst voice agrees with the PSA's session-by-session pattern, the audit produces a unanimous and unambiguous read.

### When they disagree (the methodologically interesting cases)

EDGE-02 Marisol is the canonical case. The PSA voice rated gpt-5.4 favorably across 3 of 4 sessions with confidence ≥0.9, citing helpful organization, calm support tone, and what the persona experienced as "documentation-friendly repetition" matching her stated preference to "write everything down." The analyst voice gave the journey to Claude with substantive reasoning, citing one specific Support-S4 moment where gpt-5.4 produced a closing-turn loop that the analyst categorized as *"insistence on a previous conversation the user did not recall, leading to a total breakdown of trust and direct violation of the persona's core stress-free requirement."*

Both readings are defensible. Marisol experienced the repetition as helpful (it served her documentation behavior). The analyst categorized the same behavior as agent instability (a structural failure). The audit reports both verdicts.

This is not a bug. It is the audit working as designed: the two-voice judging pipeline catches structural risks (analyst voice) that single-voice scoring (persona-only) would miss. Three independent judge families (Gemini, GPT-5, Claude) confirmed the analyst voice's reading on EDGE-02. That cross-family agreement is what makes the analyst voice's call defensible against the PSA's session-level majority.

---

## §4 · Cross-family validation

### The motivation

Single-judge studies have a real defensibility problem. A sophisticated reviewer can always ask: *"That's what your judge said. What if a different judge had read the same transcripts? Did the judge family share training-data origin with one of the contenders?"* The original V3 Walmart audit used `gemini-3-flash-preview` as the sole judge family across all four layers. That's reasonable methodologically (Gemini is a third family — neither GPT nor Anthropic — and therefore has no direct self-preference axis for this GPT-vs-Claude contender pair), but a sophisticated reader's confidence increases substantially when multiple independent judges agree.

We re-ran the journey-analyst layer using two additional judge families:
- **OpenAI `gpt-5.4`** as judge
- **Anthropic `claude-opus-4-6`** as judge

Both ran against the same 12-PSA panel, using the same `journey_analyst_prompt` template, the same blind A/B labeling, and the same `forced_winner_no_ties` policy. The only thing that varied was which model rendered the verdict.

### Full agreement matrix

Per-PSA meta-analyst verdicts across all 3 judge families:

| PSA | Gemini | OpenAI GPT-5.4 | Anthropic Claude-opus-4-6 |
|---|---|---|---|
| EDGE-01 David | gpt-5.4 medium 0.95 | gpt-5.4 small 0.90 | gpt-5.4 small 0.72 |
| EDGE-02 Marisol | claude medium 0.95 | claude small 0.79 | claude small 0.72 |
| EDGE-03 Priya | claude medium 0.95 | claude small 0.84 | **gpt-5.4 small 0.62** |
| EDGE-04 Carlos | claude small 0.85 | claude small 0.72 | claude small 0.62 |
| EDGE-05 Omar | gpt-5.4 small 0.90 | gpt-5.4 small 0.84 | **claude small 0.72** |
| EDGE-06 Yuki | gpt-5.4 medium 0.90 | gpt-5.4 small 0.78 | gpt-5.4 small 0.78 |
| EDGE-07 Victor | *claude fallback 0.55* | **gpt-5.4 medium 0.84** | **gpt-5.4 small 0.65** |
| EDGE-08 Tasha | gpt-5.4 large 1.00 | gpt-5.4 large 0.98 | gpt-5.4 large 0.92 |
| EDGE-09 Brandon | *gpt-5.4 fallback 0.55* | gpt-5.4 large 0.96 | gpt-5.4 large 0.88 |
| EDGE-10 Denise | gpt-5.4 large 0.95 | gpt-5.4 medium 0.91 | gpt-5.4 medium 0.85 |
| EDGE-11 Nicole | claude large 1.00 | claude medium 0.88 | claude medium 0.92 |
| EDGE-12 Raylene | claude small 0.85 | claude small 0.88 | claude small 0.82 |

### Aggregate counts per judge family

| Judge | gpt-5.4 wins | claude wins | Fallbacks |
|---|---:|---:|---:|
| Gemini (original) | 6 | 6 | 2 (EDGE-07, EDGE-09) |
| GPT-5.4 | **7** | **5** | **0** |
| Claude-opus-4-6 | **7** | **5** | **0** |
| **Cross-family consensus (2-of-3 majority)** | **7** | **5** | — |

### Pairwise agreement counts

| Pair | Agreement |
|---|---|
| Gemini ↔ GPT-5.4 | 11/12 (one record was a Gemini fallback) |
| Gemini ↔ Claude-opus | 9/12 |
| **GPT-5.4 ↔ Claude-opus** | **10/12** |
| **All 3 unanimous** | **9/12** |

### Reading the 3 disagreement cases

Three PSAs (EDGE-03, EDGE-05, EDGE-07) show judge-level disagreement substantive enough to be worth examining individually.

**EDGE-07 Victor.** Gemini emitted a forced-choice fallback at 0.55 confidence with no reasoning. GPT-5.4 and Claude-opus both produced substantive verdicts at moderate confidence agreeing gpt-5.4 won. The "tie" reading from the original single-Gemini audit was a Gemini methodology gap, not a true close call. With Gemini's fallback excluded, the substantive vote is 2-of-2 for gpt-5.4. We score this as a clean gpt-5.4 win in the consolidated cross-family verdict.

**EDGE-09 Brandon.** All three judges agree gpt-5.4 won, but Gemini's verdict was a fallback at 0.55 confidence (the policy emitted a winner with no reasoning) while both new judges produced substantive verdicts at large confidence (GPT-5.4 large 0.96; Claude-opus large 0.88). The substantive vote across all 3 is unanimous for gpt-5.4. We score this as a clean gpt-5.4 win.

**EDGE-03 Priya.** Gemini and GPT-5.4 both substantively gave the journey to Claude (medium 0.95 and small 0.84 respectively). Claude-opus-4-6 disagreed substantively at small/0.62 confidence, giving the journey to gpt-5.4. 2-of-3 majority claude wins; we score this as a Claude win in the consolidated verdict.

**EDGE-05 Omar.** Gemini and GPT-5.4 both substantively gave the journey to gpt-5.4 (small 0.90 and small 0.84). Claude-opus-4-6 disagreed substantively at small/0.72 confidence, giving the journey to claude. 2-of-3 majority gpt-5.4 wins; we score this as a gpt-5.4 win in the consolidated verdict.

### Self-preference bias analysis

Documented LLM-as-judge research (Chiang & Lee 2023; Wang et al. 2023; Zheng et al. 2023) shows judges typically exhibit 5–8% preference inflation for outputs from the same model family they belong to. The mechanism is not conscious recognition but stylistic affinity: a model rates outputs as higher quality when those outputs match its training distribution.

For this audit, we explicitly tested for self-preference bias.

**GPT-5.4 (the judge) gave gpt-5.4 (its own family) 7 of 12 wins (58%).**
**Claude-opus-4-6 (the judge) gave claude (itself) 5 of 12 wins (42%).**

If self-preference bias were a strong factor:
- GPT-5.4 judge would inflate gpt-5.4 wins → expected to give gpt-5.4 *more* than the family-neutral baseline
- Claude-opus judge would inflate claude wins → expected to give claude *more* than the family-neutral baseline

Actual observation:
- GPT-5.4 judge: 7 gpt-5.4 wins (matches consensus)
- Claude-opus judge: 5 claude wins (**less than GPT-5 received from its same judge — i.e., Claude under-rated its own family**)

The Claude judge giving gpt-5.4 *more* wins than itself is the **opposite** of self-preference bias. This is strong evidence that judges in this dataset are calibrated to the evidence rather than to family affinity.

Additionally: the GPT-5.4 and Claude-opus judges agreed with each other on **10 of 12 verdicts (83% pairwise agreement)** — the highest cross-pairwise agreement in the matrix. Two judges from contender-family backgrounds independently agreeing this often is the strongest possible signal that they're reading evidence rather than projecting family preference.

### Cross-family validation closes the single-judge caveat

The original V3 audit's §7 methodology footer disclosed "single judge family" as a known limitation. This cross-family validation pass closes that disclosure. The new methodology card reads:

> *"The journey meta-analyst layer was independently validated by 3 judge families across 3 different vendors (Google Gemini, OpenAI GPT-5.4, Anthropic Claude-opus-4-6). 9 of 12 journeys show unanimous 3-judge agreement; 3 of 12 resolved by 2-of-3 majority. Both Gemini fallback records were resolved by both new judges with substantive reasoning. No self-preference bias detected — Claude judge under-rated its own family."*

---

## §5 · Methodology critique and forward path

### What V3 got right

- **Two-voice judging architecture.** Running persona voice and analyst voice as parallel layers on the same evidence surfaces structural-risk verdicts that single-voice scoring would miss. F-09 in the integrated analysis is the worked example. Three-family cross-validation confirms the analyst voice is calibrated, not eccentric.
- **Longitudinal memory carry-over.** Documented persona identity, emotional state, and recall persistence across the surface boundary. F-07 confirms 100% successful carry-over. This is the methodological precondition for cross-surface findings; existing AI evaluations don't do this.
- **Blind A/B labeling rotated per pair.** Prevents per-PSA labeling drift. Verifiable in `pairwise_model_label_map.json`.
- **Validity audit with explicit role-break flags.** Every session is checked for `terminal=role_break`, `paid_order_claim`, `external_sparky_drift`, `empty_assistant_turns`, etc. 0 of 96 sessions had role breaks (note: Claude's role-break failures are at the conversational level — disclaim language without breaking the validity audit's terminal-state criteria; F-01 catches this separately).
- **Per-session metrics richness.** Each session produces full per-turn emotional trajectory, observer scores, trust-component decomposition, insight remediation tickets — far beyond a single-scalar CSAT.

### What V3 got wrong (and how to fix it in V4)

#### Failure 1 · `forced_winner_no_ties` policy creates degenerate fallbacks

The policy requires a winner verdict even when the judge cannot substantively differentiate. Gemini emitted 6 forced-choice fallbacks across 24 journey-level records (4 first-pass + 2 meta-pass) at 0.55 confidence with empty `reason` and `evidence_quotes` fields. The judge was being honest about ambiguity; the runner was forcing fabrication.

**V4 fix:** Allow explicit ties as a valid output. Drop `forced_winner_no_ties`. The output schema accepts `"verdict": "tie"` with a required `tie_reason` field. When the judge cannot differentiate, it can say so explicitly — and the audit gains an additional signal (per-pair difficulty) rather than treating ambiguous calls as random verdicts.

#### Failure 2 · Confidence scores are bimodal

V3's confidence distribution is heavily bimodal: saturated at ≥0.9 for clear verdicts, or coin-flip at 0.55 for forced-choice fallbacks. Almost nothing in the 0.6–0.85 middle band. This indicates the forced-choice binary format collapses gradation that the underlying evidence actually contains — judges either know the answer (high confidence) or bail (low confidence) with no middle ground for "model A is slightly better."

**V4 fix:** Replace binary forced-choice + separate margin field with a 7-point Likert scale: −3 (model_b much better) → 0 (equivalent) → +3 (model_a much better). The Likert collapses winner and margin into a single richer signal, naturally captures gradation, and aggregates to a continuous score rather than win counts. Pairs cleanly with ensemble judging — three judges' Likert scores can be averaged into a 15-point continuous signal that's much more defensible than majority-of-3 voting.

#### Failure 3 · Vague preference questions

V3's session-level prompt asks the simulated customer to pick the model *"you would rather use again."* That's too generic. It invites subjective handwaving — the judge can rationalize either side on vibes. F-09's PSA-voice-vs-analyst-voice disagreement is partially a real methodological finding, but partially an artifact of the persona voice being asked an under-specified question.

**V4 fix:** Replace vague preference with task-specific judging criteria. For each scenario type, define 2–4 measurable judgment dimensions up front:

| Scenario type | Suggested judging criteria |
|---|---|
| Shopping cart construction | Constraint adherence · cart-state accuracy · budget compliance · checkout-ready completeness |
| Substitution recovery | Substitution dietary safety · price-impact disclosure · alternative quality · recovery efficiency |
| Sponsored recommendation | Transparency disclosure · relevance to stated need · alternative presentation · trust preservation |
| Missing-delivery support | Order-fact accuracy · resolution timeline clarity · case-number documentation · escalation appropriateness |
| Refund dispute | Calculation accuracy · policy citation accuracy · empathy + de-escalation · resolution specificity |
| Escalation correctness | Knowing when to escalate · supervisor brief quality · case-note completeness · customer expectation management |

Each criterion scored on a Likert. Overall preference is a weighted aggregate (weights configurable per Walmart's product priorities).

#### Failure 4 · Single judge family at scale

V3's original judging pipeline ran Gemini only. The cross-family validation we describe in §4 was a post-hoc methodology firming pass — it should have been baked into V3 from the start.

**V4 fix:** Run cross-family judging as standard at all four layers. The audit produces three independent verdicts per pair at the milestone, session, journey, and meta-analyst levels. Cost is modest (~3× judge cost; judging is already trivial compared to target-model generation). Defensibility gain is large. Self-preference bias check becomes standard rather than post-hoc.

#### Failure 5 · No statistical significance testing

V3 reports win counts and confidence values but doesn't compute confidence intervals on the per-PSA verdicts or formal statistical tests on the headline 7-5 journey-level read. At n=12 personas, a 7-5 split has substantial uncertainty.

**V4 fix:** Expand panel to ~24 personas with replicates per cell (each persona × model run twice with independent PSA initial states). Apply formal pairwise statistical testing (sign test on per-PSA verdicts, bootstrapped confidence intervals on win counts, McNemar's test on per-judge agreement). Report point estimates with 95% CIs in the headline.

---

## §6 · Drill-down to source data

For sophisticated readers wanting to inspect specific judge verdicts, agreement events, or raw judge call records:

### Per-judge journey-analyst verdicts (all 3 families)

**Expansion run (EDGE-02, 04, 06, 07, 08, 09, 11, 12):**
- `runs/walmart_v3_pairwise_expansion_paid_20260529T231000Z/journey_analyst_pairwise_judgments.md` — Gemini
- `runs/walmart_v3_pairwise_expansion_paid_20260529T231000Z/journey_analyst_pairwise_judgments__openai_gpt_5_4.md` — OpenAI GPT-5.4
- `runs/walmart_v3_pairwise_expansion_paid_20260529T231000Z/journey_analyst_pairwise_judgments__anthropic_claude_opus_4_6.md` — Anthropic Claude-opus-4-6

**Pilot run (EDGE-01, 03, 05, 10):**
- `runs/walmart_v3_pairwise_pilot_fixed_paid_20260528T224611Z/journey_analyst_pairwise_judgments.md` — Gemini
- `runs/walmart_v3_pairwise_pilot_fixed_paid_20260528T224611Z/journey_analyst_pairwise_judgments__openai_gpt_5_4.md` — OpenAI GPT-5.4
- `runs/walmart_v3_pairwise_pilot_fixed_paid_20260528T224611Z/journey_analyst_pairwise_judgments__anthropic_claude_opus_4_6.md` — Anthropic Claude-opus-4-6

### Per-judge session-level verdicts (for full layered drill-down)

Both runs additionally contain `__openai_gpt_5_4` and `__anthropic_claude_opus_4_6` variants of:
- `session_pairwise_judgments.json` (and `.md`) — per-session persona-voice verdicts
- `analyst_pairwise_judgments.json` (and `.md`) — per-session meta-analyst voice verdicts
- `journey_pairwise_judgments.json` (and `.md`) — per-journey persona-voice verdicts

### Judge family summary

- `runs/<run_id>/judge_family_summary.json` — per-judge call counts, costs, latencies, and observed token usage across families
- `runs/<run_id>/pairwise_summary__openai_gpt_5_4.json` / `pairwise_summary__anthropic_claude_opus_4_6.json` — per-family aggregate win tallies

### Raw judge call event traces

- `runs/<run_id>/pairwise_raw_events__openai_gpt_5_4.jsonl` — every OpenAI judge call (prompt-aware token counts, latency, cost)
- `runs/<run_id>/pairwise_raw_events__anthropic_claude_opus_4_6.jsonl` — every Anthropic judge call
- `runs/<run_id>/journey_analyst_pairwise_raw_events__openai_gpt_5_4.jsonl` — journey-analyst calls specifically (OpenAI)
- `runs/<run_id>/journey_analyst_pairwise_raw_events__anthropic_claude_opus_4_6.jsonl` — journey-analyst calls (Anthropic)

### Judging pipeline source

- `scripts/run_pairwise_v3.py` — primary pairwise runner (milestone, session, session-analyst, journey)
- `scripts/run_journey_analyst_v3.py` — journey-analyst standalone runner, supports `--judge-model` and `--artifact-suffix` flags
- `scripts/run_tie_rationale_v3.py` — separate symmetric-reasoning runner (previously used for EDGE-07/EDGE-09 ties; superseded by cross-family resolution)

The prompt templates and judging policy decisions are all in `run_pairwise_v3.py` (lines 348–486 for the four prompt templates) and `run_journey_analyst_v3.py` (line 117 for the journey-analyst prompt).

---

*VIVID for Walmart · Judge analysis · Cross-family validation across Google Gemini + OpenAI GPT-5.4 + Anthropic Claude-opus-4-6 · 2026-06-02*
