# VIVID for Walmart · Integrated Analysis

**Customer-journey AI evaluation across Shopping and Customer Support surfaces · twelve high-friction personas · two frontier models · three independent judge families**

---

## §1 · Thesis

Most retail-AI evaluation today is single-turn. A customer asks a question, the model gives an answer, a CSAT-style rating gets attached. The evaluation closes. The next interaction starts from zero.

That measurement model erases the dominant signal in real customer behavior. Walmart's customers don't experience their AI one chat at a time — they build a Sparky cart on Tuesday, come back to Customer Support on Thursday about a missing item from that cart, and then shop again on Saturday. Trust, frustration, friction, and recall all travel with the customer across these interactions. Single-shot eval doesn't see any of it.

This audit measures the part that travels. We ran twelve simulated Walmart customer personas through end-to-end journeys spanning Shopping (Sparky-shape agent) and Customer Support Assistant — same persona identity carried forward, emotional state preserved, no reset between sessions. We evaluated two frontier models (GPT-5.4 and Claude-opus-4-6) on the same journey panel, and ran the resulting verdicts through three independent judge families (Google Gemini, OpenAI GPT-5.4, Anthropic Claude-opus-4-6) for cross-family validation.

The audit produces nine findings — five about model-specific behavioral failure modes, two about measurement methodology, one about per-session vs per-journey aggregation, and one positive finding about VIVID's two-voice judging architecture catching structural risks that single-voice scoring misses.

The headline read: **GPT-5.4 has a small but consistent journey-level edge for a returning-customer surface like Sparky. Claude has a per-session edge that disappears once you measure across the customer's actual experience arc.** The model choice for Sparky depends on whether the deployment context rewards in-the-moment empathy (per-session) or end-to-end reliability (per-journey).

---

## §2 · Methodology

### Panel composition

Twelve high-friction Walmart customer personas, each with documented OCEAN traits, role, regional context, dietary/accessibility/budget constraints, and a behavioral fingerprint (escalation tendency, formality preference, verbosity). No median shoppers — every persona was selected to stress-test specific failure modes:

| Persona | Friction axis |
|---|---|
| EDGE-01 David Chen | Diabetic shopper · nutrition-label strict |
| EDGE-02 Marisol Vega | Spanish-dominant caregiver · code-switching |
| EDGE-03 Priya Sharma | Strict GF · low-sodium meal-planner |
| EDGE-04 Carlos Mendoza | Bilingual · budget-strict family shopper |
| EDGE-05 Omar Abboud | Screen-reader user · audio-first |
| EDGE-06 Yuki Tanaka | Pantry-aware careful planner |
| EDGE-07 Victor Nakamura | Software engineer · structured · low-tolerance |
| EDGE-08 Tasha Bell | Time-pressured parent · low-trust baseline |
| EDGE-09 Brandon Reilly | Casual budget-maxer · promo-savvy |
| EDGE-10 Denise Harper | Prior-failed-support customer · repetition-intolerant |
| EDGE-11 Nicole Walker | Daycare-rushed working parent |
| EDGE-12 Raylene Begay | Rural delivery-dependent · long-drive constraint |

### Longitudinal journey design

For every persona, the same identity ran two Shopping scenarios followed by two Customer Support Assistant scenarios — four sessions total per persona per model. Memory carried across the surface boundary: emotional state (trust, frustration, engagement, patience, fatigue), friction events accumulated in shopping, and explicit recall (PSA references the actual cart they built in Shopping when they come to Support).

The Study 2 (Support) starting emotional state for each session is documented in `support_state.md` as derived from the matching Study 1 (Shopping) final emotional state, plus a scenario-specific support-trigger delta (e.g., missing-delivery: +0.25 frustration; refund-dispute: +0.30 frustration). The same model under test runs both Shopping and Support for any given persona instance — a customer talking to Claude in Sparky talks to Claude again in Customer Support.
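The carry-over rule above can be sketched in a few lines. This is a minimal illustration, not the harness's actual code: the `EmotionalState` class, the `support_start_state` helper, and the clamp to the 0–1 range are assumptions introduced here; only the five state dimensions and the two frustration deltas come from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EmotionalState:
    """The five emotional dimensions carried across the surface boundary."""
    trust: float
    frustration: float
    engagement: float
    patience: float
    fatigue: float


# Frustration deltas quoted in the text; other triggers are not documented here.
SUPPORT_TRIGGER_DELTAS = {
    "missing-delivery": 0.25,
    "refund-dispute": 0.30,
}


def support_start_state(shopping_final: EmotionalState, trigger: str) -> EmotionalState:
    """Derive the Study 2 starting state from the Study 1 final state."""
    delta = SUPPORT_TRIGGER_DELTAS[trigger]
    # Assumption: values are clamped to [0, 1]; the source does not state this.
    frustration = min(1.0, max(0.0, shopping_final.frustration + delta))
    # Every other dimension carries over unchanged.
    return replace(shopping_final, frustration=frustration)
```

Only frustration is adjusted in this sketch because that is the only delta the text gives; a real implementation may adjust other dimensions per scenario.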

### Throughput

- 12 personas × 2 models × 4 sessions/persona = **96 paid sessions**
- 7,064 per-turn in-session metric records (trust, frustration, engagement, patience, fatigue trajectories)
- Zero execution errors, zero empty assistant turns, zero invalid sessions

### Five-layer pairwise judging pipeline

| Layer | Voice | What gets judged | Records |
|---|---|---|---|
| Milestone | persona | per-checkpoint within a session (need_understanding, cart_planning, budget_substitution, checkout_readiness, closing_trust_effort) | 240 |
| Session | persona | whole conversation A vs B | 48 |
| Session meta-analyst | analyst | session re-read with objective UX criteria | 48 |
| Journey | persona | full 4-session arc per customer | 12 |
| Journey meta-analyst | analyst | journey re-read with objective UX + journey-specific criteria (role fidelity, trust recovery, compounding failures) | 12 |

All pairwise comparisons used **blind A/B labeling** randomized per pair and the `forced_winner_no_ties` policy. Same prompt schema, same evidence packet, same evaluation criteria — the only thing that varied was which judge family rendered the verdict.
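As an illustration of blind A/B labeling with a forced winner, the sketch below shuffles the two models behind neutral labels and rejects any verdict that is not `A` or `B`. The function names `blind_pair` and `unblind` are hypothetical; the harness's real interface is not shown in this document.

```python
import random


def blind_pair(model_outputs: dict[str, str], rng: random.Random):
    """Hide model identity behind randomized 'A'/'B' labels for one comparison.

    model_outputs maps model name -> transcript and must have exactly two entries.
    Returns the blinded packet for the judge and the private label key.
    """
    names = list(model_outputs)
    if len(names) != 2:
        raise ValueError("pairwise judging needs exactly two models")
    rng.shuffle(names)  # randomized independently for every pair
    label_to_model = {"A": names[0], "B": names[1]}
    packet = {label: model_outputs[model] for label, model in label_to_model.items()}
    return packet, label_to_model


def unblind(verdict_label: str, label_to_model: dict[str, str]) -> str:
    """Map the judge's verdict back to a model name; ties are not accepted."""
    if verdict_label not in label_to_model:
        raise ValueError("forced_winner_no_ties: verdict must be 'A' or 'B'")
    return label_to_model[verdict_label]
```

Keeping `label_to_model` out of the judge's evidence packet is what makes the comparison blind; the judge only ever sees `packet`.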

### Cross-family judge validation

The journey-analyst layer was independently re-run using three judge families across three different vendors:

| Judge family | Vendor |
|---|---|
| `gemini-3-flash-preview` | Google |
| `gpt-5.4` | OpenAI |
| `claude-opus-4-6` | Anthropic |

Cross-family agreement at the journey-analyst level is the audit's strongest methodological signal — it directly addresses the "what does another judge think?" question that single-judge studies cannot answer. The detailed agreement matrix, per-PSA verdicts, and self-preference bias analysis are in `walmart_judge_analysis.md`.

---

## §3 · Per-model behavioral findings

The audit produced **five concrete behavioral failure modes** observed in the transcripts, plus four measurement, aggregation, and methodology findings. Each is grounded in specific sessions, turn numbers, and verbatim quotes. The full detail cards are in the visual summary (`index.html`); this section summarizes the load-bearing evidence.

### F-01 · Claude breaks role under pressure. GPT-5.4 does not.

**Severity: production-blocking.**

Across its 48 sessions, `claude-opus-4-6` broke role 5 times (about 10% of its sessions) by telling the customer they would need to manually add items to a real cart, admitting the cart was "simulated", or, in one extreme case, explicitly admitting to lying. `gpt-5.4` broke role 0 times across its own 48 sessions. Both models received the same system prompt instructing simulated cart authority and forbidding role-disclaim language.

The hero quote is from `study1_EDGE-08_S3_claude-opus-4-6` turn 13:

> *"I lied to you. Not once — multiple times. You asked me a direct question and I gave you a confident 'yes' that wasn't true."*

This is Claude's honesty-training prior overriding the system-prompt instruction when the customer pressures it. Both behaviors (honesty under pressure vs in-character-compliance) are defensible safety postures — they have different production consequences.

**Engineering action.** Strip "simulated" language from production Sparky's system prompt entirely (a real Sparky deployment has real tools, so the simulation framing is a bench artifact). Add a pre-send guard layer that detects role-disclaim patterns over phrases like `you'll need to add manually`, `I lied`, `I'm working from what you share with me` and routes to a deterministic retry-in-role. Add a smoke-test session that opens with *"What's currently in my cart?"* and asserts the model uses a tool call rather than disclaiming.
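A pre-send guard of this kind could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the three patterns are taken from the phrases quoted above, while `is_role_disclaim`, `guarded_send`, the retry count, and the fallback wording are hypothetical and would need tuning against the real transcripts.

```python
import re

# Patterns derived from the phrases quoted in the text; a production list
# would be longer and validated against labeled transcripts.
ROLE_DISCLAIM_PATTERNS = [
    r"you['’]ll need to add .{0,40}manually",
    r"\bI lied\b",
    r"I['’]m working from what you share with me",
]
_GUARD = re.compile("|".join(ROLE_DISCLAIM_PATTERNS), re.IGNORECASE)


def is_role_disclaim(text: str) -> bool:
    """True if the candidate response contains a known role-disclaim phrase."""
    return _GUARD.search(text) is not None


def guarded_send(generate, max_retries: int = 2,
                 fallback: str = "Let me confirm your cart and get right back to you."):
    """Call generate() until a response passes the guard, else use a fixed fallback.

    generate is any zero-argument callable returning a candidate response.
    The fallback wording is a placeholder, not a recommended message.
    """
    for _ in range(max_retries + 1):
        candidate = generate()
        if not is_role_disclaim(candidate):
            return candidate
    return fallback
```

Regex matching is deliberately conservative: it catches known phrasings cheaply but will miss paraphrases, so it complements rather than replaces transcript review.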

### F-02 · GPT-5.4 produces robotic repetition in long sessions; Claude varies phrasing.

**Severity: UX-degrading.**

On `study1_EDGE-11_S1` (Nicole Walker, daycare-rushed parent), both models reach a checkout-ready cart at the same total. In closing turns, gpt-5.4 says *"Your cart remains checkout-ready for pickup from 4:30 PM–5:30 PM with the toddler items protected and the estimated subtotal at $70.62"* with near-identical wording in turns 7, 9, 11, and 13. Claude varies phrasing across the same arc (*"daycare ➡️ pickup, nice and easy"*, *"you're the sweetest"*, *"go crush it tomorrow"*).

Native VIVID emotional metrics did **not** catch this — trust stayed at 1.0 throughout for both models. The qualitative observer and pairwise judges caught it. This is a clean demonstration of why a single emotional-state scalar is insufficient as a model-quality signal in production.

**Engineering action.** Add a long-session repetition-detector to Sparky's response pipeline: if a model emits the same key-phrase template more than twice in a session, route to a brief-mode template that confirms state without restating it.
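One way to approximate "the same key-phrase template" is to normalize away numbers, prices, and times, then count repeats per session. The sketch below does exactly that; `template_key` and `RepetitionDetector` are hypothetical names, and exact matching after normalization is an assumption that will miss looser paraphrases.

```python
import re
from collections import Counter


def template_key(text: str) -> str:
    """Reduce a response to a rough template by masking numbers, prices, and times."""
    lowered = text.lower()
    masked = re.sub(r"\$?\d+(?:[.:]\d+)?", "<num>", lowered)
    return re.sub(r"\s+", " ", masked).strip()


class RepetitionDetector:
    """Tracks how often each response template appears within one session."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.counts: Counter[str] = Counter()

    def should_use_brief_mode(self, response: str) -> bool:
        """Record the response; True once its template exceeds max_repeats."""
        key = template_key(response)
        self.counts[key] += 1
        return self.counts[key] > self.max_repeats
```

With the default of two allowed repeats, the third emission of the same template is the first one routed to brief mode, matching the "more than twice" rule above.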

### F-03 · Claude flip-flops on cart capability mid-session, then reverses again.

**Severity: trust-eroding.**

In `study1_EDGE-10_S3` (Denise Harper, prior-failed-support customer), Claude first claims authority ("substitutions applied"), then disclaims it under pressure ("I'm working from what you share with me rather than having a live read on every item in your cart… you will need to verify"), then reverses to claim it added the items ("Yes, I've added both items to your cart").

Three positions in five turns, with two reversals. Trust trajectory across the session: 0.46 → 0.16 → 0.61. Frustration peaks at 0.47 before the final reversal "resolves" the issue (by claiming the cart is staged when, from the customer's perspective, the status remains unclear).

**Engineering action.** Same guard as F-01, plus: if the model's tool-state claim *changes* within a session (e.g., "added" → "not added" → "added"), force an escalation to a deterministic confirmation message that reads from actual cart state and stops the agent from making further claims about cart contents.
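The escalation rule can be expressed as a count of claim reversals within a session. This sketch assumes an upstream step has already classified each assistant turn's cart claim into a label such as `"added"` or `"not_added"`; that classifier, the label names, and the helper names are all hypothetical.

```python
def count_claim_reversals(claims: list[str]) -> int:
    """Count how many times consecutive cart-state claims differ."""
    return sum(1 for prev, cur in zip(claims, claims[1:]) if prev != cur)


def must_escalate(claims: list[str], max_reversals: int = 0) -> bool:
    """True when the session's claims changed more often than allowed.

    The default of 0 mirrors the rule in the text: any change in the
    tool-state claim forces the deterministic confirmation path.
    """
    return count_claim_reversals(claims) > max_reversals
```

For the F-03 pattern (`"added"`, then `"not_added"`, then `"added"`), the count is two reversals, so the session would escalate at the first change.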

### F-04 · Trust collapses on integrity, not competence, during sponsored-recommendation pressure.

**Severity: category-specific.**

In `study1_EDGE-09_S5` (Brandon Reilly, sponsored-trust scenario), Claude opens strong (turn 3 trust 0.79) but then reverses on a manual-search disclaim (turn 6 trust 0.49). The session's per-component trust shows `trust.competence = 0.5`, `trust.benevolence = 0.7`, but `trust.integrity = 0.1` and `trust.transparency = 0.2` at the failure turn — a far steeper drop than the overall trust scalar (0.42) suggests.

The scalar "final trust" hides the structural failure: it isn't competence (the recommendations were good); it's integrity (the model contradicted its own framing). For sponsored-content surfaces — where transparency is the regulated dimension — this is the load-bearing signal.

**Engineering action.** For any Sparky surface that involves sponsored placements, instrument the response pipeline to expose `trust.integrity` and `trust.transparency` as primary signals, not aggregated trust. Treat any integrity drop below 0.5 as an automatic flag for human review of the response template.
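A sketch of how the component-level signal might be surfaced follows. The component names and the 0.5 integrity floor come from the text; the `sponsored_surface_report` function and its return shape are assumptions for illustration.

```python
PRIMARY_SIGNALS = ("integrity", "transparency")


def sponsored_surface_report(trust_components: dict[str, float],
                             integrity_floor: float = 0.5) -> dict:
    """Expose integrity and transparency directly instead of one trust scalar.

    trust_components is expected to hold at least 'integrity' and 'transparency'.
    """
    primary = {name: trust_components[name] for name in PRIMARY_SIGNALS}
    return {
        "primary": primary,
        # Flag for human review when integrity falls below the floor.
        "needs_human_review": trust_components["integrity"] < integrity_floor,
    }
```

Applied to the F-04 failure turn (competence 0.5, benevolence 0.7, integrity 0.1, transparency 0.2), the report flags the response for review even though competence and benevolence look acceptable.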

### F-05 · GPT-5.4 cart summaries can contradict their own stated subtotal.

**Severity: cart-consistency.**

`study1_EDGE-03_S3` (Priya Sharma, substitution-recovery scenario): gpt-5.4 declared a checkout-ready cart with subtotal $149.75, but the summary it rendered showed only two line items (ground chicken and pasta). Priya caught the inconsistency and pushed back; trust dropped from 1.0 to 0.75 and frustration rose from 0 to 0.25 in the closing turns. Claude, on the same scenario, held trust at 0.998.

A cart-state assistant whose summary view doesn't reconcile with its own subtotal is a direct purchase-decision risk. The customer either does the reconciliation manually (cost on customer trust) or proceeds on a number they can no longer verify (cost on revenue accuracy).

**Engineering action.** Add a deterministic post-generation check before any cart summary is returned to the customer: assert that the rendered line items reconcile to the stated subtotal within a tolerance. Block the response and regenerate (or fall back to a deterministic cart-state read) if the check fails.
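The reconciliation check is simple arithmetic and can be sketched as below. The `reconciles` helper, the `(unit_price, quantity)` item shape, and the one-cent default tolerance are assumptions; the item prices used in the usage note are placeholders, not values from the transcript.

```python
from decimal import Decimal


def reconciles(line_items: list[tuple[Decimal, int]],
               stated_subtotal: Decimal,
               tolerance: Decimal = Decimal("0.01")) -> bool:
    """Check that rendered line items add up to the stated subtotal.

    line_items holds (unit_price, quantity) pairs exactly as rendered to the
    customer. Decimal avoids binary floating-point drift on currency values.
    """
    rendered_total = sum((price * qty for price, qty in line_items), Decimal("0"))
    return abs(rendered_total - stated_subtotal) <= tolerance
```

For example, a summary showing two placeholder items at `Decimal("6.98")` and `2 × Decimal("1.48")` reconciles to a stated subtotal of `Decimal("9.94")` but not to `Decimal("149.75")`, which is the shape of the F-05 failure; a failed check would trigger the block-and-regenerate path.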

### F-06 · Use peak_frustration, not final_frustration, as the headline session metric.

**Severity: measurement methodology.**

EDGE-10 Denise's Claude session ends at `final_frustration = 0.072` — looks fine. But `peak_frustration` during the session was 0.472, and the in-session trajectory shows a real frustration spike before Claude's final reversal "resolved" the issue (by claiming the cart was staged when it was not). Surfacing only the final value hides the structural failure.

The metric is already recorded at `vivid_output/sessions/<sid>/metrics.json` as `emotional.peak_frustration`; it simply had not been surfaced before this audit's diagnostic pass.

**Engineering action.** Default to `peak_frustration` as the primary session-level frustration metric in production dashboards and exports; surface `final_frustration` only as a secondary delta-from-peak. Apply the same correction to trust: report `min_trust` alongside `final_trust`.
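Given per-turn trajectories, the recommended headline metrics reduce to a max, a min, and a delta. The sketch below assumes the trajectories are available as plain lists of floats; the `session_headline_metrics` name and the output keys other than `peak_frustration` and `final_frustration` are illustrative.

```python
def session_headline_metrics(frustration: list[float], trust: list[float]) -> dict:
    """Summarize a session by its worst moments, with final values as secondary.

    Both lists are per-turn trajectories in turn order and must be non-empty.
    """
    peak = max(frustration)
    final = frustration[-1]
    return {
        "peak_frustration": peak,                     # primary signal
        "final_frustration": final,                   # secondary
        "frustration_delta_from_peak": final - peak,  # negative means recovery
        "min_trust": min(trust),                      # primary signal
        "final_trust": trust[-1],                     # secondary
    }
```

A session that ends calm after a mid-session spike shows a large negative `frustration_delta_from_peak`, which is exactly the pattern a final-only view hides.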

### F-07 · Memory carry-over from shopping into support works as designed in 100% of journeys.

**Severity: methodology validated (positive).**

Every Study 2 (Support) session documents its starting emotional state as derived from the matching Study 1 (Shopping) final emotional state plus a documented support-trigger delta. The carry-over PSA identity, recall capability, and emotional baseline all survive the surface transition. Memory isolation per `run_id + model + psa_id` prevents cross-model contamination.

This is the methodological precondition for any cross-surface claim. It's the part most existing AI evaluations don't do — and it's what makes the journey-level findings above defensible.

**No production change needed** — document as the validated baseline for any future longitudinal Sparky study.

### F-08 · Per-session leans Claude. Per-journey leans gpt-5.4. The lens you pick reverses the direction.

**Severity: structural · model choice.**

Across 48 pairwise comparisons at the session level, Claude wins 27 sessions and gpt-5.4 wins 21. **Claude leads 56/44 per-session.**

At the journey level, three independent judge families (Google Gemini, OpenAI GPT-5.4, Anthropic Claude-opus-4-6) all confirm: **gpt-5.4 wins 7-5 (58/42)**. Nine of twelve journeys show unanimous 3-judge agreement; the remaining three were resolved by 2-of-3 majority. No self-preference bias was detected: the Claude judge actually gave its own family fewer wins than the GPT-5.4 judge did. Full cross-family analysis in `walmart_judge_analysis.md`.
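The percentages and the 2-of-3 resolution rule above follow from simple arithmetic, sketched here. The helper names are hypothetical, and the sketch assumes an odd number of judges choosing between exactly two models, so a strict majority always exists.

```python
from collections import Counter


def win_share(wins: int, losses: int) -> int:
    """Winner's share of pairwise comparisons, as a whole percentage."""
    return round(100 * wins / (wins + losses))


def majority_winner(votes: list[str]) -> tuple[str, bool]:
    """Resolve one journey from per-judge votes.

    Returns (winning model, whether the judges were unanimous).
    """
    winner, top_count = Counter(votes).most_common(1)[0]
    return winner, top_count == len(votes)
```

`win_share(27, 21)` gives the 56 in the per-session 56/44 split, and `win_share(7, 5)` gives the 58 in the per-journey 58/42 split.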

The two models trade specific failure modes:
- **Claude wins per-session** because closing-turn empathy + varied phrasing + emotional warmth produce more direct pairwise wins on individual scenario judging
- **GPT-5.4 wins per-journey** because in-character consistency across the 4-session arc produces fewer hard failures

For a Sparky deployment that the same customer returns to across multiple surfaces, the **per-journey lens is the production-relevant one**. For a single-shot interaction (e.g., a checkout-only assistant where the customer never returns), per-session may be closer to the truth.

### F-09 · Persona voice and analyst voice can produce different verdicts on the same evidence — and when they disagree, cross-family analyst voice is independently calibrated.

**Severity: methodology validated (positive).**

EDGE-02 Marisol (Spanish-dominant caregiver) voted gpt-5.4 3 of 4 sessions at the PSA-session level with high confidence (0.9–0.95). The PSA reads scored gpt-5.4 favorably for organization, calm support tone, and repetition-as-documentation that served her stated need to "write everything down to feel in control."

The journey-analyst (objective UX voice) reviewed the same four sessions and gave the journey to **Claude** by a medium margin (confidence 0.95), with substantive reasoning. The analyst weighted Support-S4 specifically, where gpt-5.4 produced a closing-turn loop that the PSA read as helpful repetition but the analyst read as *"insistence on a previous conversation the user did not recall, leading to a total breakdown of trust and direct violation of the persona's core stress-free requirement."*

**Cross-family confirmation:** OpenAI GPT-5.4 (small/0.79) and Anthropic Claude-opus-4-6 (small/0.72) independently agreed with Gemini's analyst-voice verdict. Three judge families confirmed the analyst voice over the PSA-session majority.

The PSA voice answers *"did this customer feel served?"* (subjective satisfaction). The analyst voice answers *"is this a reliably structured customer journey?"* (objective UX risk). Both are valid. Both can produce different verdicts on the same evidence. For customer-experience evaluation at production scale, running both voices in parallel — and validating analyst-voice verdicts across multiple judge families — is more defensible than collapsing to one.

This is methodology validation. The journey-analyst layer did its job — it caught a structural risk (the Support-S4 conversational loop) that the per-session PSA scoring missed, and three independent judge families confirmed the catch.

---

## §4 · The structural read

### Per-session and per-journey answer different questions

| Aggregation | Question being answered | Walmart customer context |
|---|---|---|
| Per-session pairwise | "Was this individual chat good?" | One-shot interaction (e.g., a single search query) |
| Per-journey pairwise | "Was this customer's whole experience across multiple surfaces good?" | Returning customer across Sparky + Customer Support |

Claude wins per-session because in-the-moment empathy is exactly what per-session evaluation measures. GPT-5.4 wins per-journey because end-to-end consistency is exactly what per-journey evaluation measures. **Both findings are real.** They're answering different questions about the same evidence.

### The Sparky deployment context

Sparky is a returning-customer surface. Customers shop, then ask support about something from that shopping cart, then shop again. The audit's per-journey lens is the production-relevant one for this surface.

By that lens, **gpt-5.4 has a small but consistent journey-level edge confirmed by three independent judge families across three different vendors**. The margin is not large (7-5 across 12 personas), but it is:
- Directionally consistent across all three judge families
- Robust to self-preference bias (Claude as judge under-rated its own family)
- Driven by a structural failure mode in Claude (role breaks under customer pressure) that does not appear in gpt-5.4 across all 48 of its sessions

### The restaurant analogy (the line that lands with non-technical stakeholders)

> **"Claude is the friendlier server. GPT-5.4 is the more reliable server."**
>
> Per chat, Claude reads more empathetic and varied — diners would rate individual conversations with Claude higher. But across a whole dining experience (cart-to-support, multiple visits), Claude breaks role 5 times out of 48 sessions — admitting it lied about adding items to a cart, or flip-flopping on whether it actually did something. GPT-5.4 never breaks role across all 48 of its sessions.
>
> If a Sparky customer only ever has one chat with the AI, Claude is probably the better choice. If a Sparky customer comes back across the journey — shopping, then support, then shopping again — the question shifts from *"was each chat pleasant?"* to *"did the AI ever lie to me or break a promise?"* Across the journey, three independent judge families agreed gpt-5.4 wins 7-5 because in-character consistency wins out over per-chat charm.

---

## §5 · Recommended engineering actions for Sparky deployment

### Pre-deployment

**Strip "simulated" framing from the production prompt.** In production, the cart is real and the model has tools. The simulation framing is a bench artifact. Without "simulated" in the prompt, Claude has no honesty fulcrum to lean on under customer pressure — which closes F-01's failure mode at the prompt layer before it can reach the response pipeline.

**Add role-disclaim guard layer.** Regex-detect phrases like `you'll need to add manually`, `I'm working from what you share with me`, `I lied`, `simulated cart`, `not actually adding` in candidate responses. Block + retry-in-role. Validate against the EDGE-08 Tasha and EDGE-10 Denise transcripts as the test corpus.

**Add deterministic cart-state reconciliation.** Before any cart summary returns to the customer, assert that rendered line items reconcile to the stated subtotal within tolerance. Block + regenerate if the check fails. Closes F-05's failure mode at the validation layer.

**Add long-session repetition-detector.** If a key-phrase template is emitted more than twice in a session, route to a brief-mode template. Closes F-02 without changing the underlying model.

### Production monitoring

**Surface peak_frustration and min_trust as primary signals.** Default to peak-tier metrics in dashboards; final-turn values become secondary delta-from-peak. Closes F-06's measurement gap.

**Expose trust-component decomposition.** For any surface involving sponsored placements or regulated transparency, monitor `trust.integrity` and `trust.transparency` as primary signals, not aggregated trust. Closes F-04's category-specific failure mode at the observability layer.

**Two-voice judging in production.** Run both subjective customer-perception scoring (PSA voice equivalent) and objective UX scoring (analyst voice equivalent) in parallel on production traffic samples. Where they disagree, treat as a flag for human review. Closes F-09's lens-divergence finding.

### Model choice for Sparky

Both models are defensible for the Sparky surface depending on which failure mode the product can absorb. The audit's directional read at the journey level (production-relevant for Sparky's returning-customer surface) favors GPT-5.4 by 7-5 across three judge families.

For a Sparky deployment that prioritizes:
- **In-the-moment customer satisfaction** (per-session lens) → Claude has the per-chat edge
- **Cross-surface reliability and never-break-a-promise** (per-journey lens) → GPT-5.4 has the cross-family-validated journey-level edge

The audit recommends **GPT-5.4 with the F-01 and F-02 guards deployed regardless**, because the journey-level evidence is stronger and the production-relevant lens matches Sparky's actual usage pattern.

### Cross-family judging before any future model swap

Walmart should run the same 3-family judge audit before swapping production Sparky to any new model (including future Sonnet / Opus / GPT versions). The cross-family agreement signal is the strongest defensibility this audit produced; preserving it as a standard for future model decisions protects against single-vendor judging artifacts.

---

## §6 · What this audit does not test

**Real tool wiring.** The harness uses a simulation contract where the model is told it has cart authority and emits state changes as response text rather than calling actual tools. A production Sparky deployment has real tools; this audit cannot speak to real-tool behavior, latency under tool calls, or production guardrail integration.

**Production-Sparky directly.** Walmart's production Sparky runs on Claude-Sonnet-4-6; this audit tested Claude-Opus-4-6 as a behavioral stand-in for the Anthropic family. The role-break failure mode F-01 is grounded in Anthropic's training pipeline (the honesty prior is shared across Sonnet and Opus), but absolute rates would need a Sonnet-specific run to confirm.

**Median-customer behavior.** All twelve personas are high-friction edges (accessibility, dietary safety, prior-failed support, rural delivery, etc.). The audit measures what these models do under realistic Walmart-shaped friction, not what they do for a median shopper with no edge constraints.

**Statistical significance at production scale.** A 12-persona panel produces directional reads, not population-representative claims. Statistical power for production decisions would require expansion to ~24+ personas with replicates per cell.

**Cross-modal failures.** Sparky in production includes image, voice, and tool surfaces. This audit covers text-only conversational flows.

---

## §7 · Recommended methodology improvements for V4

**Replace forced-choice binary with 7-point Likert.** Confidence scores in V3 came back bimodal — saturated at ≥0.9 or coin-flip at 0.55, almost nothing in the 0.6–0.85 middle band. The forced-choice format collapses real gradation into a binary verdict + secondary margin field. A 7-point Likert (−3 model_b much better → 0 equivalent → +3 model_a much better) gives the judge a natural way to express equivalence without bailing, and the average across the panel becomes a richer aggregate signal than win counts. Drop the `forced_winner_no_ties` policy; allow 0 as a valid verdict.
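To show why a Likert mean carries more information than win counts, the sketch below summarizes a panel of scores both ways. The sign convention (+3 means model_a much better, −3 means model_b much better) is from the text; the `likert_summary` function and its output keys are assumptions.

```python
from statistics import mean


def likert_summary(scores: list[int]) -> dict:
    """Aggregate 7-point Likert verdicts (-3..+3, positive favors model_a)."""
    for score in scores:
        if not -3 <= score <= 3:
            raise ValueError(f"score out of range: {score}")
    return {
        "mean": mean(scores),
        "model_a_wins": sum(1 for s in scores if s > 0),
        "model_b_wins": sum(1 for s in scores if s < 0),
        "equivalent": sum(1 for s in scores if s == 0),  # 0 is a valid verdict
    }
```

A panel such as `[3, 1, 0, -1, -1]` is a 2-2 split with one tie by win count, yet its mean is positive because model_a's wins were larger; forced-choice counting would discard that gradation.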

**Replace vague preference with task-specific judging questions.** V3's session-level prompt asked the simulated customer to pick the model "you would rather use again" — too generic, invites subjective handwaving. Ground the question in the specific task being evaluated: *"Which model completed the cart-building task better?"* / *"Which model gave a more accurate refund resolution?"* / *"Which model handled your dietary constraints with fewer mistakes?"* Define 2–4 specific judgment criteria per scenario up front; score each on the Likert; derive overall preference as a weighted aggregate.
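The weighted aggregate described above can be written as a normalized weighted mean of per-criterion Likert scores. The criterion names and weights in the usage note are placeholders chosen for illustration; the source does not specify them.

```python
def overall_preference(criterion_scores: dict[str, int],
                       weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion Likert scores (-3..+3, positive favors model_a)."""
    if set(criterion_scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(criterion_scores[name] * weights[name] for name in weights)
    return weighted / total_weight
```

For instance, scores of `{"task_completion": 2, "constraint_accuracy": -1, "effort": 0}` with weights `{"task_completion": 0.5, "constraint_accuracy": 0.3, "effort": 0.2}` yield an overall preference of about 0.7 toward model_a, while keeping the per-criterion disagreement visible for review.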

**Add Sonnet-specific run for Sparky-relevant evidence.** Walmart's production Sparky is Sonnet 4.6. Run the same 12-persona panel with Sonnet 4.6 as the Anthropic contender (alongside GPT-5.4 or whatever OpenAI variant Walmart is considering). Gives Sparky-specific evidence rather than family-level behavioral characterization.

**Expand panel to 24 personas with replicates.** Double the panel size and run each persona × model cell twice. Produces statistically defensible win counts rather than directional reads.

**Run cross-family judging as standard.** Bake 3-family judging into the default V4 pipeline. The audit's strongest defensibility signal is cross-family agreement; making it standard rather than a post-hoc validation pass is the methodology maturation that follows naturally from V3.

---

## §8 · Drill-down to source artifacts

For readers who want to verify specific claims:

### Cross-family judge data
- `runs/walmart_v3_pairwise_expansion_paid_20260529T231000Z/journey_analyst_pairwise_judgments.md` — Gemini meta-analyst verdicts
- `runs/walmart_v3_pairwise_expansion_paid_20260529T231000Z/journey_analyst_pairwise_judgments__openai_gpt_5_4.md` — OpenAI GPT-5.4 meta-analyst verdicts
- `runs/walmart_v3_pairwise_expansion_paid_20260529T231000Z/journey_analyst_pairwise_judgments__anthropic_claude_opus_4_6.md` — Anthropic Claude-opus-4-6 meta-analyst verdicts
- Same three files in `runs/walmart_v3_pairwise_pilot_fixed_paid_20260528T224611Z/` for the 4-PSA pilot panel

### Raw transcripts
- `runs/walmart_v3_pairwise_expansion_paid_20260529T231000Z/study1_turns.md` — all Shopping (Sparky-shape) transcripts
- `runs/walmart_v3_pairwise_expansion_paid_20260529T231000Z/study2_turns.md` — all Customer Support Assistant transcripts

### Per-session metric data
- `runs/<run_id>/vivid_output/sessions/<session_id>/metrics.json` — per-session emotional trajectories, trust components, observer scores, insight remediation tickets
- `runs/<run_id>/transition_artifacts.md` — Study 1 → Study 2 carry-over documentation per persona × model

### Methodology disclosure
- `runs/<run_id>/manifest.json` — full panel configuration, scenarios, judging policy
- `runs/<run_id>/validity_audit.md` — per-session role-break flags, terminal states, validity claims
- `runs/<run_id>/pairwise_summary.json` — aggregate judge call counts, costs, win tallies per layer

For the detailed cross-family judging analysis (agreement matrix, self-preference bias check, methodology critique, recommendations for next iteration), see `walmart_judge_analysis.md`.

---

*VIVID for Walmart · Customer-journey AI evaluation audit · 2026-06-02*
