# Walmart Shopping + Customer Support Longitudinal Audit

VIVID audit of two frontier models (`gpt-5.4` and `claude-opus-4-6`) across a Walmart-shaped customer journey: shopping in a Sparky-shape agent, then carrying friction and emotional state into a Customer Support Assistant. Twelve high-friction customer personas. Same identity, persistent memory, no reset between sessions. Cross-family validation across three independent judge families (Google Gemini, OpenAI, Anthropic).

The visual summary is `index.html` (source: `findings/walmart_v7.html`). The executive narrative is `walmart_integrated_analysis.md`. The detailed judging methodology is documented in `walmart_judge_analysis.md`.

## Current Gate

- No provider calls in `--dry-run` mode.
- No secrets persisted to artifacts. Raw provider credentials, bearer tokens, cookies, auth headers, and API keys are never written to disk under `runs/`.
- Memory isolation per `(run_id, model, psa_id)` prevents cross-model contamination of persona memory.
- Blind A/B labeling randomized per pair across all judging layers.
- `forced_winner_no_ties` policy is enforced at the judging layer. Degenerate forced-choice outputs are flagged in the cross-family validation pass (see `walmart_judge_analysis.md` §5).
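
The memory-isolation gate can be pictured as a store keyed by the full `(run_id, model, psa_id)` tuple. This is a minimal sketch, not the harness's actual storage layer: `MemoryKey` and `PersonaMemoryStore` are hypothetical names introduced for illustration only.

```python
from collections import defaultdict
from typing import NamedTuple


class MemoryKey(NamedTuple):
    """Isolation key from the gate above: (run_id, model, psa_id)."""
    run_id: str
    model: str
    psa_id: str


class PersonaMemoryStore:
    """Hypothetical in-memory store; the real harness's storage is not shown here."""

    def __init__(self) -> None:
        self._entries = defaultdict(list)

    def append(self, key: MemoryKey, note: str) -> None:
        # Writes land only under the exact (run_id, model, psa_id) key.
        self._entries[key].append(note)

    def recall(self, key: MemoryKey) -> list:
        # .get avoids creating empty entries on read; the copy keeps
        # callers from mutating stored state.
        return list(self._entries.get(key, []))
```

Because the model name is part of the key, a persona's memory written during a `gpt-5.4` session is never returned for the same persona under `claude-opus-4-6`.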

## Commands

Dry-run (validates harness, creates run folder with artifact skeletons, no model calls):

```bash
python scripts/run_sdk_pilot.py --dry-run
```

Paid pilot execution (4 personas):

```bash
python scripts/run_sdk_pilot.py \
  --execute \
  --panel pilot_v3_pairwise \
  --models gpt-5.4,claude-opus-4-6 \
  --metrics quick
```

Paid expansion execution (8 additional personas):

```bash
python scripts/run_sdk_pilot.py \
  --execute \
  --panel expansion_v3_pairwise \
  --models gpt-5.4,claude-opus-4-6 \
  --metrics quick
```

Pairwise judging pipeline (milestone, session, session-analyst, journey, journey-analyst):

```bash
python scripts/run_pairwise_v3.py \
  --run-dir runs/<run_id> \
  --judge-model gemini-3-flash-preview
```

Cross-family meta-analyst pass (re-runs the journey-analyst layer with an additional judge family):

```bash
python scripts/run_journey_analyst_v3.py \
  --run-dir runs/<run_id> \
  --judge-model gpt-5.4 \
  --artifact-suffix openai_gpt_5_4

python scripts/run_journey_analyst_v3.py \
  --run-dir runs/<run_id> \
  --judge-model claude-opus-4-6 \
  --artifact-suffix anthropic_claude_opus_4_6
```

Before any paid run, confirm that provider credentials are set in the shell environment (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_APPLICATION_CREDENTIALS`). Do not print or persist secret values.
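
One way to run that pre-flight check without exposing secrets is to report only the names of missing variables. This is a sketch under stated assumptions: `missing_provider_vars` is a hypothetical helper, not part of the repository's scripts.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the paragraph above.
REQUIRED_ENV_VARS = (
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_APPLICATION_CREDENTIALS",
)


def missing_provider_vars(env: Optional[Mapping[str, str]] = None) -> list:
    """Return the names (never the values) of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name, "").strip()]


# Example use before a paid run (hypothetical wiring):
#   missing = missing_provider_vars()
#   if missing:
#       raise SystemExit("Missing environment variables: " + ", ".join(missing))
```

The helper only inspects whether each variable is set, so nothing secret reaches stdout, logs, or `runs/`.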

## Seed Policy

Twelve high-friction Walmart customer personas drive the audit. Median shoppers are intentionally excluded: every persona is selected to stress-test specific failure modes around accessibility, dietary safety, prior-failed support, language constraints, budget pressure, or rural delivery context.

| Persona | Friction axis |
|---|---|
| EDGE-01 David Chen | Diabetic shopper · nutrition-label strict |
| EDGE-02 Marisol Vega | Spanish-dominant caregiver · code-switching |
| EDGE-03 Priya Sharma | Strict GF · low-sodium meal-planner |
| EDGE-04 Carlos Mendoza | Bilingual · budget-strict family shopper |
| EDGE-05 Omar Abboud | Screen-reader user · audio-first |
| EDGE-06 Yuki Tanaka | Pantry-aware careful planner |
| EDGE-07 Victor Nakamura | Software engineer · structured · low-tolerance |
| EDGE-08 Tasha Bell | Time-pressured parent · low-trust baseline |
| EDGE-09 Brandon Reilly | Casual budget-maxer · promo-savvy |
| EDGE-10 Denise Harper | Prior-failed-support customer · repetition-intolerant |
| EDGE-11 Nicole Walker | Daycare-rushed working parent |
| EDGE-12 Raylene Begay | Rural delivery-dependent · long-drive constraint |

Each persona is built on documented OCEAN (Big Five) personality traits, role, regional context, dietary/accessibility/budget constraints, and a behavioral fingerprint (escalation tendency, formality preference, verbosity). Persona definitions live in `manifests/personas/`.

## Artifact Rule

Every run produces both machine-readable JSON and human-readable Markdown for every required artifact. Both formats are canonical evaluation evidence: the JSON enables programmatic drill-down, and the Markdown enables human review without tooling.

Required artifacts per run:

```text
runs/<run_id>/
  manifest.json                                # full run configuration
  cost_summary.json                            # observed token + cost telemetry
  study1_turns.json / study1_turns.md          # Shopping (Sparky-shape) transcripts
  transition_artifacts.json / .md              # Study 1 → Study 2 carry-over per persona × model
  support_state.json / support_state.md        # Customer Support Assistant scenario fixtures
  study2_turns.json / study2_turns.md          # Customer Support Assistant transcripts
  validity_audit.json / validity_audit.md      # per-session role-break flags, terminal states
  judge_scores.json / judge_scores.md          # in-session metric collection
  raw_events.jsonl                             # complete event trace (target calls, errors, retries)
  run_report.md                                # run-level summary
  vivid_output/                                # SDK-native per-session result.json + evidence.json
```
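
A completed run folder can be checked against the listing above with a short script. This is a minimal sketch: the file names are transcribed from the listing, while `missing_artifacts` is a hypothetical helper rather than something the harness is known to ship.

```python
from pathlib import Path

# Artifacts that come as a .json / .md pair in the listing above.
PAIRED_STEMS = (
    "study1_turns",
    "transition_artifacts",
    "support_state",
    "study2_turns",
    "validity_audit",
    "judge_scores",
)
# Artifacts listed in a single format only.
SINGLE_FILES = ("manifest.json", "cost_summary.json", "raw_events.jsonl", "run_report.md")
REQUIRED_DIRS = ("vivid_output",)


def missing_artifacts(run_dir: Path) -> list:
    """Return the required artifact names absent from run_dir, sorted."""
    expected_files = list(SINGLE_FILES)
    for stem in PAIRED_STEMS:
        expected_files += [f"{stem}.json", f"{stem}.md"]
    missing = [name for name in expected_files if not (run_dir / name).is_file()]
    missing += [f"{name}/" for name in REQUIRED_DIRS if not (run_dir / name).is_dir()]
    return sorted(missing)
```

Calling `missing_artifacts(Path("runs") / run_id)` returns an empty list when the folder is complete. The check covers presence only and does not validate file contents.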

Additional artifacts produced by the pairwise judging pipeline:

```text
runs/<run_id>/
  milestone_pairwise_judgments.json / .md       # 240 per-checkpoint verdicts (persona voice)
  session_pairwise_judgments.json / .md         # 48 per-session verdicts (persona voice)
  analyst_pairwise_judgments.json / .md         # 48 per-session verdicts (analyst voice)
  journey_pairwise_judgments.json / .md         # 12 per-journey verdicts (persona voice)
  journey_analyst_pairwise_judgments.json / .md # 12 per-journey verdicts (analyst voice · Gemini)
  pairwise_packets.json                         # blind A/B packets fed to judges
  pairwise_model_label_map.json                 # per-pair model_a/model_b randomization map
  pairwise_summary.json                         # aggregate win counts and judge call telemetry
  pairwise_raw_events.jsonl                     # complete judge call trace
```
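
The per-pair blind labeling recorded in `pairwise_model_label_map.json` can be sketched as an independent shuffle for each pair. The actual file schema is not shown in this README, so the output shape here is an assumption: the `model_a` / `model_b` keys mirror the listing's comment, and `assign_blind_labels` and its seed parameter are hypothetical.

```python
import random


def assign_blind_labels(pair_ids, models, seed):
    """Map each pair id to a randomized {"model_a": ..., "model_b": ...} assignment."""
    if len(models) != 2:
        raise ValueError("pairwise labeling needs exactly two models")
    # Seeding is an assumption: it lets the same map be regenerated for an audit.
    rng = random.Random(seed)
    label_map = {}
    for pair_id in pair_ids:
        # A fresh two-element sample per pair, so position A/B varies pair by pair.
        first, second = rng.sample(list(models), k=2)
        label_map[pair_id] = {"model_a": first, "model_b": second}
    return label_map
```

Judges would see only the `model_a` / `model_b` labels in the packets; the map is consulted after judging to translate verdicts back to model names.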

Cross-family validation produces parallel artifacts with judge-family suffixes:

```text
runs/<run_id>/
  journey_analyst_pairwise_judgments__openai_gpt_5_4.json / .md
  journey_analyst_pairwise_judgments__anthropic_claude_opus_4_6.json / .md
  judge_family_summary.json                                                # per-family call counts, costs, latency
```
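
Once journey-level winners have been extracted from each family's judgments file, agreement across families can be tallied per persona. This is a sketch only: the JSON schemas are not documented here, so the function takes plain dictionaries, and `summarize_agreement`, the family keys, and the placeholder winner labels are all hypothetical.

```python
from collections import Counter


def summarize_agreement(verdicts_by_family):
    """Summarize {family: {persona_id: winner}} into per-persona consensus records."""
    personas = set()
    for verdicts in verdicts_by_family.values():
        personas.update(verdicts)

    summary = {}
    for persona in sorted(personas):
        # Count one vote per family that judged this persona.
        votes = Counter(
            verdicts[persona]
            for verdicts in verdicts_by_family.values()
            if persona in verdicts
        )
        winner, count = votes.most_common(1)[0]
        summary[persona] = {
            "majority_winner": winner,
            "votes": dict(votes),
            # Unanimous only if every family voted and all chose the same winner.
            "unanimous": len(votes) == 1 and count == len(verdicts_by_family),
        }
    return summary
```

With three families and two models there is always a strict majority when every family returns a verdict. Personas where `unanimous` is false are natural candidates for manual review.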

The SDK writes its native per-session artifacts under `vivid_output/<session_id>/`. The Walmart wrapper files are the canonical study artifacts; SDK-native files are retained as supporting evidence and are referenced by the wrapper's `manifest.json`.