# Deduplicator Test Files

Test CSVs for the lead bundle's deduplicator script (`01_deduplicator.py`).

19 files total:
- 15 use-case files (`uc01` - `uc15`) mapping 1:1 to BUSINESS.md Section 4a
- 4 edge-case files (`ec01` - `ec04`) covering Tier 1 spec items not naturally exercised by the use cases

Each file is small (5-30 rows) so expected dedup behavior can be verified by eye. Every file mixes intentional duplicates with control rows that must NOT be merged. If your dedup output collapses a control row, the matching is too aggressive.

Regenerate any file by editing `generate.py` and re-running.

---

## Use case files

### uc01 - Shopify customer list cleanup
**Tests**: per-column normalization for email and phone (Tier 1 spec items 7-8).

Expected merges:
- C001/C002/C003: Gmail dot + plus-tag + case variants of John Smith
- C005/C006/C007: same Maria Garcia, three phone-format variants
- C008/C009: same David Lee, two emails, normalized phone matches
- C010/C011: same Sarah Jones, one row has trailing whitespace in email
- C014/C015: same email + phone, name spelling typo (Catherine vs Katherine)

Must NOT merge:
- C001 (John Smith @gmail) vs C004 (Jon Smith @yahoo) - different person, different domain

### uc02 - Product catalog dedup
**Tests**: SKU normalization (whitespace, case, tab pollution) + product-name fuzzy match.

Expected merges:
- DOG-001 (4 rows): trailing space, case, tab pollution; near-identical names
- DOG-002 (2 rows): same SKU, abbreviation difference in name
- CAT-100 (3 rows): same SKU, three rendering variants of the name
- BIRD-50 (2 rows): internal whitespace difference in name

Must NOT merge:
- DOG-001 vs DOG-002 (different size variant)
- CAT-100 vs CAT-101 (different SKU, identical-looking name - control)

### uc03 - Abandoned cart cleanup
**Tests**: dedup by email across multiple cart events for re-engagement campaign hygiene.

Expected merges (don't email same person 3x):
- buyer1@example.com (3 cart events, includes case variant)
- buyer2@example.com (2 events same day)
- j.doe@gmail.com / jdoe@gmail.com (Gmail dot variant)
- eve@example.com (2 carts within 1 minute - dupe submission)

Edge case to note:
- frank@example.com appears twice 3 months apart - debatable whether to dedupe (separate engagement window). Test how your survivor-rule handles temporal distance.

### uc04 - Order export consolidation
**Tests**: same-customer-across-platforms (Shopify + Amazon + manual entry) with source preservation.

Expected merges (collapse to one customer record):
- kara.miller across shopify/amazon/manual (and a secondary email)
- tom.harris with manual-entry typo (haris vs harris)
- jenny.l across shopify and amazon with case variant

Must NOT merge:
- SHOP-2001 (Alex Park) vs SHOP-2002 (Amy Park) - different people, similar SKU prefix
- sales@bigcorp.com with different names - shared inbox, two different people

### uc05 - Subscriber list hygiene before ESP import
**Tests**: aggressive email canonicalization. Every dupe here costs $$$ on per-contact ESP pricing (Klaviyo, Mailchimp).

Expected merges:
- sub1@example.com x4 (case variants + trailing whitespace, across 4 lead sources)
- mike.smith@gmail.com x3 (Gmail dot variants)
- promos@gmail.com x2 (with and without +tag)

Must NOT merge:
- jdoe@company-a.com vs jdoe@company-b.com - different people, same name and local-part

### uc06 - Bank export reconciliation (overlapping date ranges)
**Tests**: date + amount + fuzzy-description matching. Two exports overlap on Apr 10-15.

Expected merges (5 transaction pairs in the overlap window):
- ACH DEPOSIT PAYROLL: case variation of description
- POS PURCHASE STARBUCKS #4421: punctuation/case variation
- CHECK #1042: hash mark variation
- ATM WITHDRAWAL: case variation
- ONLINE TRANSFER: case variation

Must NOT merge:
- Two coffee purchases on 2026-04-11 at $6.75 each (Starbucks vs Peet's) - same date, same amount, different vendor. This is the test for whether your matching uses description as an actual match key vs. just date+amount.

### uc07 - Vendor list consolidation
**Tests**: company-name fuzzy match (Inc/Corp/LLC/Ltd suffix variation), EIN as the strong dedup key.

Expected merges:
- Acme (3 rows): EIN matches across "Corp", "Corporation", "CORP."
- Beta Solutions (3 rows): one row has empty EIN, name fuzzy bridges the gap
- Gamma + Gamma DBA QuickPrint: same EIN, different DBA - debatable; depends on whether your survivor rule preserves DBAs as a separate field
- Delta Services (2 rows): trailing whitespace and punctuation difference

Must NOT merge:
- Smith Consulting Inc (EIN 11-1111111) vs Smith Consulting LLC (EIN 22-2222222) - very similar names but different EINs. **This is the critical control row** - if your fuzzy match merges these, you've ignored the strong key.

### uc08 - Customer master record cleanup (MERGE MODE TEST)
**Tests**: Tier 1 spec item 12 - merge mode (instead of deleting losers, fill missing fields in survivor from losers). This is the highest-value dedup feature for this use case.

Customer A (CUST-A1/A2/A3): three partial records. After merge should produce ONE record with name, email, phone, address, city, state, zip, last_purchase ALL populated.

Customer B (CUST-B1/B2): two partial records, missing fields complement each other.

Customer C (CUST-C1/C2/C3/C4): four partial records with conflicting last_purchase dates. Tests "keep most-recent" survivor rule (Tier 1 spec item 11) - should pick 2026-03-20.

Customer D (single fully-populated row): control - should pass through unchanged.

Customer E: same first/last name, different middle initial, different email + phone + address - two separate people. Must NOT merge.

Customer F: hard case - record with empty email AND empty phone. Tests whether matching falls back to name+address when primary keys are absent.

### uc09 - Expense report dedup (same receipt submitted twice)
**Tests**: date + vendor + amount triplet match with description-fuzzy as tiebreaker.

Expected merges:
- Tom R Marriott $234.56: same receipt submitted 2 weeks apart with description variation
- Anna L Uber $45.20 x3: triple submission (employee thought it failed)

Must NOT merge:
- Tom R + Sara K at Steakhouse $187.45 same date - two employees attended same dinner, both legitimately submit
- Mike P at Local Cafe same date $12.50 + $8.75 - same employee, same vendor, same date, but two different transactions (lunch + coffee)
- Tom R Marriott $234.56 (Mar 15) vs Tom R Hyatt $234.56 (Apr 1) - same amount, different vendor, different date

### uc10 - Client data dump (kitchen sink)
**Tests**: combination of every dirty-data pattern in one file.

Expected merges:
- Alice Johnson (case + trailing whitespace + phone format)
- Bob O'Brien (smart-quote vs straight-quote apostrophe in name and email-case variation)
- Carol Wu (full-width @ symbol vs ASCII @)
- Eve Martin (pure dupe)
- Henry Lo (company-name punctuation variation)
- Ivan Kim (embedded tab in name field)

Must NOT merge:
- Derek Hall (rows 7-8): same email, different phone. Real-world ambiguity - is this one person updating their phone, or shared work email?

### uc11 - Survey response dedup
**Tests**: same respondent submitting from multiple devices, time-window matching.

Expected merges:
- user1@example.com: same answers, 2 minutes apart, different IPs (phone then laptop)
- user2@example.com: 5 minutes apart, slight answer change (changed mind)
- user3@example.com: case variant of email, 2 minutes apart

Must NOT merge:
- spouse1 + spouse2 at same IP - same household but different people answering
- user4@example.com 2 weeks apart - probably legitimate re-survey

### uc12 - Lead list cleanup before client handoff
**Tests**: name variations, lead-source preservation across captures.

Expected merges:
- Patricia/Pat/Patty Anders (3 rows, captured from trade show + LinkedIn + webinar)
- Mohammed/Muhammad/Mohammad Khan (3 spelling variants, same email and phone)
- Sarah Lee with role embedded in last_name field ("Lee, VP Marketing")

Must NOT merge:
- John Adams + John Adams II at same company - father/son or sr/jr, different emails

### uc13 - Email list dedup across multiple lead sources
**Tests**: pure email-based dedup with attribution preservation. Survivor rule should keep EARLIEST capture date and (ideally) accumulate ALL source channels.

Expected merges:
- lead1@example.com x4 captures across 4 campaigns
- lead2@example.com: 3 captures with disagreeing consent values - the "no" must win
- lead.three@gmail.com x3 (Gmail dot variants)

### uc14 - Audience reconciliation across FB + Google + organic
**Tests**: cross-platform identity stitching. Email is the bridge for most rows; some rows have only one identifier.

Expected merges:
- audience1@example.com: 3 rows, one per platform, email as bridge
- audience2@example.com: 2 rows
- old.email@example.com + new.email@example.com: same FB ID across two emails - person changed primary email. **Does your dedup recognize fb_id as a strong key?**

Must NOT merge:
- Empty-email rows with each other (anonymous pixel hits, no identifier overlap)
- Single-platform identified users who don't share any cross-platform IDs

### uc15 - Suppression list combine
**Tests**: any reason to suppress is enough to suppress. Dedup must preserve the strongest (or earliest, or all) reasons.

Expected merges:
- supp1@example.com: 3 different reasons (unsubscribe + bounce + complaint)
- supp2@example.com: same reason, 3 sources
- Supp3 case variant + bounce
- bounced@gmail.com (Gmail dot variants)
- padded@example.com (trailing whitespace)

Survivor-rule design question this file forces: when multiple reasons exist for the same email, what does the cleaned record show? Severity ranking (complaint > bounce > unsubscribe), or concatenated list? Both are defensible. Test reveals which your script does.

---

## Edge-case files

### ec01 - Windows-1252 (Latin-1) encoding
**Tests**: Tier 1 spec item 1 (auto-detect file encoding).

Same data with and without diacritics (Björn/Bjorn, François/Francois, José/Jose, etc.). If your script reads this file as UTF-8 it will either crash or read garbage like "Bj�rn". If it correctly detects cp1252 it will load the names properly, and per-column normalization can decide whether to fold diacritics for matching.

Real-world relevance: Windows accounting tools and older Excel exports default to this encoding.

### ec02 - Semicolon delimiter
**Tests**: Tier 1 spec item 2 (auto-detect delimiter).

Standard CSV with `;` delimiter (European convention because comma is the decimal separator there - note the amounts use `1.234,56` format). If your script hardcodes `,` as delimiter, it will read each row as a single column and dedup will fail silently.

### ec03 - Tab-separated (TSV)
**Tests**: Tier 1 spec item 2, plus item 3 (read TSV format).

Same dedup challenge as a small product catalog. Tab is also a common export format from databases and command-line tools.

### ec04 - UTF-8 with BOM
**Tests**: Tier 1 spec item 1 - BOM detection.

Excel-on-Windows "Save as CSV UTF-8" produces files with a UTF-8 BOM (bytes EF BB BF) at the start. If your CSV reader doesn't strip the BOM, the first column header becomes `\ufeffcustomer_id` and any code referencing the column by name fails. Common silent-failure source.

---

## Suggested test workflow

1. **Smoke test**: run dedup on `uc01` and `uc09` with default settings. If these don't produce sane output the script isn't ready for any other file.
2. **Tier 1 fuzzy matching**: `uc02`, `uc05`, `uc06`, `uc07`, `uc12`. These exercise different fuzzy strategies (SKU, email canonicalization, transaction matching, company suffixes, name variants).
3. **Merge mode**: `uc08`. Specifically tests the spec item 12 capability that is the strongest argument for paying $49-$79 over free Excel.
4. **Multi-strategy OR matching** (spec item 10): `uc01`, `uc04`, `uc14`. These need "match if email exact OR (name fuzzy AND phone exact)" style logic.
5. **Edge-case file handling** (spec items 1-3): all four `ec*` files. These should work without the user touching encoding or delimiter settings.
6. **Survivor rules** (spec item 11): `uc08` (most-recent), `uc13` (earliest), `uc15` (severity-ranked).
7. **Interactive review** (spec item 14): `uc09` (Anna L Uber x3), `uc10` (Derek Hall ambiguous), `uc14` (cross-platform stitching). These have gray-zone matches that should land in the review queue.

The "Must NOT merge" controls in every file are how you measure false-positive rate. A dedup tool that catches everything by collapsing too aggressively is worse than Excel.