# TEST-CASES.md - `02_text_cleaner.py` Test Corpus

**Version**: 1.0
**Last updated**: April 29, 2026
**Companion to**: TECHNICAL.md Section 9 (script boundaries) and the per-script functional spec template introduced in TECHNICAL.md Section 10.1.

## Purpose of this document

Defines the complete set of behaviors `02_text_cleaner.py` is expected to exhibit, with one test fixture per behavior. Used as:

1. The build target when porting the (currently skeleton) script to working state.
2. The pytest input set once the script ships.
3. The acceptance criteria for the GUI port (every fixture must produce its expected output through both CLI and Streamlit GUI).

Each test case has an input file in `test_data/` and (where exact-diff comparison applies) an expected-output file in `expected/`.

---

## 1. Scope boundary (what 02 owns vs what it doesn't)

This is the load-bearing decision. Every contested case routes back to it.

**02 owns: character-level hygiene only.**

- Whitespace normalization (outer trim + internal collapse for text columns).
- Unicode normalization (NFC by default, NFKC opt-in).
- Smart-punctuation ASCII-fication (curly quotes, em/en dash, ellipsis, primes).
- Invisible / zero-width character stripping.
- Control character stripping (with explicit allowlist for tab/newline inside quoted cells).
- BOM detection on input, never written on output.
- Line-ending normalization at the file level AND inside multi-line cells.
- Optional case operations (per-column, opt-in only).

**02 does NOT own:**

| Concern | Owned by |
|---|---|
| Detecting and replacing nulls / sentinel codes | `04_missing_value_handler` |
| Reformatting dates, currencies, phones, names, addresses | `03_format_standardizer` |
| Outlier detection or domain-rule violations | `06_outlier_detector` |
| Renaming or reordering columns | `05_column_mapper_enforcer` |
| Deduplication (even though dedup normalizes internally) | `01_deduplicator` |
| File encoding detection on read | The shared I/O layer in `src/core/io.py` |

**Invariant 02 must preserve:** after running 02, the schema (column count, column order, row count) is unchanged. 02 changes cell *content*, never *structure*. The one nuance: a cell containing only whitespace becomes an empty string, but the cell still exists and the row is not dropped.
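The invariant can be phrased as a lightweight post-condition check. A sketch (the function name and row representation are illustrative, not from the codebase):

```python
def assert_schema_preserved(before: list[list[str]], after: list[list[str]]) -> None:
    # Cleaning may change cell *content*, never *structure*:
    # same row count, and every row keeps its cell count.
    assert len(before) == len(after), "row count changed"
    for i, (b, a) in enumerate(zip(before, after)):
        assert len(b) == len(a), f"cell count changed in row {i}"
```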

---

## 2. Default configuration assumed by these tests

Tests assume the default config below. Any test that exercises a non-default flag explicitly says so in its description.

| Setting | Default | Notes |
|---|---|---|
| `--trim` | on | Strip leading/trailing whitespace including Unicode whitespace (NBSP, NNBSP, ideographic space, etc.) |
| `--collapse-internal` | on (text columns only) | Collapse runs of internal whitespace to a single ASCII space, ONLY in cells that don't parse as numeric, date, or phone-shaped |
| `--unicode-form` | NFC | NFKC available as opt-in; folds ligatures and fullwidth |
| `--smart-quotes` | on | Curly to straight, em/en dash to hyphen, ellipsis to `...`, primes to `'`/`"` |
| `--strip-zero-width` | on | ZWSP, ZWJ, ZWNJ, LRM, RLM, soft hyphen, word joiner |
| `--strip-controls` | on | Strip C0 (except `\t\n\r` inside quoted cells) and DEL |
| `--strip-bom` | on | BOM removed on read; never written on output |
| `--line-endings` | LF | File-level AND embedded-cell line endings normalized to LF |
| `--case` | none | Case operations are opt-in per column |
| `--fix-mojibake` | off | Logged as warning by default; opt-in repair via ftfy |
| `--columns` | all | All text columns processed; `--columns name,email` restricts |

**Idempotency requirement:** for any input X, `clean(clean(X)) == clean(X)`. This is a property test, not a fixture-comparison test. Every fixture below should be run through the cleaner twice and produce identical output both times.

---

## 3. Test case index

| # | File | Category | What it tests | Diff-testable |
|---|---|---|---|---|
| 01 | `01_whitespace_basic.csv` | Whitespace | ASCII space + tab, leading/trailing/internal | Yes |
| 02 | `02_whitespace_unicode.csv` | Whitespace | NBSP, narrow NBSP, ideographic, em/thin space | Yes |
| 03 | `03_smart_punctuation.csv` | Punctuation | Curly quotes, em/en dash, ellipsis, primes | Yes |
| 04 | `04_unicode_forms.csv` | Unicode | NFC vs NFD, ligatures, fullwidth, presentation forms | Yes |
| 05 | `05_zero_width_invisible.csv` | Invisible | ZWSP, ZWJ, ZWNJ, LRM, RLM, soft hyphen | Yes |
| 06 | `06_control_characters.csv` | Control | NUL, BEL, BS, VT, FF, ESC, DEL | Yes |
| 07 | `07_bom_utf8.csv` | Encoding | UTF-8 BOM at file start | Yes (byte-exact) |
| 08 | `08_line_endings_crlf.csv` | Line endings | All CRLF (Windows) | Yes (byte-exact) |
| 09 | `09_line_endings_cr.csv` | Line endings | All CR (classic Mac) | Yes (byte-exact) |
| 10 | `10_line_endings_mixed.csv` | Line endings | CRLF + LF + CR mixed in one file | Yes (byte-exact) |
| 11 | `11_embedded_newlines.csv` | Line endings | Newlines inside quoted cells (preserve, normalize) | Yes |
| 12 | `12_case_variations.csv` | Case | Mixed case across name/email/product columns | 3 outputs (default + 2 modes) |
| 13 | `13_non_latin_scripts.csv` | Preservation | Chinese, Japanese, Arabic, Russian, emoji | Yes |
| 14 | `14_mojibake.csv` | Encoding | Double-encoded UTF-8 (warn-by-default; fix opt-in) | 2 outputs (default + fixed) |
| 15 | `15_whitespace_only_cells.csv` | Boundary (vs 04) | Cells containing only whitespace become empty | Yes |
| 16 | `16_dirty_headers.csv` | Headers | Headers themselves have whitespace, BOM, smart quotes | Yes |
| 17 | `17_preserve_intended.csv` | Negative | Things 02 must NOT touch | Yes |
| 18 | `18_empty_file.csv` | Edge | Zero-byte file | Yes |
| 19 | `19_headers_only.csv` | Edge | Headers but no data rows | Yes |
| 20 | `20_kitchen_sink.csv` | Integration | Everything combined in one file | Yes |
| 21 | `21_excel_pollution.xlsx` | Excel-specific | Multi-sheet, Alt+Enter cells, force-text, copy-paste pollution | No (manual) |

---

## 4. Per-test details

### 01 - Whitespace basic

**File**: `test_data/01_whitespace_basic.csv` -> `expected/01_whitespace_basic.csv`

Tests the core whitespace contract on ASCII space and tab characters. Every kind of placement: leading-only, trailing-only, both, internal-multiple, tab-padded, multiple internal multi-space runs in one cell, all of the above combined.

**Expected behavior:**
- Leading and trailing whitespace stripped from every cell.
- Internal runs of whitespace collapsed to a single ASCII space.
- Tabs treated as whitespace by both rules.

**Why it matters:** This is the highest-frequency real-world pollution. Trailing-space pollution alone is what the v1.5 audit identified as the gap that motivated creating script 02 in the first place (DECISIONS.md v1.5 entry).

---

### 02 - Whitespace, Unicode

**File**: `test_data/02_whitespace_unicode.csv` -> `expected/02_whitespace_unicode.csv`

The whitespace pretenders. Python's `str.strip()` with no argument actually does strip these in 3.x, but a lot of cleaners written by people who were burned in 2.x explicitly pass `' \t\n'` and miss them. Excel and Word produce these constantly when you copy from a styled document.

Characters covered: NBSP (U+00A0), narrow NBSP (U+202F), ideographic space (U+3000), em space (U+2003), thin space (U+2009).

**Expected behavior:** treated identically to ASCII space - trimmed at edges, collapsed internally.

**Why it matters:** "It looks fine but the join doesn't match" debugging sessions almost always end here. NBSP-padded keys are the silent killer.
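A sketch of the trim-plus-collapse step. On `str` input, Python `re`'s `\s` already matches the Unicode pretenders (NBSP, em space, ideographic space), so one pattern covers cases 01 and 02; the function name is illustrative:

```python
import re

# On str, \s matches Unicode whitespace: NBSP, U+202F, U+2003, U+3000, etc.
_WS_RUN = re.compile(r"\s+")

def normalize_whitespace(cell: str) -> str:
    # Collapse internal runs to one ASCII space, then trim the edges.
    return _WS_RUN.sub(" ", cell).strip()
```

Note the config caveat: per Section 2, the collapse half applies only to text columns; trim applies everywhere.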

---

### 03 - Smart punctuation

**File**: `test_data/03_smart_punctuation.csv` -> `expected/03_smart_punctuation.csv`

Curly quotes, dashes, ellipsis, primes - the autocorrect-as-you-type damage from Word/Excel. ASCII-fy where round-trip-safe.

| Input | Output | Notes |
|---|---|---|
| `\u201c` `\u201d` (curly double) | `"` | |
| `\u2018` `\u2019` (curly single) | `'` | Includes apostrophe |
| `\u2014` (em-dash) | `-` | |
| `\u2013` (en-dash) | `-` | |
| `\u2026` (ellipsis) | `...` | |
| `\u2032` (prime) | `'` | |
| `\u2033` (double prime) | `"` | |
| `\u00ab` `\u00bb` (guillemets) | `"` | |
| `\u00d7` (multiplication sign) | **preserved** | Not safely round-trip-able to ASCII; `x` would be wrong |
| `\u00b1` (plus-minus) | **preserved** | Same reasoning |

**Why it matters:** smart-quote pollution breaks regex, breaks downstream parsers, and breaks string equality joins. The two preservation cases (multiplication, plus-minus) are deliberate - they have no faithful ASCII equivalent and forcing one is destructive.
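The table above maps naturally onto a translation table. A sketch that maps only what the table marks as replaced, so `\u00d7` and `\u00b1` pass through untouched:

```python
SMART_TO_ASCII = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes / apostrophe
    "\u2014": "-", "\u2013": "-",   # em / en dash
    "\u2026": "...",                # ellipsis (one char -> three)
    "\u2032": "'", "\u2033": '"',   # prime / double prime
    "\u00ab": '"', "\u00bb": '"',   # guillemets
    # U+00D7 and U+00B1 deliberately absent: preserved per the table.
})

def asciify_punctuation(cell: str) -> str:
    return cell.translate(SMART_TO_ASCII)
```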

---

### 04 - Unicode normalization forms

**File**: `test_data/04_unicode_forms.csv` -> `expected/04_unicode_forms.csv`

`café` can be encoded two ways:

- NFC: `caf\u00e9` (one code point, e-acute as a unit)
- NFD: `cafe\u0301` (two code points, plain e + combining accent)

These render identically. They compare unequal. They have different lengths. macOS has historically favored NFD (HFS+ enforced it for filenames, and many Mac apps still emit it), which means a CSV exported from a Mac and joined against a CSV from Excel can silently fail.

Default normalization: NFC (most compact, what Excel emits, what most Western databases expect).

**Cases covered:**
- Pre-composed (NFC) e-acute and i-diaeresis.
- Decomposed (NFD) versions of the same.
- The `\uFB03` `ffi` ligature - **preserved** under NFC (NFKC would fold it to `ffi`).
- Fullwidth Latin letters (`\uFF21\uFF22\uFF23` = `ＡＢＣ`) - **preserved** under NFC.
- Roman numeral nine character (`\u2168`) - **preserved** under NFC.

After cleaning, rows 1 and 2 must produce identical bytes (NFC and NFD both normalized to NFC). Same for rows 3 and 4.

**Why it matters:** Mac-vs-Windows data joins. Catches "they look the same but won't match" bugs.

**Opt-in `--unicode-form=NFKC` test:** not provided as a fixture but should exist as a unit test. Under NFKC, ligature folds to `ffi`, fullwidth folds to ASCII `ABC`, roman numeral folds to `IX`. NFKC is destructive for some legitimate text (mathematical notation, some CJK content) so it stays opt-in.
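In Python this is `unicodedata.normalize`; a quick sketch of the NFC default versus the NFKC opt-in, using the characters from the cases above:

```python
import unicodedata

def normalize_unicode(cell: str, form: str = "NFC") -> str:
    # form is one of "NFC", "NFD", "NFKC", "NFKD"; NFC is the default policy.
    return unicodedata.normalize(form, cell)
```

Under NFC, decomposed `cafe\u0301` becomes pre-composed `caf\u00e9`, while the `\uFB03` ligature and fullwidth letters survive; NFKC folds both.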

---

### 05 - Zero-width and invisible characters

**File**: `test_data/05_zero_width_invisible.csv` -> `expected/05_zero_width_invisible.csv`

These bytes show up from rich-text copy/paste, from RTL text, from accidentally-included U+FEFF in the middle of a cell (yes, this happens), and from some web-form pastes.

Characters covered: U+200B (ZWSP), U+200C (ZWNJ), U+200D (ZWJ), U+200E (LRM), U+200F (RLM), U+00AD (soft hyphen), U+2060 (word joiner).

**Expected behavior:** all stripped unconditionally. None of these has a legitimate role in tabular data cells, even when there's a domain reason for them in prose (typesetting Arabic, hyphenation hints in long-form text). For a CSV, they're noise.

**Why it matters:** these are the *truly invisible* polluters. You can stare at the cell forever and not see them. They break joins, they bloat string lengths, they hash differently. The first time a buyer hits a zero-width-space in a customer name, this test is what saves them.
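The strip itself is a one-liner over the code points listed above; including U+FEFF here also covers a BOM leaked mid-cell (a sketch):

```python
# ZWSP, ZWNJ, ZWJ, LRM, RLM, soft hyphen, word joiner, and mid-cell FEFF.
INVISIBLES = dict.fromkeys(
    map(ord, "\u200b\u200c\u200d\u200e\u200f\u00ad\u2060\ufeff"), None)

def strip_invisible(cell: str) -> str:
    # str.translate with None values deletes the mapped code points.
    return cell.translate(INVISIBLES)
```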

---

### 06 - Control characters

**File**: `test_data/06_control_characters.csv` -> `expected/06_control_characters.csv`

The C0 controls (U+0000 through U+001F) plus DEL (U+007F). Test cases: NUL, BEL, BS, VT, FF, ESC, DEL, and a multi-control combination.

**Expected behavior:** all stripped from cell content.

**The exception:** tab (U+0009), LF (U+000A), and CR (U+000D) are NOT stripped from inside quoted cells. Tab might be intentional formatting; LF/CR are handled by line-ending normalization (case 11). Outside of quoted cells, tab is whitespace and gets normalized like space.

**Why it matters:** real-world exports from broken systems, half-corrupted database dumps, copy-paste from terminals (including ANSI escape sequences starting with ESC), and binary data accidentally exported as text all leave these in cells. A NUL byte mid-string breaks C-string-based parsers; a BEL makes terminals beep when you `cat` the file; ESC sequences corrupt logs.
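A sketch of the control-strip with the tab/LF/CR carve-out (those three are owned by the whitespace and line-ending rules instead):

```python
# C0 controls U+0000-U+001F except tab (0x09), LF (0x0A), CR (0x0D), plus DEL.
CONTROLS = {cp: None for cp in range(0x20) if cp not in (0x09, 0x0A, 0x0D)}
CONTROLS[0x7F] = None

def strip_controls(cell: str) -> str:
    return cell.translate(CONTROLS)
```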

---

### 07 - UTF-8 BOM

**File**: `test_data/07_bom_utf8.csv` -> `expected/07_bom_utf8.csv` (byte-exact comparison)

File starts with the three-byte sequence `EF BB BF`. Excel writes UTF-8 with BOM by default. Pandas `read_csv` usually handles this but leaves the BOM as part of the first column header name unless you pass `encoding='utf-8-sig'`. Result: a mystery column called `\ufeffid` that breaks every `df["id"]` lookup downstream.

**Expected behavior:**
- BOM stripped on read.
- First column header is the clean string `id`, not `\ufeffid`.
- Output file is written WITHOUT a BOM.

**Diff target:** byte-for-byte equality with `expected/07_bom_utf8.csv`. The expected file must NOT have the BOM.

**Why it matters:** Excel-origin data is the dominant input for the target buyer. Getting BOM handling wrong silently breaks the rest of the pipeline.
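At the codec level the fix is tiny: Python's `utf-8-sig` codec strips a leading BOM if present and is a no-op otherwise (a sketch):

```python
def decode_utf8_no_bom(raw: bytes) -> str:
    # 'utf-8-sig' consumes a leading EF BB BF; plain UTF-8 passes through.
    return raw.decode("utf-8-sig")
```

On the write side the complement is to encode with plain `"utf-8"`, never `"utf-8-sig"`, so the output stays BOM-free.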

---

### 08, 09, 10 - Line endings: CRLF, CR-only, mixed

**Files**: `08_line_endings_crlf.csv`, `09_line_endings_cr.csv`, `10_line_endings_mixed.csv`

- 08: every line ends with CRLF (`\r\n`). Standard Windows.
- 09: every line ends with CR (`\r`) only. Classic Mac. Rare but seen.
- 10: same file contains all three: CRLF, LF, CR, CRLF, LF.

**Expected behavior on output:** all lines end with LF (`\n`). Byte-exact match to the expected files.

**Why LF as the default output:** it's what Linux uses, what every modern code editor handles, what Git stores by default, and what Streamlit / pandas write by default. CRLF is an option for buyers who specifically need Windows-style output, but the default minimizes round-trip surprises.

**Why it matters:** mixed line endings cause "ghost rows" in some parsers, blank lines in some editors, and silent data loss in any tool that splits on one specific newline pattern. Case 10 is the disaster scenario - multi-source concat - and is the most important of the three.
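All three cases reduce to one substitution, as long as CRLF is consumed as a unit before stray CRs (a sketch):

```python
import re

def normalize_newlines(text: str) -> str:
    # \r\n? matches CRLF (greedily, as a pair) or a lone CR; LF needs no change.
    return re.sub(r"\r\n?", "\n", text)
```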

---

### 11 - Embedded newlines inside quoted cells

**File**: `test_data/11_embedded_newlines.csv` -> `expected/11_embedded_newlines.csv`

The trap. File-level line-ending normalization must NOT collapse intentional newlines inside multi-line cells (addresses, notes columns). But the embedded line endings *should still* be normalized to LF for consistency.

**Expected behavior:**
- File-level line endings: LF.
- Embedded CRLF inside a quoted cell: normalized to LF.
- Embedded CR inside a quoted cell: normalized to LF.
- Cell stays multi-line; the newline character count inside the cell is preserved.

**Why it matters:** an address column with `123 Main St\r\nApt 4B\r\nNew York` is the canonical legitimate multi-line cell. A naive `text.replace('\r\n', '\n')` works correctly. A naive `text.split('\n')` to "remove blank lines" destroys the address. The cleaner must understand CSV quoting.
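A quoting-aware sketch using the stdlib `csv` module, which parses quoted embedded newlines correctly; the function name is illustrative:

```python
import csv
import io

def normalize_cell_newlines(raw: str) -> str:
    # Parse with CSV quoting awareness, normalize line endings *inside*
    # each cell, and re-serialize with LF row terminators. Fields that
    # still contain newlines are re-quoted automatically by the writer.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(raw)):
        writer.writerow(
            [c.replace("\r\n", "\n").replace("\r", "\n") for c in row])
    return out.getvalue()
```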

---

### 12 - Case operations (opt-in)

**Files**: input `12_case_variations.csv`; three expected outputs:
- `expected/12_case_variations__default.csv` (no flag - identity)
- `expected/12_case_variations__email_lower.csv` (`--case email=lower`)
- `expected/12_case_variations__name_title.csv` (`--case name=title`)

Default behavior is **preserve case**. Case operations are opt-in per column because:

- Lowercasing emails is almost always right (the domain is case-insensitive per RFC 5321; the local part is case-sensitive in theory but almost never in practice).
- Title-casing names is almost always right (`ALICE SMITH` -> `Alice Smith`), but must handle apostrophes correctly (`O'Connor` -> `O'Connor`, not `O'connor`).
- Lowercasing product codes is almost always WRONG (`SKU-A1B2` is a code, not prose).

So the tool offers per-column case ops, never a global one. The expected outputs cover the two most common configurations.

**Tricky case to verify:** row 4 name `DAN O'CONNOR`. Under `--case name=title` this must become `Dan O'Connor`, not `Dan O'connor`. Python's `string.capwords()` gets this wrong (it lowercases everything after the first letter of each space-separated word, producing `O'connor`), and `str.title()` is only accidentally right here: it uppercases after *every* apostrophe, so possessives like `BILL'S` become `Bill'S`. Implementations should use a regex that respects apostrophes inside words.

**Why it matters:** dedup quality (case 01 in the deduplicator) depends on consistent case in the comparison columns. Buyers running 02 before 01 expect this to "just work" for the email column.
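One possible heuristic for the apostrophe problem, shown as a sketch (it starts from `str.title()` and patches back the single-letter-after-apostrophe case, which covers possessives; real name data has further edge cases like `McDonald`):

```python
import re

def title_name(name: str) -> str:
    # str.title() uppercases after every apostrophe: right for
    # O'CONNOR -> O'Connor, wrong for BILL'S -> Bill'S. Lowercase a
    # lone capital that follows an apostrophe at a word boundary.
    return re.sub(r"'([A-Z])\b",
                  lambda m: "'" + m.group(1).lower(),
                  name.title())
```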

---

### 13 - Non-Latin scripts and emoji (preservation negative test)

**File**: `test_data/13_non_latin_scripts.csv` -> `expected/13_non_latin_scripts.csv`

Negative test: cleaning must not damage characters outside the Latin/punctuation block. Trim and NFC still apply (row 1 has leading and trailing space, which gets trimmed).

Coverage: Chinese (Beijing), Japanese (katakana test), Arabic RTL, Cyrillic Russian, multi-codepoint emoji (party popper U+1F389, rocket U+1F680), accent + emoji combo (`café ☕`).

**Expected behavior:** only whitespace and NFC normalization apply. All script-significant characters preserved exactly.

**Why it matters:** the cleaner must be safe on international buyer data. Stripping "weird-looking" characters because they're outside ASCII is a textbook bug. Emoji in particular are in the supplementary planes (above U+FFFF) and naive byte-level filters often mangle them.

---

### 14 - Mojibake

**Files**: input `14_mojibake.csv`; two expected outputs:
- `expected/14_mojibake__default.csv` (no flag - bytes preserved, warning logged)
- `expected/14_mojibake__fixed.csv` (`--fix-mojibake` - heuristic repair)

Mojibake is the result of UTF-8 bytes being interpreted as cp1252 or Latin-1 and re-saved as UTF-8. Classic patterns:

- `café` becomes `cafÃ©`
- `München` becomes `MÃ¼nchen`
- `naïve` becomes `naÃ¯ve`
- The smart apostrophe in `don’t` becomes `donâ€™t`

**Default behavior: warn, do NOT auto-fix.** Reasoning: mojibake repair is heuristic, and the heuristic can false-positive on legitimate strings that happen to contain `Ã` followed by another Latin-1 character. The right call for a tool sold to non-experts is to flag the suspicious pattern in the log and let the user opt in.

**With `--fix-mojibake` (uses ftfy or equivalent):** repair attempted. The expected output shows fully repaired text including the smart-apostrophe-via-cp1252 case, which ftfy specifically handles.

**Why it matters:** mojibake is silent corruption. The customer doesn't know it happened until a name shows up wrong on a printed invoice. Flagging it is the responsible default.
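The warn-by-default path only needs detection, not repair. A sketch of one plausible heuristic (the pattern and name are illustrative; the opt-in repair would call `ftfy.fix_text` instead):

```python
import re

# Pairs like "Ã" + a Latin-1 character, or "â€", are rare in clean text
# but are exactly what UTF-8-read-as-cp1252/Latin-1 produces. Heuristic
# only; false positives are possible, which is why repair stays opt-in.
_MOJIBAKE_HINT = re.compile(r"[\u00c2\u00c3][\u0080-\u00ff]|\u00e2\u20ac")

def looks_like_mojibake(cell: str) -> bool:
    return bool(_MOJIBAKE_HINT.search(cell))
```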

---

### 15 - Whitespace-only cells (the 02-vs-04 boundary)

**File**: `test_data/15_whitespace_only_cells.csv` -> `expected/15_whitespace_only_cells.csv`

Per TECHNICAL.md Section 9.3: 02 trims whitespace first, leaving an empty string. Script 04 then detects empty strings as disguised null. So 02's job in this file is to convert `"   "`, `"\t\t"`, `"\u00A0\u00A0"`, and mixed-whitespace cells all into `""`.

**What 02 does NOT do here:**
- Does not decide whether the cell is "missing." That's 04's call.
- Does not write `NaN` or `N/A` or any other sentinel. Just produces empty string.
- Does not drop the row. Schema is invariant.

**Expected behavior:** every whitespace-only cell becomes empty. Row count unchanged. Headers untouched.

**Why it matters:** this is the single most-relitigated boundary in the bundle. Documenting it via fixture prevents drift.

---

### 16 - Dirty headers

**File**: `test_data/16_dirty_headers.csv` -> `expected/16_dirty_headers.csv`

Headers themselves are subject to all the same pollution as data cells. A header `\u00A0Email\u00A0` (NBSP-padded) breaks `df["Email"]` lookups because the actual column name carries invisible padding. A smart-quoted header `\u201cEmail\u201d` is even worse.

**Expected behavior:** headers cleaned by the same rules as data. Note that the smart-quoted header `"Email"` (with surrounding quotes) becomes a header value containing literal ASCII double quotes, which then requires CSV-quoting in the output. The expected file is written with proper CSV escaping.

**Why it matters:** broken column names break every downstream join, every selectbox in the GUI, and every CLI flag that takes a column name. Cleaning headers is non-negotiable.

---

### 17 - Preserve-intended (negative tests)

**File**: `test_data/17_preserve_intended.csv` -> `expected/17_preserve_intended.csv`

The negative-test file. Things 02 must NOT touch because they belong to other scripts:

| Cell content | What 02 does | What 02 does NOT do |
|---|---|---|
| `  100  ` | Trims to `100` | Doesn't reformat as `$100.00` (that's 03) |
| `1 234` | Preserves as `1 234` | Doesn't collapse internal space (looks numeric, European thousand-sep) |
| `$1,500.00` | Trims outer whitespace | Doesn't reformat currency (that's 03) |
| `2024-01-15` | Trims outer whitespace | Doesn't reformat date (that's 03) |
| `(555) 123-4567` | Trims outer whitespace | Doesn't reformat phone (that's 03); does not collapse internal space |
| `+1 555 123 4567` | Trims outer whitespace | Same; phone-shaped, leave internal spacing alone |
| `N/A` | Trims to `N/A` | Doesn't replace with empty or NaN (that's 04) |
| `nan` | Trims to `nan` | Doesn't replace with empty or NaN (that's 04) |

The internal-whitespace heuristic: if a cell parses as numeric, looks like a date, or matches a phone-shape regex (digits + common separators), do NOT collapse internal whitespace. Only collapse in cells classified as free text. This requires a per-cell check; document it in the implementation.
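A sketch of that per-cell check (the separator set is illustrative and would need tuning against real data):

```python
import re

# Cells made only of digits plus common numeric/date/currency/phone
# separators are not free text, so internal whitespace is left alone.
_NOT_FREE_TEXT = re.compile(r"[\d\s.,+()\-/:$%]+")

def collapse_internal_ws(cell: str) -> str:
    if _NOT_FREE_TEXT.fullmatch(cell):
        return cell
    # [^\S\n] is "whitespace except LF", preserving embedded newlines.
    return re.sub(r"[^\S\n]+", " ", cell)
```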

**Why it matters:** scope discipline. If 02 starts reformatting dates because "while we're trimming whitespace anyway", it stops being 02 and starts being a worse 03. The DECISIONS.md Section 4a rule (functional scope) cuts the other way too: 02 must not reach into other scripts' territory.

---

### 18 - Empty file

**File**: `test_data/18_empty_file.csv` (zero bytes) -> `expected/18_empty_file.csv` (zero bytes)

**Expected behavior:** graceful no-op. Either produces an empty output file with a logged warning, or emits a clean error message naming the problem ("Input file is empty"). What it MUST NOT do: crash with `pandas.errors.EmptyDataError` traceback in the GUI.

**Why it matters:** error UX standard from DECISIONS.md Section 4b - errors that name the problem and the fix, not stack traces.

---

### 19 - Headers only (no data rows)

**File**: `test_data/19_headers_only.csv` -> `expected/19_headers_only.csv`

Just headers, no data. Headers themselves are dirty (whitespace + NBSP + ZWSP).

**Expected behavior:** headers cleaned, output is clean headers + no data rows. No crash, no warning required (it's a legitimate state).

**Why it matters:** template files often look like this. The buyer might be cleaning a template before populating it. Don't punish them for it.

---

### 20 - Kitchen sink (integration)

**File**: `test_data/20_kitchen_sink.csv` -> `expected/20_kitchen_sink.csv`

The integration test. Combines:

- UTF-8 BOM at file start.
- CRLF line endings throughout.
- Headers with leading/trailing space, NBSP, smart quotes, ZWSP.
- Data cells with NBSP, internal multi-space, smart quotes, em-dash, ellipsis, primes (foot/inch markers).
- A whitespace-only cell that should become empty.
- Multiplication sign (preserved).

**Expected output:** every transformation applied correctly, schema unchanged, file written as UTF-8 (no BOM) with LF line endings.

**Why it matters:** this is the one fixture that catches transformation-order bugs. If smart-quote replacement runs before whitespace trim, you get different output than the other order. Picking and locking the order is part of the implementation; the fixture verifies it.

**Recommended transformation pipeline order** (informative, not normative):

1. Decode bytes -> strip BOM at file level.
2. Normalize file-level line endings -> LF.
3. Parse CSV (with proper quoting for embedded newlines).
4. Per cell, in order:
   a. Unicode NFC normalize.
   b. Strip zero-width and control characters.
   c. Strip BOM if it appears mid-cell.
   d. Smart-quote ASCII-fy.
   e. Normalize embedded line endings to LF.
   f. Whitespace trim (outer).
   g. Internal whitespace collapse (text columns only - check after trim).
   h. Per-column case op (if configured).
5. Headers go through the same per-cell pipeline.
6. Write as UTF-8, LF line endings, no BOM.
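The per-cell sequence (4a-4h) composes into roughly this shape. A sketch under the default config, with minimal stand-in tables so the block is self-contained; step (h) is omitted since case ops are opt-in:

```python
import re
import unicodedata

INVISIBLES = dict.fromkeys(
    map(ord, "\u200b\u200c\u200d\u200e\u200f\u00ad\u2060\ufeff"), None)
SMART = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'",
                       "\u2019": "'", "\u2014": "-", "\u2013": "-",
                       "\u2026": "..."})

def clean_cell(cell: str, collapse: bool = True) -> str:
    cell = unicodedata.normalize("NFC", cell)      # (a)
    cell = cell.translate(INVISIBLES)              # (b) + (c): FEFF mid-cell
    cell = cell.translate(SMART)                   # (d)
    cell = re.sub(r"\r\n?", "\n", cell)            # (e)
    cell = cell.strip()                            # (f)
    if collapse:                                   # (g) text columns only
        cell = re.sub(r"[^\S\n]+", " ", cell)      # keep embedded LFs
    return cell
```

Running the steps in this order means trim (f) sees whatever the earlier steps exposed - e.g. a cell that was all smart-punctuation debris plus whitespace ends up empty, as case 15 requires.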

---

### 21 - Excel pollution (multi-sheet XLSX)

**File**: `test_data/21_excel_pollution.xlsx` (no expected file - manual / programmatic verification per sheet)

Four sheets, each isolating an Excel-specific concern:

**Sheet `Customers`** - dirty headers (NBSP, smart quotes, ZWSP) and dirty data cells (NBSP padding, tab padding, smart apostrophe in `O'Connor`, em-dash). One whitespace-only `name` cell to verify the 02/04 boundary applies on XLSX too.

**Sheet `Notes`** - multi-line cells from Alt+Enter (LF inside cell), plus a cell with mixed CRLF inside (from someone pasting Windows text into Excel). Cells have wrap_text formatting set so the line breaks render in Excel. After cleaning, all in-cell line breaks should be LF.

**Sheet `International`** - non-Latin scripts and emoji with surrounding whitespace. Verifies the preservation contract from case 13 holds for XLSX.

**Sheet `ForceText`** - leading-zero IDs (e.g., `0001234`). These must not be stripped of leading zeros (that's not 02's job - it doesn't change semantic content). Row 3 has a leaked apostrophe (`'9999999`) from a force-text cell - this is a judgment call but the default is to preserve it; trying to detect "leaked apostrophe" is too error-prone.

**Why it matters:** XLSX has pollution patterns that don't appear in CSV (Alt+Enter cells, force-text apostrophes, sheet structure). The XLSX reader path needs the same cleaning logic as the CSV reader path; this fixture verifies that.

---

## 5. What this corpus does NOT cover

Listed so the gap is explicit, not hidden:

1. **Encoding detection** (cp1252 input, Latin-1 input, UTF-16). That's the I/O layer's job, not 02's transformation logic. Once the reader produces a Python `str`, 02 operates the same regardless of source encoding. Add I/O-layer fixtures separately when that layer is built.
2. **Performance / large files**. No multi-GB fixture is included because it bloats the repo. Add a benchmark (not a unit test) targeting a 500MB CSV; verify processing completes without OOM via chunked reads.
3. **Streamlit UI behavior**. The fixtures verify cleaning correctness; verifying the GUI shows the right preview, applies the right defaults, and renders cleaning in the diff view is a separate test layer (probably manual, possibly Playwright).
4. **Concurrency / file-locking** (e.g., user has the input file open in Excel). Expected to fail with a clean error, not corrupt data. Add a manual test, not a fixture.
5. **CLI argument parsing** for the various flags. Each flag should have a Typer-level test, separate from the fixtures here.

---

## 6. How to use this corpus

### As a build target
Each fixture is one piece of the spec. Implement the cleaner against fixture 01, run, diff, fix, repeat. Move to 02. By the time fixture 20 passes, the script is done.

### As pytest fixtures
```python
import pytest
from pathlib import Path
from src.core.text_cleaner import clean_csv

CORPUS = Path("tests/corpus")  # wherever this folder lands

FIXTURES = [
    "01_whitespace_basic",
    "02_whitespace_unicode",
    "03_smart_punctuation",
    "04_unicode_forms",
    "05_zero_width_invisible",
    "06_control_characters",
    "07_bom_utf8",
    "08_line_endings_crlf",
    "09_line_endings_cr",
    "10_line_endings_mixed",
    "11_embedded_newlines",
    "13_non_latin_scripts",
    "15_whitespace_only_cells",
    "16_dirty_headers",
    "17_preserve_intended",
    "18_empty_file",
    "19_headers_only",
    "20_kitchen_sink",
]

@pytest.mark.parametrize("name", FIXTURES)
def test_default_config(name, tmp_path):
    inp = CORPUS / "test_data" / f"{name}.csv"
    expected = (CORPUS / "expected" / f"{name}.csv").read_bytes()
    out = tmp_path / "out.csv"
    clean_csv(inp, out)  # default config
    assert out.read_bytes() == expected

# Cases 12 and 14 have multiple expected files; parametrize them separately
# with the relevant flags.

# Idempotency property test - applies to every fixture:
@pytest.mark.parametrize("name", FIXTURES)
def test_idempotent(name, tmp_path):
    inp = CORPUS / "test_data" / f"{name}.csv"
    out1 = tmp_path / "out1.csv"
    out2 = tmp_path / "out2.csv"
    clean_csv(inp, out1)
    clean_csv(out1, out2)
    assert out1.read_bytes() == out2.read_bytes()
```

### Regenerating fixtures
If a default policy changes (e.g., switch the default Unicode form from NFC to NFKC, which would be a meaningful policy decision), the fixtures in `expected/` need regenerating. Edit `generate_test_data.py` and re-run. Document the policy change in DECISIONS.md before doing this.
