# Text Cleaner Test Corpus

Test fixtures for `02_text_cleaner.py` (Excel & CSV Data Cleaning Mastery Bundle).

## Layout

```
text_cleaner_test_corpus/
├── README.md                            # This file
├── TEST-CASES.md                        # Full taxonomy for cases 01-21 (text cleaner 02)
├── QUOTE-CASES.md                       # Quote variants and malformed CSV (22, 23)
├── ENCODINGS-CASES.md                   # Code page / encoding tests (E01-E31)
├── FORMATS-CASES.md                     # Format standardizer 03 (dates, phones, emails, addresses, names, currencies)
├── generate_test_data.py                # Regenerates 20 CSV inputs and expected outputs
├── generate_xlsx.py                     # Regenerates the multi-sheet XLSX fixture
├── generate_quote_test_files.py         # Regenerates 22 (quote variants) and 23 (malformed)
├── generate_encoding_test_files.py      # Regenerates E01-E31 plus manifests
├── generate_format_test_files.py        # Regenerates 24-30 (format domains)
├── test_data/                           # Inputs
│   ├── 01-23 main fixtures
│   ├── 21_excel_pollution.xlsx
│   ├── encodings/                       # E01-E31 encoded test files + manifests
│   └── formats/                         # 24-30 format-standardizer fixtures
└── expected/                            # Expected outputs
    └── formats/                         # 24-30 format-standardizer expected outputs
```

## Quick start

Read `TEST-CASES.md` from top to bottom. Sections 1 (scope boundary) and 2 (assumed default config) are load-bearing: the per-test details in Section 4 don't make sense without them.

To regenerate the test files (e.g., after editing one of the generator scripts), run:

```bash
python generate_test_data.py
python generate_xlsx.py
python generate_quote_test_files.py
python generate_encoding_test_files.py
python generate_format_test_files.py
```
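
The same sequence can also be driven from Python, for example in a CI step. This is a minimal sketch and not part of the corpus: `GENERATORS` and `regenerate_all` are hypothetical names, and it assumes the scripts take no arguments and are run from the corpus root, as in the commands above.

```python
import subprocess
import sys
from pathlib import Path

# Generator scripts, in the order listed above.
GENERATORS = [
    "generate_test_data.py",
    "generate_xlsx.py",
    "generate_quote_test_files.py",
    "generate_encoding_test_files.py",
    "generate_format_test_files.py",
]


def regenerate_all(corpus_root=".", dry_run=False):
    """Run every generator from corpus_root, stopping at the first failure.

    Returns the list of commands. With dry_run=True nothing is executed.
    """
    root = Path(corpus_root)
    commands = [[sys.executable, name] for name in GENERATORS]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, cwd=root, check=True)
    return commands
```

Calling `regenerate_all()` from the corpus root runs all five scripts with the current interpreter. `regenerate_all(dry_run=True)` only returns the commands, which is handy for checking the order before anything is overwritten.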

To use the corpus as pytest fixtures, see Section 6 of `TEST-CASES.md`.
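
Section 6 of `TEST-CASES.md` remains the authority on how the fixtures are wired up. As a rough illustration only, the sketch below pairs each input under `test_data/` with a same-named file under `expected/`. The same-name convention, the `*.csv` glob, and the helper name `pair_fixtures` are assumptions. This README does not specify them.

```python
from pathlib import Path


def pair_fixtures(corpus_root, pattern="*.csv"):
    """Return sorted (input, expected) path pairs.

    Assumption: expected/ mirrors the file names in test_data/.
    Inputs without a matching expected file are skipped.
    """
    root = Path(corpus_root)
    pairs = []
    for src in sorted((root / "test_data").glob(pattern)):
        expected = root / "expected" / src.name
        if expected.is_file():
            pairs.append((src, expected))
    return pairs
```

The resulting list could be passed to `pytest.mark.parametrize`, with `ids=[src.stem for src, _ in pairs]` so that each fixture is reported as its own test case.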

## Coverage summary

| Category | Fixtures |
|---|---|
| Whitespace (ASCII + Unicode) | 01, 02 |
| Smart punctuation | 03 |
| Unicode normalization | 04 |
| Invisible / zero-width / control | 05, 06 |
| BOM | 07 |
| Line endings (file-level + embedded) | 08, 09, 10, 11 |
| Case operations (opt-in) | 12 |
| International script preservation | 13 |
| Mojibake | 14 |
| Boundary with script 04 (missing values) | 15 |
| Headers | 16, 19 |
| Negative tests (must NOT touch) | 17 |
| File-level edge cases | 18, 19 |
| Integration | 20 |
| Excel-specific (multi-sheet, Alt+Enter) | 21 |
| Quote-character variants (Word, slanted, low, guillemet, prime, etc.) | 22 |
| Malformed CSV structure (unquoted commas, unbalanced quotes, etc.) | 23 |
| Code pages / encodings (UTF-8/16, cp1252, Latin-1/2/9, Mac Roman, cp1250/1251, KOI8-R, Shift_JIS, GB18030, Big5, EUC-KR, pathological) | E01-E31 |
| Format standardizer 03 - dates (ISO/US/EU/Excel-serial/Unix/locale/edge) | FD01-FD45 |
| Format standardizer 03 - phones (US/intl/E.164/extensions/vanity/edge) | FP01-FP31 |
| Format standardizer 03 - emails (basic/Gmail/IDN/display-name/invalid) | FE01-FE31 |
| Format standardizer 03 - addresses (USPS/case/abbrev/multiline/non-US) | FA01-FA31 |
| Format standardizer 03 - names (case/Mc/O'/particles/titles/comma) | FN01-FN34 |
| Format standardizer 03 - currencies (US/EU/intl/negative/locale-ambig) | FC01-FC27 |
| Format standardizer 03 - cross-domain integration | FI01-FI05 |

## Out of scope

The following are out of scope, as documented in Section 5 of `TEST-CASES.md`: encoding detection, large-file performance, GUI behavior, file locking, and CLI argument parsing. Each needs its own test layer.