# QUOTE-CASES.md - Quote Variants and Malformed CSV Test Cases

**Version**: 1.0
**Last updated**: April 29, 2026
**Companion to**: `TEST-CASES.md` (extends it with cases 22 and 23)

This document covers two complementary fixtures that test different things:

| File | Tests |
|---|---|
| `test_data/22_quote_variants.csv` | The cleaner's *character transformation* logic. Every cell is in well-formed CSV; the question is what each Unicode quote-mark variant becomes after cleaning. |
| `test_data/23_csv_malformed.csv` | The cleaner's *parser robustness*. The CSV structure itself is broken in 24 different ways (plus 3 cascade-destructive cases). The question is whether the cleaner reads it gracefully, errors usefully, or crashes. |

---

## 1. Why two files instead of one

Combining quote characters and structural malformations into one file conflates two different test concerns. The cleaner's character-transformation logic is exercised by the *contents* of cells; the parser-robustness logic is exercised by the *structure* of the file. Mixing them means a failure in one can mask a failure in the other.

Concretely: if file 22 fails to produce expected output, the bug is almost certainly in the cleaning rules (NFC, smart-quote folding, etc.). If file 23 produces unexpected output, the bug is in the parser layer or the error-handling policy. Different layers, different fixes.

---

## 2. File 22: `22_quote_variants.csv`

**Structure**: 5-column well-formed CSV.

```
case_id,category,char_name,codepoint,payload
```

37 rows total: 1 header + 36 cases. All rows parse to exactly 5 fields under any compliant CSV parser.
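This field-count invariant is cheap to verify with the standard library alone. A minimal sketch, run here against an inline sample in the same shape (a real check would read `22_quote_variants.csv` instead):

```python
import csv
import io

# Inline sample in the shape of 22_quote_variants.csv (5 columns).
sample = (
    'case_id,category,char_name,codepoint,payload\n'
    'Q01,ascii,QUOTATION MARK,U+0022,say "hello"\n'
    'Q03,curly,LEFT DOUBLE QUOTATION MARK,U+201C,\u201chello\u201d\n'
)

def bad_rows(text, expected=5):
    """Return (row_number, row) for every row whose field count is wrong."""
    return [
        (i, row)
        for i, row in enumerate(csv.reader(io.StringIO(text)), start=1)
        if len(row) != expected
    ]

print(bad_rows(sample))  # [] -- every row parses to exactly 5 fields
```

Note that the mid-field ASCII quote in Q01's payload does not disturb the count: `csv.reader` in its default (non-strict) mode treats a quote inside an already-started unquoted field as data.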

### 2.1 Categories covered

| Category | Cases | Characters | What it tests |
|---|---|---|---|
| ascii | Q01-Q02 | `"` `'` | Baseline / negative control. Should pass through unchanged. |
| curly | Q03-Q08 | `\u201C` `\u201D` `\u2018` `\u2019` | Word/Outlook autocorrect output. Default cleaner policy: fold to ASCII. |
| low | Q09-Q13 | `\u201E` `\u201A` `\u201F` `\u201B` | German, Czech, Polish typography. Default policy: fold to ASCII. |
| guillemet | Q14-Q18 | `\u00AB` `\u00BB` `\u2039` `\u203A` | French and Russian quotation. Default policy: fold to ASCII `"` and `'`. |
| fullwidth | Q19-Q20 | `\uFF02` `\uFF07` | CJK fullwidth. Default policy: fold to ASCII under NFKC mode; preserve under NFC. |
| cjk | Q21 | `\u300C` `\u300D` | Japanese corner brackets. Default policy: PRESERVE (these are punctuation, not quote-folding territory). |
| prime | Q22-Q24 | `\u2032` `\u2033` | Foot/inch and minute/second markers. Default policy: fold to ASCII `'` and `"`. |
| heavy | Q25-Q27 | `\u275B` `\u275C` `\u275D` `\u275E` | Decorative quotes. Default policy: fold to ASCII. |
| modifier | Q28-Q30 | `\u02BC` `\u02B9` `\u02BA` | Modifier letters that LOOK like quotes but are letters. **Default policy: PRESERVE.** Folding `\u02BC` (modifier letter apostrophe, a near-twin of the Hawaiian okina `\u02BB`) destroys real linguistic meaning. |
| mixed | Q31-Q36 | combinations | Real-world chaos: nested, asymmetric, partial, all-three-styles-in-one-cell. |
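The NFC-vs-NFKC claim in the fullwidth row is easy to confirm with the standard library. A minimal illustration (not the cleaner's actual code):

```python
import unicodedata

fullwidth = '\uff02\uff07'  # FULLWIDTH QUOTATION MARK, FULLWIDTH APOSTROPHE

# NFKC folds fullwidth compatibility characters to ASCII...
print(unicodedata.normalize('NFKC', fullwidth))              # "'
# ...while NFC leaves them untouched.
print(unicodedata.normalize('NFC', fullwidth) == fullwidth)  # True

# Curly quotes (Q03-Q08) have no compatibility decomposition, so even NFKC
# keeps them -- folding those requires an explicit character map.
curly = '\u201chi\u201d'
print(unicodedata.normalize('NFKC', curly) == curly)         # True
```

This is why the fullwidth category folds only under NFKC mode, while the curly, low, and guillemet categories need the cleaner's own folding table regardless of normalization form.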

### 2.2 The hard policy decisions

Three cases force a decision the cleaner must make explicitly:

**Q21 — CJK corner brackets `\u300C` `\u300D`**: these are punctuation marks in Japanese text, semantically equivalent to quotation marks but NOT something an ASCII cleaner should fold. The default policy preserves them. A `--aggressive-fold` mode could fold them, but it's destructive.

**Q28 — modifier letter apostrophe `\u02BC`** (a near-twin of the Hawaiian okina, `\u02BB`): looks like an apostrophe. Is a real letter. Folding `Hawai\u02BBi` to `Hawai'i` is wrong by the rules of Hawaiian orthography, and the same reasoning protects `\u02BC`. **Preserve.**

**Q33, Q34 — asymmetric curly quotes** (open curly + close ASCII, or vice versa): the cleaner's choices are (a) fold both to ASCII (loses information that the input was malformed), (b) preserve both (does nothing useful), or (c) fold and log a warning. Recommended: option (c).
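These decisions imply a preserve-by-omission design: put only the foldable characters in the translation table and let everything else pass through. A minimal sketch (the map contents are illustrative, not the cleaner's actual table):

```python
# Characters absent from the map pass through str.translate unchanged, which
# is how the PRESERVE cases (Q21 corner brackets, Q28-Q30 modifier letters)
# fall out naturally: leave them out of the map.
FOLD_MAP = str.maketrans({
    '\u201c': '"', '\u201d': '"',   # curly double
    '\u2018': "'", '\u2019': "'",   # curly single
    '\u00ab': '"', '\u00bb': '"',   # guillemets
    '\u2032': "'", '\u2033': '"',   # prime, double prime
})

def fold_quotes(text: str) -> str:
    return text.translate(FOLD_MAP)

print(fold_quotes('\u201cHe said \u2018hi\u2019\u201d'))  # "He said 'hi'"
print(fold_quotes('Hawai\u02bbi'))      # unchanged: the okina is not in the map
print(fold_quotes('\u300cquote\u300d')) # unchanged: corner brackets preserved
```

Option (c) for the asymmetric cases would sit on top of this: compare input against output and log a warning whenever only one half of a quote pair changed.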

### 2.3 Suggested workflow for analyzing this file

1. Run the cleaner on `22_quote_variants.csv` with default settings.
2. Diff the output's `payload` column against the input's `payload` column.
3. For each row where the payload changed, verify the change matches the documented policy for that category.
4. For each row where the payload did NOT change, verify preservation was the intended policy (Q21, Q28, etc.).
5. Toggle `--aggressive-fold` (or whatever flag the cleaner uses for NFKC + CJK folding) and re-run; verify Q19, Q20, Q21 now fold and Q28-Q30 still don't.
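Steps 2-4 of this workflow can be sketched with the standard library, shown here against inline before/after data in the same shape (a real run would read the fixture and the cleaner's output file). One subtlety: an ASCII `"` produced by folding must itself be CSV-escaped in the output file:

```python
import csv
import io

before = (
    'case_id,category,char_name,codepoint,payload\n'
    'Q03,curly,X,U+201C,\u201chi\u201d\n'
    'Q28,modifier,X,U+02BC,Hawai\u02bci\n'
)
after = (
    'case_id,category,char_name,codepoint,payload\n'
    'Q03,curly,X,U+201C,"""hi"""\n'          # folded to ASCII, CSV-escaped
    'Q28,modifier,X,U+02BC,Hawai\u02bci\n'   # preserved
)

def payloads(text):
    rows = list(csv.reader(io.StringIO(text)))
    return {row[0]: row[4] for row in rows[1:]}  # case_id -> payload

changed = {c for c, p in payloads(before).items() if payloads(after)[c] != p}
print(sorted(changed))                          # ['Q03']
print(sorted(set(payloads(before)) - changed))  # ['Q28']
```

Each `case_id` in `changed` is then checked against the policy column of the table in 2.1; each preserved one against the intended-preservation list.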

---

## 3. File 23: `23_csv_malformed.csv`

**Structure**: 2-column nominal header (`case_id,payload`), but by design most rows do NOT actually parse to 2 fields. The malformations themselves break field counts, quoting, or row termination.

**Layout** (29 rows; the physical line count differs slightly because some malformations embed newlines):
- Row 1: header
- Rows 2-25: 24 bounded malformations (M01-M24). Each row's malformation is contained: it disturbs that row's parsing and cascades at most one row beyond it (only M21 and M24; see the table below).
- Row 26: BANNER_DANGER_ZONE marker row (well-formed; just a visual divider).
- Rows 27-29: 3 cascade-destructive cases (M90, M91, M92). These can swallow rows after themselves; they sit at the end so the cascade only consumes the destructive section itself.

### 3.1 The bounded malformations (M01-M24)

| Case | Malformation | Field count when parsed | Notes |
|---|---|---|---|
| M01 | Unquoted comma in cell | 5 (expected 2) | Simplest case. `Smith, John` becomes 2 fields. |
| M02 | Quoted comma in cell | 4 | **Well-formed control.** Should parse without error. |
| M03 | Stray ASCII `"` mid-unquoted-cell | 4 | Many parsers tolerate, some choke on the unbalanced `"`. |
| M04 | Stray curly quotes mid-unquoted-cell | 4 | Curly quotes are data to ASCII parsers; comma still splits. |
| M05 | Quoted cell with UNESCAPED inner `"` | varies | **Pandas typically SKIPS this with a warning.** |
| M06 | Quoted cell with properly escaped inner `""` | 4 | **Well-formed control.** |
| M07 | Backslash-escaped inner quotes | varies | RFC 4180 doesn't recognize `\"`. Pandas typically SKIPS. |
| M08 | Cell wrapped in single quotes `'value'` | 4 | Apostrophes are data, not quotes, in standard CSV. |
| M09 | Cell wrapped in CURLY quotes | 5 | Curly quotes are data; the embedded comma still splits. |
| M10 | ASCII open + curly close | varies | Pandas SKIPS — the open `"` never finds an ASCII close. |
| M11 | Whitespace outside quotes | 4 | Pandas tolerates; csv.reader strict mode rejects. |
| M12 | Empty quoted cell | 4 | **Well-formed control.** |
| M13 | Whitespace-only quoted cell | 4 | Whitespace must be preserved (it's quoted). |
| M14 | Apostrophe in name (`O'Connor`) | 4 | **Negative control.** `'` is not a quote char in CSV. |
| M15 | Excel force-text leading apostrophe | 4 | csv.reader leaves the `'` as data; Excel itself strips on read. |
| M16 | Triple ASCII quotes `"""value"""` | 4 | Parses as `"` (escaped to `""`) + value + `"` — survives. |
| M17 | Quadruple ASCII quotes `""""value""""` | varies | Pandas SKIPS this one. |
| M18 | Row has 5 fields where header expects 2 | 5 | Field-count overflow. |
| M19 | Empty payload (just `case_id,`) | 2 | Trailing empty field. |
| M20 | Quoted multi-line cell (LF inside) | 4 | **Well-formed control.** Row physically spans 2 lines in the file. |
| M21 | Bare LF in UNQUOTED cell | varies | Strict csv.reader rejects it; otherwise the LF is treated as a row terminator. Cascades by ONE row. |
| M22 | Curly outer + ASCII inner + comma | 5 | Word-paste pattern. Curly is data; ASCII inner triggers quoting state mid-cell. |
| M23 | Tab character in unquoted cell | 4 | **Negative control.** Tab is data, not delimiter. |
| M24 | Bare CR in unquoted cell | error | csv.reader raises; pandas typically eats one row of cascade. |
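The M05 row illustrates the tolerate-vs-reject split concretely. Python's own `csv.reader` exhibits both behaviors depending on the `strict` dialect flag (pandas' engines behave differently again):

```python
import csv

# An M05-style row: a quoted cell with an UNESCAPED inner quote.
line = 'M05,"value with "inner" quotes",x'

# Default (tolerant) mode: the stray quotes are absorbed into the field.
row = next(csv.reader([line]))
print(len(row), row[0])  # 3 M05

# Strict mode: the same row raises csv.Error -- one way to "error usefully".
try:
    next(csv.reader([line], strict=True))
except csv.Error as exc:
    print('rejected:', exc)
```

The strict error fires at the first character after a closing quote that is neither the delimiter nor end-of-line, which is exactly the condition M05, M07, M10, and M17 violate.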

### 3.2 The cascade-destructive cases (M90-M92)

These are the real torture tests. Each is in its own row at the END of the file so the cascade can only damage the destructive cases themselves, not the bounded section above.

| Case | Malformation | Cascade behavior |
|---|---|---|
| M90 | Opening `"` with no close anywhere on the same line | Consumes characters until it finds another `"` or EOF. In this file's layout, M90 may swallow M91. |
| M91 | Opens with internal newline, never closes | Multi-line cascade. May swallow M92. |
| M92 | File ends mid-quoted-cell, no trailing newline | Pandas raises "unexpected end of data" warning; the cell content is lost. |
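The cascade mechanics can be reproduced in miniature with `csv.reader`, which (unlike pandas) silently returns the partial record when EOF arrives inside an open quote:

```python
import csv

# An M90-style unclosed quote swallows the next line into its own cell.
lines = ['M90,"never closed', 'M91,next row']
rows = list(csv.reader(lines))

print(len(rows))            # 1 -- M91's line was consumed
print(rows[0][0])           # M90
print('M91' in rows[0][1])  # True -- the next row survives only as payload
```

The same mechanism is behind the M92 outcome; the difference between parsers is only the policy applied when EOF is reached mid-quote (return partial, warn, or raise).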

### 3.3 Observed pandas behavior on this file

For reference, `pandas.read_csv(..., on_bad_lines='skip', engine='python')` produces these specific outcomes (verified during fixture generation):

- **Skipped with warning**: M05, M07, M10, M17, M90
- **Cascaded (case_id of next row appears as data in this row's columns)**: M21, M24
- **Multi-line consumption**: M91 swallows M92's line
- **Final position**: 23 rows survive (out of 24 bounded + banner + 3 destructive = 28 expected)

The exact list of skipped rows is parser-version-specific. Verify against the cleaner's actual parser layer; do not hard-code the expected skip list.

### 3.4 Suggested workflow for analyzing this file

1. Run the cleaner on `23_csv_malformed.csv` with default error policy.
2. Inspect: did the cleaner crash, warn, or silently skip?
3. Match each surviving row's `case_id` to the table in 3.1; verify the cleaner produced an output that matches the documented behavior for that case.
4. For the bounded section (M01-M24): verify NO bounded case caused a cascade into the next row. If the cleaner's parser is causing M21 or M24 to swallow the next row, that's a parser-policy choice worth documenting.
5. For the destructive section (M90-M92): verify the damage is contained. Nothing above the BANNER_DANGER_ZONE row should be affected by anything below it. If it is, the parser is reading destructively.
6. Toggle the cleaner's error policy (`strict` vs `skip` vs `warn`) and re-run. The bounded malformations should produce DIFFERENT outcomes under each policy; that's how you verify the policy flag is actually doing something.
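The three policies in step 6 can be sketched with per-line parsing, which also guarantees the bounded property checked in step 4 (the policy names and semantics here are assumptions, not the cleaner's actual flags):

```python
import csv
import sys

def clean_rows(lines, policy='warn'):
    """Parse each physical line independently so no bad row can cascade."""
    out = []
    for lineno, line in enumerate(lines, start=1):
        try:
            out.append(next(csv.reader([line], strict=True)))
        except csv.Error as exc:
            if policy == 'strict':
                raise
            if policy == 'warn':
                print(f'line {lineno}: skipped ({exc})', file=sys.stderr)
            # policy == 'skip': drop silently
    return out

lines = ['M02,"ok, fine"', 'M05,"bad "inner" quote"']
print(len(clean_rows(lines, policy='skip')))  # 1 -- the M05-style row dropped
```

The trade-off is visible immediately: per-line parsing would also break the legitimately quoted multi-line cell in M20, so a real cleaner needs record-level recovery rather than this naive physical-line split.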

---

## 4. What this corpus does NOT cover

Listed so the gaps are explicit:

1. **Different delimiters**: this corpus uses comma. Tab-delimited and pipe-delimited files have their own pathologies (and many of these malformations don't apply or apply differently). Add separate fixtures if non-comma delimiters are in scope.
2. **Different encodings**: every file here is UTF-8. cp1252, Latin-1, UTF-16 introduce orthogonal problems handled by the I/O layer, not the cleaning layer.
3. **Compressed/archived inputs**: `.csv.gz`, `.zip` containing CSV. Out of scope for the text cleaner per the script-boundary rules in TECHNICAL.md Section 9.
4. **Streaming/append patterns**: where the file is being written while being read, or where rows are appended over time. Not relevant for the script's batch-processing model.
5. **Adversarial inputs**: e.g., a deliberately crafted CSV designed to crash a specific parser (CSV injection / formula injection). Security territory, not data-cleaning territory.

---

## 5. How to extend the fixture

If a new quote variant or malformation pattern surfaces in customer data, add it via the existing generator:

```python
# In generate_quote_test_files.py, append to quote_rows or bounded_rows.
# Re-run: python generate_quote_test_files.py
```

Naming: continue the `Q##` and `M##` numbering. Keep `M9#` reserved for cascade-destructive cases. Update the tables in this document.

If a fundamentally new CATEGORY emerges (e.g., a Section J for some new quote-character family in a customer's regional data), add a new section to this document; don't bury it in `mixed`.
