# TECHNICAL.md - Technical Design, Build Pipeline, Standards

> **Creator-only document. Do not ship to buyers.**

**Version**: 1.6
**Last updated**: April 28, 2026

---

## 1. Architecture Overview

- Standalone tools with **dual interface**: CLI and GUI, both wrapping the same core library.
- GUI framework: **Streamlit**. Runs as a local web server, opens in the buyer's default browser. No internet used.
- Python 3.11+ runtime (bundled into the installer; the buyer never installs Python).
- Modular code, one concern per script. Core logic is library code; CLI and GUI are thin front-ends.
- Cross-platform from day one: Windows, macOS, Linux.
- PyInstaller produces standalone executables per OS. Buyer never sees Python, pip, venvs, or PATH.
- No internet required at runtime.

**Why dual interface (locked v1.2)**: The primary buyer persona is non-technical and will not use a CLI. The GUI is therefore the primary surface and is required at v1, not deferred. The CLI is retained for power users, automation, scheduled jobs, and future scripted workflows. Both share a single core; neither has features the other lacks (except interactive review, which only makes sense in GUI).

**Why Streamlit (locked v1.3)**: Fastest build velocity, lowest maintenance burden per added feature, hosted browser demo deployable as a marketing asset, future SaaS optionality. Selected over CustomTkinter (maintenance inactive since Jan 2024), plain Tkinter (UX gap at this price tier), Flet (ecosystem too young), PySide6 (overkill), and NiceGUI (smaller community). Full rationale in DECISIONS.md Section 4c.

This is a major change from the original Inno-Setup-only, CLI-only design. Rationale chain:
1. Requiring a buyer to install Python before using the product is the largest source of install friction (solved by PyInstaller in v1.1).
2. Requiring a non-technical buyer to use a CLI is the second-largest source of refund risk (solved by dual interface in v1.2).
3. Betting the GUI on an unmaintained library is the largest hidden technical risk (solved by Streamlit choice in v1.3).

---

## 2. Standard Bundle Structure (source repo)

Every bundle follows this layout in source. Core logic is shared, CLI and GUI are thin front-ends.

```
bundle-name/
├── src/
│   ├── __init__.py
│   ├── core/                # Shared business logic. No UI code here.
│   │   ├── __init__.py
│   │   ├── dedup.py         # (example) the actual algorithm
│   │   └── io.py            # File I/O, encoding/delimiter detection, etc.
│   ├── cli.py               # Command-line interface (Typer). Thin wrapper over core.
│   └── gui/                 # Streamlit front-end. Thin wrapper over core.
│       ├── __init__.py
│       ├── app.py           # Main Streamlit entry point (st.set_page_config, layout)
│       ├── pages/           # Streamlit multi-page app (one page per script in the bundle)
│       │   ├── 1_Deduplicator.py
│       │   ├── 2_Text_Cleaner.py
│       │   ├── 3_Format_Standardizer.py
│       │   └── ...
│       └── components.py    # Reusable Streamlit widgets and helpers
├── data_examples/           # Sample input files
├── tests/                   # Unit tests (pytest). Tests target core, not UI.
├── build/
│   ├── pyinstaller.spec     # PyInstaller build spec (handles both CLI + GUI entry points)
│   ├── launcher.py          # Small launcher script: starts Streamlit server, opens browser
│   ├── windows/
│   │   └── installer.iss    # Inno Setup wrapper for Windows .exe installer
│   ├── macos/
│   │   ├── entitlements.plist
│   │   └── dmg_settings.py  # dmg-creation config
│   └── linux/
│       └── AppImage/        # AppImage build assets
├── demo/                    # Stripped-down version for hosted browser demo
│   └── streamlit_app.py     # Entry point for Streamlit Community Cloud deployment
├── requirements.txt
├── README_bundle.md         # User-facing guide (covers both CLI and GUI usage)
├── LICENSE
└── ci/
    └── build.yml            # GitHub Actions cross-platform build
```

**Core/UI separation rule**: A new feature is implemented in `core/` first, with tests. CLI and GUI both call into core. If a feature exists only in one front-end (e.g., interactive review only in GUI), the underlying capability still lives in core; only the presentation differs.

**Demo subfolder rule**: The `demo/` folder contains a constrained Streamlit app for public deployment to Streamlit Community Cloud. Constraints: row limit (e.g., 100 rows max output), no file save, watermark on output, sample dataset only or strict file-size cap. Same core library, different front-end constraints.

---

## 3. Cross-Platform Build Pipeline

### 3.1 Tooling

| Concern | Tool |
|---|---|
| Bundling Python + scripts into a standalone binary | PyInstaller |
| GUI framework | **Streamlit** |
| Browser launch from launcher | Python `webbrowser` module (stdlib) |
| CLI framework | Typer |
| Windows installer wrapper | Inno Setup (free) |
| macOS bundle format | `.app` packaged in `.dmg` |
| macOS code signing & notarization | `codesign` + `notarytool` (built into Xcode command line tools) |
| Linux distribution format | AppImage (primary) + plain tarball (fallback) |
| CI / automated builds | GitHub Actions (free tier handles all three OS runners) |
| Hosted demo | Streamlit Community Cloud (free) or $5/mo VPS |

### 3.2 Build Outputs (what the buyer downloads)

| Platform | Output file | Buyer experience |
|---|---|---|
| Windows | `BundleName-Setup-1.0.exe` | Double-click installer, click through wizard. Desktop shortcut "Launch Bundle" runs `launcher.py`, which starts the local Streamlit server and opens default browser to `http://localhost:8501`. CLI executables also installed and on PATH. |
| macOS | `BundleName-1.0.dmg` | Double-click DMG, drag app to Applications. Signed and notarized. Launching the app runs the launcher, which starts the local server and opens the browser. CLI binaries shipped in the app bundle. |
| Linux | `BundleName-1.0.AppImage` | Mark executable, double-click. AppImage runs the launcher, opens browser. Tarball fallback also includes CLI binaries. |

The **default buyer experience on every platform is**: double-click, browser opens, work done. The CLI is present, documented, and on PATH for users who want it.

**Browser-launch UX mitigation** (per DECISIONS.md Section 4c tradeoff): The launcher script displays a brief "Starting your data tool..." console message before opening the browser. The Streamlit app's first page includes a one-line note: *"This tool runs locally in your browser and does not use the internet."* Install email reinforces the same message.

### 3.3 PyInstaller Configuration

Single `.spec` file per bundle, parameterized for OS. Key settings:

- `--onefile` for Linux (single AppImage), `--onedir` for Windows and macOS (faster startup, easier signing).
- All dependencies bundled. No internet required at runtime.
- Hidden imports declared explicitly for pandas/openpyxl/Streamlit edge cases (PyInstaller's auto-detection misses some).
- Icon files per platform (`.ico` for Windows, `.icns` for macOS, `.png` for Linux).
- **Two entry points per bundle**: the GUI launcher (default, what the desktop shortcut runs) and the CLI binaries.
- **Streamlit-specific PyInstaller hooks**: include the `streamlit` data directory, the `altair` data directory (Streamlit dependency), and the `pyarrow` C extensions. Add a custom hook file (`hook-streamlit.py`) per the documented pattern. Budget 1-3 days the first time getting the spec right; reuse across all subsequent bundles.

### 3.4 Streamlit Launcher Pattern

The launcher script handles starting the local Streamlit server in a way that survives PyInstaller bundling. Conceptual outline:

1. Find a free local port (avoid hardcoding 8501 in case of conflict).
2. Set Streamlit environment variables: `STREAMLIT_SERVER_HEADLESS=true`, `STREAMLIT_BROWSER_GATHER_USAGE_STATS=false`, `STREAMLIT_SERVER_PORT={port}`.
3. Start Streamlit programmatically (via `streamlit.web.cli.main_run` or `bootstrap.run`) in a background thread.
4. Wait for the port to accept connections (poll with timeout).
5. Open the buyer's default browser to `http://localhost:{port}` via `webbrowser.open()`.
6. Keep the launcher process alive while the server runs. Detect server shutdown and exit cleanly.

Optional v1.1 enhancement: replace step 5 with a `pywebview` window that wraps the local server. Eliminates the "default browser opens" UX surprise. Adds a dependency and some packaging complexity. Defer until support tickets show the browser-launch is causing meaningful confusion.

### 3.5 macOS Signing & Notarization Pipeline

Required setup (one-time):
1. Enroll in Apple Developer Program ($99/yr - see BUSINESS.md Section 10).
2. Generate Developer ID Application certificate via Apple Developer portal.
3. Install certificate in macOS keychain on the build machine (or store as encrypted GitHub Actions secret for CI).
4. Generate an app-specific password for `notarytool`.

Build-time flow (automated):
1. PyInstaller produces unsigned `.app`.
2. `codesign --deep --force --options runtime --sign "Developer ID Application: [Your Name]" BundleName.app`
3. Package into `.dmg`.
4. Submit `.dmg` to Apple notary service: `xcrun notarytool submit BundleName.dmg --wait`.
5. Staple the notarization ticket: `xcrun stapler staple BundleName.dmg`.
6. Output is the final, distributable `.dmg`.

Buyers on macOS see no Gatekeeper warnings. Clean install.

### 3.6 Windows Pipeline

1. PyInstaller produces `BundleName/` folder with launcher `BundleName.exe` (which opens the GUI in browser) plus CLI binaries plus dependencies.
2. Inno Setup script wraps the folder into `BundleName-Setup-1.0.exe`.
3. Installer creates Start Menu entry, desktop shortcut (launches GUI), optional Add/Remove Programs entry, and adds CLI binaries to PATH.
4. Optional Windows code signing certificate (~$200-400/yr from a CA) eliminates SmartScreen warnings. **Not required at launch**; revisit if SmartScreen friction shows up in support tickets.

### 3.7 Linux Pipeline

1. PyInstaller produces single-file binaries per entry point.
2. Wrap in AppImage using `appimagetool` (free, well-documented). AppImage runs the launcher as the default target.
3. Provide a plain `.tar.gz` fallback for users on distributions where AppImage fails. Tarball includes both GUI launcher and CLI binaries plus a `run.sh`.
4. No signing required on Linux; users execute `chmod +x` then double-click or run.

### 3.8 CI Build Matrix

GitHub Actions builds all three platforms on tagged release:

```yaml
# Conceptual, full file lives in ci/build.yml
strategy:
  matrix:
    os: [windows-latest, macos-latest, ubuntu-latest]
```

Result: one git tag triggers three platform builds. Artifacts upload to GitHub Releases. Manual step: copy artifacts to Gumroad / Lemon Squeezy product page.

### 3.9 Hosted Demo Deployment

Separate from the desktop build pipeline. The `demo/streamlit_app.py` entry point is deployed to Streamlit Community Cloud:

1. Connect the GitHub repo to Streamlit Community Cloud (one-time).
2. Configure the app to deploy from the `demo/` folder, main branch.
3. Set deployment-time environment variables (e.g., row limits, watermark flag).
4. App is publicly accessible at a `*.streamlit.app` URL. Link from Gumroad landing page.
5. Optional: custom domain via CNAME (free with Streamlit Community Cloud as of last check; verify before locking).

If Streamlit Community Cloud is ever unsuitable (rate limits, bandwidth, branding requirements), fall back to a $5/mo VPS running the demo via Docker. Same `demo/streamlit_app.py`, different host.

---

## 4. Libraries

| Purpose | Library |
|---|---|
| GUI framework | Streamlit |
| CLI framework | Typer |
| Data manipulation | pandas, openpyxl, numpy |
| Fuzzy string matching | rapidfuzz |
| File encoding detection | charset-normalizer |
| Logging | loguru |
| Progress bars | tqdm (CLI), `st.progress` (GUI) |
| Validation | pydantic (optional) |
| Reports | ReportLab (PDF), pandas styling (Excel) |
| Optional native window wrap | pywebview (deferred to v1.1 if needed) |

`requirements.txt` (current bundle, v1.3):

```
streamlit>=1.30
pandas
openpyxl
numpy
typer
rapidfuzz
charset-normalizer
loguru
tqdm
reportlab
pyarrow  # Streamlit dependency, declare explicitly for PyInstaller clarity
altair   # Streamlit dependency, declare explicitly for PyInstaller clarity
```

---

## 5. Coding Standards

### 5.1 Code Standards

- PEP 8 + type hints on all public functions.
- Docstrings on every module and public function.
- `--help` output (CLI) that a non-technical user can act on.
- Graceful error handling with human-readable messages, not stack traces. Errors must name the problem AND the likely fix where possible (e.g., "Column 'email' not found. Available columns: name, phone. Did you mean 'phone'?").
- All file paths handled via `pathlib.Path`, never string concatenation. Cross-platform correctness depends on this.
- All file I/O explicitly UTF-8-aware: detect encoding on input (charset-normalizer), write UTF-8 on output. Windows defaults to cp1252 otherwise.
- No platform-specific shell calls. If absolutely needed, branch on `sys.platform`.
- Unit tests for core logic (pytest). Tests target `core/`, not UI front-ends. Tests run on all three OS runners in CI.
- Semantic versioning per bundle.
- **Core/UI separation**: never put business logic in `cli.py` or `gui/`. If a CLI command and a GUI button do "the same thing," they call the same function in `core/`.

### 5.2 UX Standards (GUI / Streamlit) - load-bearing per DECISIONS.md Section 4b

- **Works out of the box**: dropping a file into the Streamlit `st.file_uploader` must produce a useful result with zero configuration.
- **Sensible defaults visible everywhere**: every `st.selectbox`, `st.slider`, `st.checkbox` has a default, the default is shown, the user is not forced through a config screen on first run.
- **Progressive disclosure**: basic view shows file uploader + go button + results. Advanced options live in `st.expander("Advanced options")` panes.
- **Plain-English labels**: no technical jargon in primary UI. `help=` parameter on widgets carries technical detail for users who want it.
- **Dry-run / preview by default**: user sees what would change before any file is written. Original input is never modified.
- **Single-page completion**: basic task completes on a single Streamlit page. Multi-page apps are for separate scripts in the bundle, not for multi-step wizards within one script.
- **Identical core to CLI**: any capability available in CLI is available in GUI, and vice versa. The only legitimate divergence is interactive review (GUI-natural) and scripted/scheduled execution (CLI-natural).
- **Local-first messaging**: every GUI page includes the line *"This tool runs locally in your browser and does not use the internet"* in a small, persistent location (e.g., footer or sidebar).

### 5.3 Functional Scope Standard - load-bearing per DECISIONS.md Section 4a

- Each script ships with **complete coverage of the workflow it names**, including features available free elsewhere (e.g., exact-match dedup).
- Scope boundary is the workflow, not "things adjacent to the workflow." A deduplicator includes normalization, survivor selection, audit. It does not include format conversion or charting; those belong elsewhere in the bundle.

---

## 6. System Requirements

**For buyers (runtime)**:
- Windows: Windows 10 or 11, 64-bit.
- macOS: macOS 11 (Big Sur) or later, Apple Silicon or Intel.
- Linux: any glibc-based distribution from 2020 onward (Ubuntu 20.04+, Fedora 33+, etc.).
- A modern default browser (Chrome, Edge, Firefox, Safari from the last 3 years). Used to display the local GUI; no internet required.
- ~400-500 MB free disk space (Streamlit packaging is larger than alternatives; this is an accepted tradeoff per DECISIONS.md Section 4c).
- No internet required after install. No Python install required ever.

**For developers (you)**:
- Python 3.11+.
- PyInstaller, Inno Setup (Windows builds), Xcode command line tools (macOS builds).
- Apple Developer Program membership ($99/yr) for macOS distribution.
- Git + GitHub account (for CI builds and Streamlit Community Cloud deployment of demos).

---

## 7. Per-Bundle Technical Notes

| Bundle | Status | Tech notes |
|---|---|---|
| Data Cleaning Mastery | Lead, 1/9 scripts complete (CLI only; needs Streamlit GUI port) | Cleaning, dedup, text hygiene, standardization, missing-value handling, outlier detection, type coercion, reporting. Scripts 04 (missing values) and 06 (outliers) are deliberately separate concerns; 04 runs first to neutralize sentinel codes before 06 computes statistics (see Section 9). Script 02 (text cleaner) runs first in the pipeline to normalize whitespace and special characters before any other operation. v1.3 spec: Streamlit GUI required at launch, with hosted demo deployed to Streamlit Community Cloud. |
| Automated Business Reporting | Not started | Aggregation -> styled PDF/Excel output |
| Ecommerce Data Pipeline | Not started | Extract -> clean -> export workflow |
| Small Business Finance | Not started | Bookkeeping summaries, simple reconciliation |
| Marketing Public Data Aggregation | Not started | Public API + scraping with respect for robots.txt and ToS |
| AI Ecommerce Aggregation (Shopify Pet) | Not started | Optional LLM enhancement, requires API key from buyer |

---

## 8. Open Technical Decisions

GUI framework choice is now **closed** (Streamlit, locked v1.3 - see DECISIONS.md Section 4c).

Remaining open items:

- **pywebview wrap of Streamlit launcher**: Optional v1.1 enhancement to eliminate the "browser opens" UX surprise. Defer until support tickets show meaningful buyer confusion. Cost: extra dependency, more PyInstaller complexity. Benefit: native-window UX.
- **Windows code signing**: Currently unsigned. Revisit if SmartScreen warnings drive support volume. Cost: ~$200-400/yr.
- **Auto-update mechanism**: None at launch. Email-delivered version updates. Revisit at 100+ paying customers per bundle.
- **Demo deployment hosting**: Streamlit Community Cloud at launch (free). Migrate to $5/mo VPS if rate limits, bandwidth, or branding constraints become an issue.
- **Code obfuscation**: Currently relying on license text + PyInstaller bundling. Decompilation is possible but unlikely for $49-79 products. Accept the risk.
- **Telemetry**: None. Consider privacy-respecting opt-in usage telemetry post-launch to inform roadmap, but only if explicit and disclosed.

---

## 9. Script Boundaries: 04 (Missing Values) vs 06 (Outliers)

The two scripts are deliberately separate. Original spec ("missing-value handler also does basic outlier flagging") was wrong: it conflated two different statistical operations and would have produced overlapping CLI flags, confused buyers, and brittle code.

### 9.1 Boundary

**`04_missing_value_handler.py` owns "what's not there"**:
- Detect disguised nulls: `NaN`, empty string, `"N/A"`, `"-"`, `"unknown"`, whitespace-only, sentinel codes (`-999`, `9999`, etc.).
- Missingness pattern analysis (which columns co-miss).
- Imputation strategies: mean, median, mode, forward-fill, KNN (optional), regression (optional).
- Required-field enforcement (drop rows missing a required column).
- Drop rows or columns by missingness threshold.

**`06_outlier_detector.py` owns "what shouldn't be there"**:
- Univariate statistical detection: z-score, IQR, modified z-score (MAD-based).
- Multivariate detection: Isolation Forest, Mahalanobis distance.
- Domain-rule violations (age > 120, negative quantity, future dates in historical data).
- Winsorization / capping as optional remediation.
- Distribution shape diagnostics.

### 9.2 Run Order

04 must run before 06. Reason: outlier statistics computed on data still containing `NaN` or sentinel codes are mathematically poisoned. Means and standard deviations get dragged, IQR widens, false negatives explode.

The Master Orchestrator (script 09) enforces this order. CLI users running scripts manually get a warning printed by 06 if it detects unhandled sentinel patterns in the input.

Pipeline-wide order enforced by the orchestrator: `02_text_cleaner` → `03_format_standardizer` → `04_missing_value_handler` → `05_column_mapper_enforcer` → `06_outlier_detector` → `07_multi_file_merger` → `08_validator_reporter`. Script `01_deduplicator` is order-flexible; it normalizes whitespace and case internally for matching purposes regardless of upstream cleaning, so it can run before or after `02_text_cleaner` depending on the buyer's workflow.

### 9.3 Contested Cases

**Use cases that prove 04 and 06 are distinct concerns** (not just naming differences):

- *Bank export with blank fee columns*: pure 04 job. The fees aren't outliers, they're missing. Imputation or drop-by-threshold is the right tool.
- *Sales data with one $1M order in a $50-average column*: pure 06 job. Nothing is missing; one row is statistically extreme. Z-score or IQR catches it.
- *Survey data where `999` means "refused to answer"*: needs both, in order. 04 converts `999` to `NaN` per `--sentinels`, then 06 computes statistics on the cleaned column.

Sentinel values like `-999` are *both* disguised missing *and* statistical outliers. Resolution: 04 owns sentinel detection and converts them to `NaN` (or imputes per user choice) before 06 sees the data. Sentinel patterns are configurable in 04 via `--sentinels` flag.

Suspicious-but-plausible values (e.g., age = 110): 06's territory. Not missing; just rare.

Whitespace-only cells (e.g., `"   "`) are a contested case between 02 (text cleaner) and 04 (missing value handler). Resolution: 02 trims first, leaving an empty string. 04 then detects empty strings as disguised nulls per its existing logic. This means 02 must run before 04 in any pipeline that uses both. The orchestrator enforces this; CLI users get a warning if 04 detects whitespace-only cells suggesting 02 was skipped.

### 9.4 Shared Plumbing

Both scripts emit:
- A flagged-row report with column, row index, original value, action taken.
- A timestamped log file in `logs/`.
- An optional cleaned output file.

Report and log formats are identical between the two scripts. Implemented via shared helpers in `src/core/` to avoid drift.

---

## 10. Per-Script Functional Requirements

This section captures the full functional spec for each script, beyond the one-line description in USER-GUIDE.md Section 2. Specs answer "what does v1 need to ship to be best-of-class for the target buyer." Promoted from chat-history-only into docs in v1.6 to prevent silent drift.

**Note on script status**: a script labeled "Working" in the bundle status table means it has CLI execution and basic correctness, NOT that it implements every Tier 1 item below. Tier 1 is the v1 launch target; the current code may implement a subset.

### 10.1 `01_deduplicator.py` - Smart duplicate removal

**Current implementation status**: `01_deduplicator.py` exists and works for exact match plus basic fuzzy with configurable subset columns and timestamped logs (the description in USER-GUIDE.md Section 2 reflects current state). It does NOT yet implement most Tier 1 items below. Tier 1 is the v1 launch target, not current state. The Streamlit GUI port is the natural moment to close this gap.

**Strategic framing**: Excel's built-in Remove Duplicates handles exact match for free. Pandas `drop_duplicates()` handles it for free in code. A $49-$79 dedup tool that ships "exact + basic fuzzy" loses to Excel for free or to OpenRefine for free. The fuzzy matching has to be the product, not a checkbox. The market gap this script targets is "fuzzy match quality of OpenRefine, with the zero-learning-curve UX of Excel, sold once for under $100, runs locally" (see BUSINESS.md Section 4a).

#### Tier 1: Must-ship for v1 to be best-of-class

**Input handling**
1. Auto-detect file encoding (UTF-8, UTF-8-BOM, Latin-1, Windows-1252). Failure to handle this correctly is the #1 reason CSV tools crash on real-world business data.
2. Auto-detect delimiter (comma, tab, semicolon, pipe).
3. Read CSV, TSV, XLSX, XLS. For XLSX, support multi-sheet workbooks (let user pick or process each).
4. Handle files larger than RAM via chunked / streaming processing. A 500MB customer export should not crash the tool.
5. Header row detection with manual override.

**Matching**
6. Exact match with configurable subset columns.
7. Fuzzy match algorithms: Levenshtein, Jaro-Winkler, token-set ratio (rapidfuzz library). Multiple algorithms, not one. Different data types match better with different algorithms.
8. Per-column normalization before comparison:
   - Email: lowercase, strip whitespace, strip Gmail dots, strip `+tag` suffixes.
   - Phone: strip formatting and country codes, normalize to E.164.
   - Name: strip titles (Mr/Ms/Dr), strip suffixes (Jr/III), collapse whitespace, optional case-fold.
   - Address: USPS-style abbreviation normalization (St/Street, Ave/Avenue, Apt/#).
   - Generic string: trim, collapse internal whitespace, optional case-fold.
9. Configurable similarity threshold (e.g., 85%, 90%, 95%) per match strategy.
10. Multi-strategy matching with OR logic: "match if email is exact OR (name fuzzy >90% AND phone exact)." This is what real-world dedup actually requires. Single-strategy match handles maybe 40% of cases.

**Survivor selection (which row to keep when duplicates are found)**
11. Configurable survivor rules: keep first, keep last, keep most-complete (fewest empty cells), keep most-recent (date column), keep manually-selected.
12. Merge mode: instead of deleting losers, fill missing fields in survivor from losers. Common real ask: combine partial records into one complete record.

**Trust and review**
13. Dry-run / preview mode by default. Output shows what *would* be merged before any file is written. Non-negotiable for trust. Aligns with Section 5.2 visible-safety standard.
14. Interactive review mode for uncertain matches. For matches in the gray zone (e.g., 75-90% similarity), prompt user yes/no/skip with side-by-side diff. This is what justifies a paid product over free Excel. GUI-natural; CLI gets a reduced-form prompt loop.
15. Confidence score on every fuzzy match in the output.
16. Match group export: separate file showing every group of matched rows so user can audit.

**Audit and safety**
17. Full timestamped log of every action: which rows matched, on which strategy, with what score, which row survived, which fields were merged.
18. Removed-duplicates exported to a separate file (never silently destroyed).
19. Original input file is never modified. Output is always a new file.
20. Idempotency: running the tool twice on the same input with the same config produces the same output.

**Configuration**
21. Save / load configuration profiles. A user who deduplicates a Shopify customer export weekly should configure once, not every run.
22. Sensible defaults that work on a typical messy CSV with zero configuration. The first run must produce a useful result with no flags. Per DECISIONS.md Section 4b UX standards.

**UX**
23. `--help` (CLI) written for non-technical users with concrete examples, not a flag list.
24. Progress bar for files over ~10K rows.
25. Error messages name the row number, column, and value that caused the problem. No raw stack traces. Per Section 5.1.
26. Sample data (`samples/messy_sales.csv`) must demonstrate fuzzy match scenarios, not just exact dupes. Otherwise the demo doesn't sell.

#### Tier 2: Worth-considering for v1.1

27. Numeric tolerance for matching (prices within $0.01 considered same).
28. Date tolerance for matching (dates within N days considered same).
29. Phonetic matching (Soundex, Metaphone) for name fields with common misspellings.
30. Blocking / indexing for performance on large files (compare only rows that share a first letter or first three characters of a key field). Without this, fuzzy match on 100K rows is O(n²) and unusable. Move to Tier 1 if early buyers report performance complaints.
31. Watch-folder mode: auto-process any file dropped into a folder.
32. Geolocation-aware address matching (optional, requires bundled USPS data or third-party API).

#### Tier 3: Optional / later

33. Machine-learning-based match scoring (Dedupe.io territory; high complexity, marginal gain for this price point).
34. Multi-table joins for cross-file dedup.
35. Schedule / cron integration.
36. Direct Shopify / Klaviyo / Mailchimp API integration to dedupe in place. This would be a real differentiator for the Shopify niche specifically and is probably the right v2 direction if early sales validate the niche.

### 10.2 - 10.9 (Future)

Functional specs for scripts 02 through 09 to be added when each script enters active build. The deduplicator spec is the template; specs for other scripts should follow the same Tier 1 / Tier 2 / Tier 3 structure with explicit strategic framing (what's the market gap this script fills, given that some of its functionality is available free elsewhere).
