# USER-GUIDE.md - Excel & CSV Data Cleaning Mastery Bundle

**Version**: 1.6
**Last updated**: April 28, 2026

Thank you for purchasing the Data Cleaning Mastery Bundle. This guide covers installation and every script included.

---

## 1. Installation

The bundle is fully self-contained. **You do not need to install Python.**

### Windows

1. Download `BundleName-Setup-1.0.exe` from your purchase email.
2. Double-click the installer.
3. Follow the wizard. The installer creates a desktop shortcut named "Launch Bundle" and an entry in the Start Menu.
4. Launch via the desktop shortcut. Your default browser will open to a local page (typically `http://localhost:8501`) where the data tool runs.

### macOS

1. Download `BundleName-1.0.dmg` from your purchase email.
2. Double-click the `.dmg` to mount it.
3. Drag the Bundle app into the Applications folder.
4. Launch from Applications, Spotlight, or Launchpad. Your default browser will open to a local page where the data tool runs.

The app is signed with an Apple Developer ID and notarized by Apple, so it should open without security warnings.

### Linux

1. Download `BundleName-1.0.AppImage` from your purchase email.
2. Make it executable: `chmod +x BundleName-1.0.AppImage`
3. Double-click to run, or execute from a terminal. Your default browser will open to a local page where the data tool runs.

If AppImage doesn't work on your distribution, a `.tar.gz` fallback is available in your purchase email. Extract it and run `./run.sh` from the extracted folder.

### How the GUI works (important to know)

This tool runs in your browser **locally on your computer**. When you launch it, a small program starts a local server on your machine and opens your default browser to view it. This is normal and expected.

- **No internet is required.** Your data never leaves your computer.
- **Your data is not uploaded anywhere.** All processing happens on your machine.
- The browser is just the display surface. Closing the browser closes the GUI; the underlying program also stops.

If you prefer the command line, every script also ships as a CLI tool. See Section 3.

### Requirements

- Windows: Windows 10 or 11 (64-bit).
- macOS: macOS 11 Big Sur or later (Apple Silicon or Intel).
- Linux: any modern 64-bit distribution from 2020 onward.
- A modern default browser (Chrome, Edge, Firefox, or Safari from the last 3 years).
- ~400-500 MB free disk space.
- Internet connection: not required.

---

## 2. What's Included

**Scripts (in the `scripts/` folder)**:

| # | Script | Purpose | Status |
|---|---|---|---|
| 01 | `01_deduplicator.py` | Smart duplicate removal: exact match + basic fuzzy, configurable subset columns, full logs | Working |
| 02 | `02_text_cleaner.py` | Character-level hygiene: trim leading/trailing whitespace, collapse internal multi-spaces, strip non-printable characters, Unicode normalization (smart quotes, em-dashes, accents), remove zero-width characters, BOM handling, line-ending normalization, case operations | Skeleton |
| 03 | `03_format_standardizer.py` | Standardize dates, currencies, names, phone numbers, addresses | Skeleton |
| 04 | `04_missing_value_handler.py` | Detect and handle missing values: disguised nulls (`N/A`, `-`, blanks, sentinel codes), imputation (mean/median/mode/forward-fill), required-field enforcement, drop-by-threshold | Skeleton |
| 05 | `05_column_mapper_enforcer.py` | Rename columns, enforce a target schema | Skeleton |
| 06 | `06_outlier_detector.py` | Detect and flag statistical outliers (z-score, IQR, modified z-score), multivariate detection, domain-rule violations, optional winsorization | Skeleton |
| 07 | `07_multi_file_merger.py` | Merge multiple CSV or Excel files into one | Skeleton |
| 08 | `08_validator_reporter.py` | Validate data against rules, output PDF or Excel report | Skeleton |
| 09 | `09_master_orchestrator.py` | One-click launcher menu, calls any other script | Skeleton |
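
To make the "exact match on configurable subset columns" idea from row 01 concrete, here is a minimal sketch in plain Python. It is an illustration only, not the code shipped in `01_deduplicator.py`; the function name `drop_exact_duplicates` and the sample rows are hypothetical, and fuzzy matching is not covered.

```python
import csv
import io


def drop_exact_duplicates(rows, subset=None):
    """Keep the first row for each distinct key; return (kept, removed).

    rows   -- list of dicts, for example from csv.DictReader
    subset -- column names that define a duplicate; None means all columns
    """
    seen = set()
    kept, removed = [], []
    for row in rows:
        columns = subset if subset is not None else sorted(row)
        key = tuple(row[col] for col in columns)
        if key in seen:
            removed.append(row)
        else:
            seen.add(key)
            kept.append(row)
    return kept, removed


# Hypothetical sample data, not taken from messy_sales.csv.
SAMPLE = """email,phone,amount
a@example.com,555-0100,10
a@example.com,555-0100,12
b@example.com,555-0101,10
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
kept, removed = drop_exact_duplicates(rows, subset=["email", "phone"])
print(len(kept), "kept,", len(removed), "removed")
```

Returning the removed rows alongside the kept ones mirrors the "full logs" idea: nothing is discarded without a record of what was dropped.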

**Sample data (in the `samples/` folder)**:
- `messy_sales.csv` - intentionally dirty sales data for testing.
- `bank_export.xlsx` - sample bank export for testing missing-value handling and outlier detection.

---

## 3. Usage

You have two ways to use the bundle: the GUI (recommended for most users) or the CLI (for power users and automation).

### 3.1 GUI usage (recommended)

1. Launch the bundle via the desktop shortcut, app icon, or AppImage.
2. Your browser opens to the bundle's home page.
3. Select the script you want to use from the sidebar (Deduplicator, Format Standardizer, etc.).
4. Drop your file into the file uploader, or select from the included samples.
5. Sensible defaults are pre-filled. Click "Run" to see a preview of what the script will do.
6. Review the preview. If it looks right, click "Save Output" to write the cleaned file.

The GUI is designed to work out of the box with zero configuration. Advanced options are tucked into expandable "Advanced" panes for users who want them.

### 3.2 CLI usage

All scripts are also CLI tools with `--help` output.

**Basic usage** (from a terminal):

Windows (the bundle adds CLI tools to your PATH):
```
deduplicator samples\messy_sales.csv
```

macOS / Linux:
```
deduplicator samples/messy_sales.csv
```

**With options**:

```
deduplicator samples/messy_sales.csv --output cleaned.csv --subset email,phone
```

**Get help on any script**:

```
deduplicator --help
```
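
For automation, the CLI can be driven from a short script. The sketch below assumes `deduplicator` is on your PATH and uses only the `--output` and `--subset` options shown above. The helper names `build_command` and `run_batch`, the `cleaned` output folder, and the `_cleaned.csv` naming are hypothetical choices for illustration.

```python
import subprocess
from pathlib import Path


def build_command(input_path, output_dir, subset):
    """Return the argument list for one deduplicator run."""
    output_path = Path(output_dir) / f"{Path(input_path).stem}_cleaned.csv"
    return [
        "deduplicator",
        str(input_path),
        "--output", str(output_path),
        "--subset", ",".join(subset),
    ]


def run_batch(input_dir, output_dir, subset):
    """Run the deduplicator once per CSV file found in input_dir."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for csv_file in sorted(Path(input_dir).glob("*.csv")):
        # check=True raises CalledProcessError if the tool exits non-zero.
        subprocess.run(build_command(csv_file, output_dir, subset), check=True)


# Preview the command for one file without running it.
cmd = build_command("samples/messy_sales.csv", "cleaned", ["email", "phone"])
print(" ".join(cmd))
```

Calling `run_batch("samples", "cleaned", ["email", "phone"])` would then process every CSV in the `samples` folder. Previewing the command first is a cheap way to confirm paths and options before a long batch.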

**Recommended run order**: If you are running scripts individually, run `02_text_cleaner` first to normalize whitespace and special characters, then `04_missing_value_handler` *before* `06_outlier_detector`. Outlier detection on data that still contains blanks or sentinel codes (like `-999`) produces unreliable results because missing-value placeholders distort the statistics: the mean is pulled toward the placeholder, the standard deviation and IQR can widen, and genuine outliers go undetected (false negatives). The Master Orchestrator (script 09) runs them in the correct order automatically.
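
The effect of a sentinel code on outlier detection can be seen in a small sketch. The values below are hypothetical, and the z-score check is a simplified stand-in for illustration, not the logic in `06_outlier_detector.py`.

```python
import statistics

# Hypothetical column: eight real amounts plus two -999 "missing" sentinels.
raw = [120, 135, 128, 142, 131, 125, 138, 400, -999, -999]
SENTINELS = {-999}


def zscore_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Drop the sentinels first, as the missing-value step would.
cleaned = [v for v in raw if v not in SENTINELS]

print("mean with sentinels:   ", statistics.mean(raw))
print("mean without sentinels:", statistics.mean(cleaned))
print("outliers with sentinels:   ", zscore_outliers(raw))
print("outliers without sentinels:", zscore_outliers(cleaned))
```

With this sample, the run that includes the sentinels flags nothing, because the two `-999` values inflate the standard deviation enough to hide the `400`. Once the sentinels are removed, `400` is flagged.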

---

## 4. Output

Every script writes:
- A cleaned output file next to the input (or wherever you specify).
- A timestamped log file in the `logs/` folder showing what changed and why.

Reports from `validator_reporter` go to the `reports/` folder as PDF or Excel.

The GUI also displays the output preview in-browser before any file is written. The original input file is never modified.

---

## 5. Troubleshooting

**The GUI won't launch / browser doesn't open**:
1. Wait 10-15 seconds after double-clicking. The local server takes a moment to start the first time.
2. If the browser doesn't open automatically, manually visit `http://localhost:8501` in your browser.
3. If you see a "port in use" error, another program is using port 8501. Close other instances of the bundle and try again.
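
If you are unsure whether port 8501 is already taken, a quick check like the following can help. This is a minimal sketch that assumes the local server listens on the IPv4 loopback address; the function name `port_in_use` is hypothetical and not part of the bundle.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds.
        return sock.connect_ex((host, port)) == 0


if port_in_use(8501):
    print("Port 8501 is busy; close other instances of the bundle.")
else:
    print("Port 8501 is free.")
```

A "busy" result while the bundle is closed suggests that a different program is holding the port.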

**"Why is my browser opening?" / "Why does this need internet?"**:
This tool runs as a local web app. The browser is just the display; nothing is uploaded, nothing leaves your computer. No internet connection is used after install. This is the same approach used by many modern data tools (Jupyter notebooks, RStudio, etc.).

**Windows: "Windows protected your PC" SmartScreen warning**:
Click "More info" then "Run anyway." This is a standard warning for software without an extended-validation Windows code signing certificate.

**macOS: "App is damaged and cannot be opened"**:
This usually indicates the download was corrupted. Re-download from the link in your purchase email.

**Linux: AppImage will not run**:
Make sure it is executable: `chmod +x BundleName-1.0.AppImage`. If it still fails, your distribution may be missing FUSE; install it with `sudo apt install libfuse2` (Debian/Ubuntu; on Ubuntu 24.04 and later the package is named `libfuse2t64`) or use the `.tar.gz` fallback.

**Script throws an error about a file**:
Check the log file in the `logs/` folder. The log explains exactly what went wrong and which row of input data triggered it.
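
To find the most recent log quickly, something like the sketch below may help. It assumes you run it from the folder that contains `logs/`, and it picks the newest file by modification time because the exact log file naming is not documented here. The function name `newest_log` is hypothetical.

```python
from pathlib import Path


def newest_log(log_dir="logs"):
    """Return the most recently modified file in log_dir, or None if empty."""
    files = [p for p in Path(log_dir).glob("*") if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


latest = newest_log()
if latest is None:
    print("No log files found.")
else:
    print(f"Newest log: {latest}")
    # Show the last 20 lines, where the most recent messages usually are.
    text = latest.read_text(encoding="utf-8", errors="replace")
    for line in text.splitlines()[-20:]:
        print(line)
```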

**The GUI feels slow on a large file**:
Files over ~100,000 rows take longer to process, and the GUI shows a progress bar while it works. For very large files (millions of rows), consider using the CLI directly, which is faster for batch jobs.

**Need help**: Email the address on your purchase receipt.

---

## 6. License

Single-user license. Do not redistribute. See `LICENSE.txt` in the install folder.