docs: add sqlite web app design
This commit is contained in:
@@ -0,0 +1,135 @@
|
||||
# SQLite Store and Web App Design
|
||||
|
||||
## Goal
|
||||
|
||||
Move Newsletter Link Catalog from spreadsheet-first output to a SQLite-first local catalog, while preserving spreadsheet output as an optional export/sync path. Add a functional local web app for browsing the catalog.
|
||||
|
||||
## Scope
|
||||
|
||||
This first pass keeps the working CLI ingestion flow. `nlc run` continues to fetch Gmail newsletters, parse links, categorize, enrich when enabled, and update incremental state. The main behavior change is that extracted data is persisted to SQLite by default.
|
||||
|
||||
The web app is intentionally plain and functional. It should make the catalog useful before investing in visual polish.
|
||||
|
||||
## Architecture
|
||||
|
||||
Add a persistence layer alongside the existing output writers:
|
||||
|
||||
- `DatabaseWriter` persists run results into SQLite.
|
||||
- Spreadsheet writers remain available through the existing config-driven write path during `nlc run`.
|
||||
- Web routes read from SQLite and do not re-run Gmail ingestion.
|
||||
- Existing parsing, filtering, categorization, enrichment, state, and OAuth modules stay in place.
|
||||
|
||||
The CLI remains the control surface for ingestion:
|
||||
|
||||
- `nlc run` writes to SQLite by default.
|
||||
- `nlc run` may also write Excel and/or Google Sheets if configured.
|
||||
- `nlc serve` starts a local web server against the configured SQLite database.
|
||||
|
||||
## SQLite Database
|
||||
|
||||
Default database path: `./data/newsletter-catalog.sqlite`, configurable as `database.path`.
|
||||
|
||||
Tables:
|
||||
|
||||
- `newsletters`: newsletter identity and display name.
|
||||
- `issues`: one row per processed email issue, linked to a newsletter.
|
||||
- `links`: canonical URL-level records.
|
||||
- `link_occurrences`: per-issue link appearances, including category, title, description, page metadata, and also-in text.
|
||||
- `sponsors`: sponsor/ad appearances, linked to issue and URL where possible.
|
||||
- `dead_links`: dead or unreachable link records.
|
||||
- `runs`: import run history with counts, timestamps, mode, and error counts.
|
||||
|
||||
Constraints:
|
||||
|
||||
- URLs are deduplicated in `links` by normalized URL.
|
||||
- Link occurrences are deduplicated per issue by normalized URL.
|
||||
- Processed Gmail message IDs remain tracked by the existing state file for now, avoiding a risky migration in this first pass.
|
||||
|
||||
## Configuration
|
||||
|
||||
Extend `config.yaml`:
|
||||
|
||||
```yaml
|
||||
database:
|
||||
enabled: true
|
||||
path: "./data/newsletter-catalog.sqlite"
|
||||
```
|
||||
|
||||
Behavior:
|
||||
|
||||
- If `database.enabled` is omitted, it defaults to `true`.
|
||||
- Spreadsheet output remains controlled by the existing `output.excel.enabled` and `output.sheets_api.enabled` settings.
|
||||
- If all outputs are disabled, `nlc run` still writes to SQLite.
|
||||
|
||||
## Web App
|
||||
|
||||
Add `nlc serve [flags]`.
|
||||
|
||||
Flags:
|
||||
|
||||
- `--config PATH`, default `./config.yaml`
|
||||
- `--host HOST`, default `127.0.0.1`
|
||||
- `--port PORT`, default `3000`
|
||||
|
||||
Views:
|
||||
|
||||
- Dashboard: counts for newsletters, issues, links, sponsors, dead links, and recent runs.
|
||||
- Links: searchable table of content links with filters for newsletter, category, and date range.
|
||||
- Newsletter detail: issues and links for one newsletter.
|
||||
- Sponsored links: global sponsor table.
|
||||
- Dead links: global dead/unreachable table.
|
||||
- Runs: recent import run summaries.
|
||||
|
||||
The first UI can use server-rendered HTML with minimal CSS. No SPA framework is required.
|
||||
|
||||
## Data Flow
|
||||
|
||||
Import:
|
||||
|
||||
1. Gmail fetches configured label messages.
|
||||
2. Existing parser extracts links and sponsor entries.
|
||||
3. Existing cleanup/categorization/enrichment pipeline prepares rows.
|
||||
4. `DatabaseWriter` upserts newsletters, issues, links, occurrences, sponsors, dead links, and run summary.
|
||||
5. Optional spreadsheet writers run after database persistence when configured.
|
||||
|
||||
Web:
|
||||
|
||||
1. `nlc serve` opens the SQLite database.
|
||||
2. Request handlers query read-only catalog views.
|
||||
3. HTML pages render tables and simple filters.
|
||||
|
||||
## Error Handling
|
||||
|
||||
- Database open/migration errors are critical and stop `nlc run` or `nlc serve`.
|
||||
- A single failed row insert during import increments the run error count and stops the database transaction for that run.
|
||||
- SQLite writes use transactions so partial import data is not committed on failure.
|
||||
- Web route errors return a simple 500 page and log the error.
|
||||
|
||||
## Testing
|
||||
|
||||
Add tests for:
|
||||
|
||||
- Config defaults for `database.enabled` and `database.path`.
|
||||
- Schema migration creates expected tables.
|
||||
- Database writer inserts newsletters, issues, links, sponsors, and dead links.
|
||||
- Duplicate normalized URLs are deduplicated while occurrences remain per issue.
|
||||
- `nlc run` writes to SQLite by default even when spreadsheet outputs are disabled.
|
||||
- Web routes render dashboard, links, sponsored links, dead links, and run history using a fixture database.
|
||||
- Existing spreadsheet writer tests continue to pass.
|
||||
|
||||
## Out of Scope for This Pass
|
||||
|
||||
- Browser-based Gmail OAuth setup.
|
||||
- Running Gmail imports from a web button.
|
||||
- User accounts or remote hosting.
|
||||
- Full visual design polish.
|
||||
- Migrating old JSON state into SQLite.
|
||||
- Removing Excel or Google Sheets support.
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- `nlc run` creates/updates the SQLite database by default.
|
||||
- `nlc serve` starts a local web app and displays imported catalog data.
|
||||
- Spreadsheet output still works when enabled.
|
||||
- Existing CLI validations remain green.
|
||||
- Tests cover database persistence and web read paths without live Gmail or Google credentials.
|
||||
Reference in New Issue
Block a user