✨feature: First push to git

2026-05-16 14:02:49 -05:00
commit 265f69d95a
46 changed files with 11551 additions and 0 deletions
@@ -0,0 +1,378 @@
+# Newsletter Link Catalog — Specification
+
+## Overview
+
+A CLI tool that extracts links from newsletters in a designated Gmail folder, categorizes them, enriches them with metadata, and compiles them into a spreadsheet. Each newsletter gets its own sheet, links are organized by issue date and category, and sponsor links are tracked separately.
+
+## Architecture
+
+### Language & Runtime
+- **TypeScript/Node.js** — compiled to a standalone binary by the project build script
+- CLI tool invoked as `nlc run [flags]`
+
+### Distribution
+- Standalone binary — no Node runtime required on the host machine
+- Built and packaged via CI or build script
+- The build script must document the selected bundling tool and produce the binary from a clean checkout
+
+### Run Modes
+- **Manual**: Run `nlc run` on demand with optional date filters
+- **Scheduled**: Can be run via cron/Task Scheduler for recurring processing
+- Designed for both; no daemon mode required
+
+## Gmail Integration
+
+### Authentication
+- **OAuth2 browser flow** — user authorizes via browser, tokens persisted locally
+- `nlc init` command walks through OAuth setup interactively
+
+### Scope
+- Processes emails from a **single designated Gmail folder/label** (configured in `config.yaml`)
+- Does not scan the entire inbox or search by sender patterns
+
+### Email Processing
+- **HTML only** — plain-text parts are ignored
+- **Image-only emails** (single image, no extractable links) are skipped with a warning logged
+- **"View in browser" emails** — if the email contains no content links after noise filtering and contains a mirror link with anchor text matching `view in browser`, `view online`, or `read online`, fetch that mirror URL and extract links from the fetched HTML instead
+- Incremental by default: tracks processed Message-IDs in a local state file, only processes new emails
+- `--full` flag forces reprocessing of all emails that match the configured label and any date filters
+
+## Link Extraction & Processing
+
+### Extraction Pipeline
+1. Fetch emails from the configured Gmail folder (incremental or full)
+2. Parse HTML to extract links, section headers, and surrounding text. A section header is the nearest preceding heading-like element (`h1`-`h6`, table row header, or bold standalone line) within the same content block.
+3. Filter out noise links: unsubscribe, social footer icons, "share this newsletter" links
+4. Unwrap supported tracking redirects and strip configured tracking query parameters — store the normalized destination URL
+5. Merge "Read more" links with their preceding content (detected by: consecutive links with the same normalized URL and anchor text matching the configured read-more pattern)
+6. Categorize each link (see Categorization section)
+7. Write to spreadsheet (see Output section)
+
+### Noise Filtering
+The following link types are **excluded** from content sheets:
+- Unsubscribe links
+- Social media links in footer or sharing blocks
+- Links whose anchor text or accessible label matches configured share/forward patterns
+- "View in browser" mirror links (content is extracted from the web version instead)
+
+Sponsor/ad links are **not filtered** — they go to a separate sheet when the link is inside a block labeled with configured sponsor markers such as "sponsor", "sponsored", "ad", "advertisement", or "partner".
+
+### URL Handling
+- Unwrap HTTP redirects and supported provider redirect URLs up to the configured redirect limit
+- Strip configured tracking query parameters, including `utm_*`, `fbclid`, `gclid`, `mc_cid`, `mc_eid`, and provider-specific tracking parameters listed in config
+- Store the normalized destination URL after redirect unwrapping and query cleanup
+- Dead/broken links (4xx/5xx during enrichment) are written to the "Dead Links" sheet and removed from content sheets when they were already written by an earlier phase or run
+
+### "Read More" Merging
+When two consecutive extracted links point to the same normalized URL and one anchor text matches the configured read-more pattern, they are merged into a single entry combining the preceding link title/description with the read-more link URL.
+
+## Categorization
+
+### Strategy: Hybrid
+1. **Primary**: Use the newsletter's own section headers (e.g., "Python", "DevOps", "Career") as categories
+2. **Fallback**: When section headers aren't available or don't cover a link, use rule-based classification (URL patterns + keywords)
+3. **Final fallback**: LLM-based categorization when rules don't match
+
+### Category Taxonomy
+- Built-in base taxonomy shipped with the tool for common dev categories (Python, JavaScript, DevOps, Security, etc.)
+- User can extend via config with custom categories
+- For fallback categorization, the LLM is instructed to prefer configured categories and may create a new category only when no existing category fits
+
+### LLM Provider Support (BYOK)
+The tool supports a provider adapter interface and ships adapters for:
+- **Claude/Anthropic** — Anthropic API
+- **OpenAI/GPT** — OpenAI API
+- **Local models** — Ollama, LM Studio
+- **OpenAI-compatible endpoints** — Mistral, Groq, Together, etc.
+
+Provider config includes: API key environment variable, base URL when required, model name, and optional provider parameters.
+
+### Newsletter Parsing: Plugin System
+- Generic HTML parser as the default
+- Platform-specific parsers loaded as plugins (detected by URL patterns or email headers)
+- **Substack** shipped as the first plugin — maps Substack-specific HTML structures to the common extracted-link format
+- Additional parsers can be added as plugins without modifying core logic
+
+## Output: Spreadsheet
+
+### Supported Formats
+- **Google Sheets** — via Google Sheets API (live, shareable, updated by each write run)
+- **Local Excel (.xlsx)** — written to disk, can be uploaded manually
+
+Config selects which output(s) to use; both can be active simultaneously.
+
+### Spreadsheet Name
+- Fixed name set in `config.yaml` (e.g., "Newsletter Link Catalog")
+
+### Sheet Naming
+- Each newsletter gets its own sheet named after the parsed display name from the email's From header
+- Names truncated to fit Google Sheets' 100-character limit
+- Characters invalid for Google Sheets or Excel sheet names are replaced with spaces, then repeated whitespace is collapsed
+
+### Content Sheet Columns
+Every link occurrence is written as a flat row; blank grouping rows are not used. Fields unavailable from the source are written as empty cells.
+
+| Column | Description |
+|---|---|
+| Issue Date | Date from email's Date header (overridable per-newsletter) |
+| Category | Assigned category (from newsletter sections, rules, or LLM) |
+| Link URL | Clean canonical URL after unwrapping and UTM removal |
+| Title | Anchor text / headline from the newsletter |
+| Description | 1-2 sentence description from the newsletter (if present) |
+| Page Title + Meta | `<title>` and meta description from the destination page (enrichment phase) |
+| Source Newsletter | Name of the newsletter this link came from |
+| Also In | Cross-reference: other newsletters that also mentioned this link |
+
+### Sponsor Sheet (Consolidated)
+Single sheet named "Sponsored Links" containing sponsor/ad links from all newsletters:
+
+| Column | Description |
+|---|
+| Newsletter | Which newsletter this sponsor link appeared in |
+| Sponsor | Sponsor name (parsed from newsletter) |
+| Link | Sponsor's link URL |
+| Description | Sponsor description from the newsletter |
+
+### Dead Links Sheet
+Single sheet named "Dead Links" for links that returned errors during enrichment:
+
+| Column | Description |
+|---|
+| URL | The clean canonical URL |
+| Status | HTTP status or error type (404, 403, timeout, etc.) |
+| Source | Newsletter name |
+| Date | Issue date |
+
+### Cross-References
+- Duplicates across newsletters are kept in their respective sheets (all occurrences preserved)
+- The **Also In** column annotates each row with other newsletter issues that mentioned the same normalized URL, formatted as `Newsletter Name (YYYY-MM-DD)` and joined with `; `
+- This enables finding cross-newsletter coverage without a separate consolidated sheet
+
+### No "All Links" Master Sheet
+Only per-newsletter content sheets, plus the consolidated Sponsor and Dead Links sheets. No "All Links" aggregation sheet.
+
+## Enrichment
+
+### Two-Phase Approach
+1. **Phase 1 (Store)**: Extract links from newsletters, categorize, and write to spreadsheet with all available in-newsletter metadata
+2. **Phase 2 (Enrich)**: Separate pass to fetch each link's destination page for `<title>` and meta description
+
+Enrichment can be run independently from extraction and spreadsheet writing.
+
+### Enrichment Details
+- Configurable concurrency with defaults of 3 parallel requests and 1500 ms delay between batches
+- Retries on transient failures
+- Dead links (4xx/5xx) are written to the Dead Links sheet and removed from content sheets when they were already written by an earlier phase or run
+- Skip pages that redirect to a URL whose path or query contains `login`, `signin`, `subscribe`, or `paywall` — mark with "paywall" status
+- Progress bar updates after each completed enrichment request
+
+### Link Liveness
+- Dead links are **not included** in content sheets — they go to the Dead Links sheet
+- Paywalled links are included in content sheets and the Page Title + Meta column is set to `[paywall]`
+- Timeout, DNS, TLS, and network failures are included in content sheets and the Page Title + Meta column is set to `[unreachable: error_type]`
+
+## Processing Model
+
+### Incremental Processing
+- Local state file (JSON) tracks processed Message-IDs and enrichment status
+- On subsequent runs, only new/unprocessed emails are fetched
+- `--full` flag forces reprocessing of all emails that match the configured label and any date filters
+- State file location: `~/.nlc/state.json` (or configured path)
+
+### Date Filtering
+- `--from YYYY-MM-DD` and `--to YYYY-MM-DD` — absolute date range
+- `--last N` (e.g., `--last 30d`, `--last 7d`) — relative date range
+- Date filters apply before the incremental processed-message check
+- If both `--last` and `--from`/`--to` are provided, the CLI exits with a config error
+
+### Dry Run
+- `--dry-run` processes the most recent N emails (default: 5) without writing to the spreadsheet
+- Shows what would be extracted, categorized, and written
+- Dry run does not update the state file or call destination pages for enrichment unless `--dry-run` is combined with `--enrich-only`
+
+### Error Handling
+- **Critical errors** (Gmail auth failure, spreadsheet write failure, config errors) → stop execution
+- **Individual errors** (one link fails to enrich, one email fails to parse) → log and continue
+- Summary at end includes error counts and details
+
+### Progress & Logging
+- Progress bar during processing (emails fetched, links extracted, enrichment status)
+- Summary stats at the end: newsletters processed, links extracted, duplicates found, dead links, sponsors, errors
+
+## CLI Interface
+
+### Commands
+
+```
+nlc init          # Interactive setup: OAuth, config file, connectivity test
+nlc run [flags]   # Main processing command
+```
+
+### `nlc run` Flags
+
+| Flag | Description | Default |
+|---|---|---|
+| `--full` | Reprocess all emails, not just new ones | false |
+| `--dry-run [N]` | Process most recent N emails without writing to sheet | 5 |
+| `--from YYYY-MM-DD` | Process emails from this date | (none) |
+| `--to YYYY-MM-DD` | Process emails up to this date | (none) |
+| `--last N` | Process emails from last N days (e.g., `--last 30d`) | (none) |
+| `--skip-enrich` | Skip the enrichment phase (only extract + categorize) | false |
+| `--enrich-only` | Only run enrichment on already-extracted links | false |
+| `--config PATH` | Path to config file | `./config.yaml` |
+| `--verbose` | Detailed per-email and per-link output | false |
+
+## Configuration
+
+### File Format: YAML
+Location: `./config.yaml` (overridable with `--config`)
+
+### Sample Structure
+
+```yaml
+# Gmail settings
+gmail:
+  folder: "Newsletters"           # Gmail label/folder to process
+  credentials: "~/.nlc/gmail-credentials.json"
+  token: "~/.nlc/gmail-token.json"
+
+# Output settings
+output:
+  name: "Newsletter Link Catalog"  # Spreadsheet name
+  sheets_api:
+    enabled: true
+    credentials: "~/.nlc/sheets-credentials.json"
+    token: "~/.nlc/sheets-token.json"
+  excel:
+    enabled: true
+    path: "./output/newsletter-catalog.xlsx"
+
+# Newsletter identification
+newsletters:
+  # Manual overrides for parsed display names
+  "alex@bytebytego.com":
+    display_name: "ByteByteGo"
+  "dan@techtakesweekly.com":
+    display_name: "Tech Takes Weekly"
+
+# Link processing
+links:
+  unwrap_redirects: true
+  strip_utm: true
+  tracking_params:
+    - "utm_*"
+    - "fbclid"
+    - "gclid"
+    - "mc_cid"
+    - "mc_eid"
+  redirect_limit: 5
+  read_more_pattern: "(?i)^(read more|continue reading|learn more)$"
+  share_patterns:
+    - "(?i)share"
+    - "(?i)forward to a friend"
+  sponsor_markers:
+    - "(?i)sponsor"
+    - "(?i)sponsored"
+    - "(?i)advertisement"
+    - "(?i)partner"
+  filter_unsubscribe: true
+  filter_social_footer: true
+  filter_share_links: true
+  merge_read_more: true
+
+# Categorization
+categories:
+  # Built-in taxonomy is used by default; extend here
+  custom:
+    - "AI/ML"
+    - "Career"
+    - "Rust"
+  # LLM settings for category inference
+  llm:
+    provider: "anthropic"         # anthropic | openai | local | openai-compatible
+    model: "claude-sonnet-4-6"
+    api_key_env: "ANTHROPIC_API_KEY"
+    base_url: null                # for local/openai-compatible
+    failure_category: "Uncategorized"
+
+# Enrichment
+enrichment:
+  enabled: true
+  concurrency: 3
+  delay_ms: 1500
+  retries: 2
+  timeout_ms: 10000
+
+# Rate limiting (applies to both Gmail API and enrichment)
+rate_limit:
+  gmail_qps: 5                    # queries per second to Gmail API
+  link_concurrency: 3             # parallel link fetches
+
+# State
+state_file: "~/.nlc/state.json"
+
+# Parsing plugins
+plugins:
+  substack:
+    enabled: true
+```
+
+### Issue Date Override
+For newsletters where the email arrival date doesn't match the issue date, overrides can be configured:
+
+```yaml
+newsletters:
+  "sender@domain.com":
+    display_name: "Newsletter Name"
+    date_override: "subject"      # Parse date from subject line
+    date_format: "%B %d, %Y"      # Expected date format in subject
+```
+
+## Data Flow
+
+```
+┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
+│  Gmail API  │────▶│ Parse HTML   │────▶│  Categorize  │────▶│ Write Sheet  │
+│  (fetch)    │     │ + Extract    │     │  (hybrid)    │     │ (Phase 1)    │
+└─────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
+                           │                                          │
+                           ▼                                          ▼
+                    ┌──────────────┐                           ┌──────────────┐
+                    │  State File  │                           │  Enrichment  │
+                    │  (processed  │                           │  (Phase 2)   │
+                    │   tracking)  │                           │  Page titles  │
+                    └──────────────┘                           └──────────────┘
+```
+
+## Edge Cases
+
+| Scenario | Behavior |
+|---|---|
+| Email is a single image with no links | Skip with warning, log to state |
+| "View in browser" link instead of content | Fetch the first matching mirror link, extract links from that HTML |
+| Same link in multiple newsletters | Keep all occurrences, cross-reference via "Also In" column |
+| Same link multiple times in one issue | Deduplicate per-issue; single row per unique URL |
+| Link returns 4xx/5xx during enrichment | Move to Dead Links sheet |
+| Link is paywalled/auth-required | Include in content sheet, mark Page Title + Meta as "[paywall]" |
+| Link times out or has a network error | Include in content sheet, mark Page Title + Meta as "[unreachable: error_type]" |
+| Newsletter name > 100 chars | Truncate for sheet name |
+| Sheet already exists for newsletter | Append new rows, don't overwrite existing data |
+| Gmail API rate limit | Retry with exponential backoff |
+| OAuth token expired | Auto-refresh, re-prompt if refresh fails |
+| Newsletter format changes | Parser falls back to generic HTML extraction |
+
+## Setup & First Run
+
+1. **`nlc init`** — Interactive walkthrough:
+   - Authenticate with Gmail (OAuth browser flow)
+   - Authenticate with Google Sheets (if using Sheets output)
+   - Select the Gmail folder/label to process
+   - Configure output location
+   - Test connectivity
+   - Generate `config.yaml`
+
+2. **`nlc run --dry-run`** — Test with 5 most recent emails
+
+3. **`nlc run`** — Full processing run
+
+4. **`nlc run --enrich-only`** — Enrich previously extracted links with page titles