# Newsletter Link Catalog — Specification ## Overview A CLI tool that extracts links from newsletters in a designated Gmail folder, categorizes them, enriches them with metadata, and compiles them into a spreadsheet. Each newsletter gets its own sheet, links are organized by issue date and category, and sponsor links are tracked separately. ## Architecture ### Language & Runtime - **TypeScript/Node.js** — compiled to a standalone binary by the project build script - CLI tool invoked as `nlc run [flags]` ### Distribution - Standalone binary — no Node runtime required on the host machine - Built and packaged via CI or build script - The build script must document the selected bundling tool and produce the binary from a clean checkout ### Run Modes - **Manual**: Run `nlc run` on demand with optional date filters - **Scheduled**: Can be run via cron/Task Scheduler for recurring processing - Designed for both; no daemon mode required ## Gmail Integration ### Authentication - **OAuth2 browser flow** — user authorizes via browser, tokens persisted locally - `nlc init` command walks through OAuth setup interactively ### Scope - Processes emails from a **single designated Gmail folder/label** (configured in `config.yaml`) - Does not scan the entire inbox or search by sender patterns ### Email Processing - **HTML only** — plain-text parts are ignored - **Image-only emails** (single image, no extractable links) are skipped with a warning logged - **"View in browser" emails** — if the email contains no content links after noise filtering and contains a mirror link with anchor text matching `view in browser`, `view online`, or `read online`, fetch that mirror URL and extract links from the fetched HTML instead - Incremental by default: tracks processed Message-IDs in a local state file, only processes new emails - `--full` flag forces reprocessing of all emails that match the configured label and any date filters ## Link Extraction & Processing ### Extraction Pipeline 1. Fetch emails from the configured Gmail folder (incremental or full) 2. Parse HTML to extract links, section headers, and surrounding text. A section header is the nearest preceding heading-like element (`h1`-`h6`, table row header, or bold standalone line) within the same content block. 3. Filter out noise links: unsubscribe, social footer icons, "share this newsletter" links 4. Unwrap supported tracking redirects and strip configured tracking query parameters — store the normalized destination URL 5. Merge "Read more" links with their preceding content (detected by: consecutive links with the same normalized URL and anchor text matching the configured read-more pattern) 6. Categorize each link (see Categorization section) 7. Write to spreadsheet (see Output section) ### Noise Filtering The following link types are **excluded** from content sheets: - Unsubscribe links - Social media links in footer or sharing blocks - Links whose anchor text or accessible label matches configured share/forward patterns - "View in browser" mirror links (content is extracted from the web version instead) Sponsor/ad links are **not filtered** — they go to a separate sheet when the link is inside a block labeled with configured sponsor markers such as "sponsor", "sponsored", "ad", "advertisement", or "partner". ### URL Handling - Unwrap HTTP redirects and supported provider redirect URLs up to the configured redirect limit - Strip configured tracking query parameters, including `utm_*`, `fbclid`, `gclid`, `mc_cid`, `mc_eid`, and provider-specific tracking parameters listed in config - Store the normalized destination URL after redirect unwrapping and query cleanup - Dead/broken links (4xx/5xx during enrichment) are written to the "Dead Links" sheet and removed from content sheets when they were already written by an earlier phase or run ### "Read More" Merging When two consecutive extracted links point to the same normalized URL and one anchor text matches the configured read-more pattern, they are merged into a single entry combining the preceding link title/description with the read-more link URL. ## Categorization ### Strategy: Hybrid 1. **Primary**: Use the newsletter's own section headers (e.g., "Python", "DevOps", "Career") as categories 2. **Fallback**: When section headers aren't available or don't cover a link, use rule-based classification (URL patterns + keywords) 3. **Final fallback**: LLM-based categorization when rules don't match ### Category Taxonomy - Built-in base taxonomy shipped with the tool for common dev categories (Python, JavaScript, DevOps, Security, etc.) - User can extend via config with custom categories - For fallback categorization, the LLM is instructed to prefer configured categories and may create a new category only when no existing category fits ### LLM Provider Support (BYOK) The tool supports a provider adapter interface and ships adapters for: - **Claude/Anthropic** — Anthropic API - **OpenAI/GPT** — OpenAI API - **Local models** — Ollama, LM Studio - **OpenAI-compatible endpoints** — Mistral, Groq, Together, etc. Provider config includes: API key environment variable, base URL when required, model name, and optional provider parameters. ### Newsletter Parsing: Plugin System - Generic HTML parser as the default - Platform-specific parsers loaded as plugins (detected by URL patterns or email headers) - **Substack** shipped as the first plugin — maps Substack-specific HTML structures to the common extracted-link format - Additional parsers can be added as plugins without modifying core logic ## Output: Spreadsheet ### Supported Formats - **Google Sheets** — via Google Sheets API (live, shareable, updated by each write run) - **Local Excel (.xlsx)** — written to disk, can be uploaded manually Config selects which output(s) to use; both can be active simultaneously. ### Spreadsheet Name - Fixed name set in `config.yaml` (e.g., "Newsletter Link Catalog") ### Sheet Naming - Each newsletter gets its own sheet named after the parsed display name from the email's From header - Names truncated to fit Google Sheets' 100-character limit - Characters invalid for Google Sheets or Excel sheet names are replaced with spaces, then repeated whitespace is collapsed ### Content Sheet Columns Every link occurrence is written as a flat row; blank grouping rows are not used. Fields unavailable from the source are written as empty cells. | Column | Description | |---|---| | Issue Date | Date from email's Date header (overridable per-newsletter) | | Category | Assigned category (from newsletter sections, rules, or LLM) | | Link URL | Clean canonical URL after unwrapping and UTM removal | | Title | Anchor text / headline from the newsletter | | Description | 1-2 sentence description from the newsletter (if present) | | Page Title + Meta | `` and meta description from the destination page (enrichment phase) | | Source Newsletter | Name of the newsletter this link came from | | Also In | Cross-reference: other newsletters that also mentioned this link | ### Sponsor Sheet (Consolidated) Single sheet named "Sponsored Links" containing sponsor/ad links from all newsletters: | Column | Description | |---| | Newsletter | Which newsletter this sponsor link appeared in | | Sponsor | Sponsor name (parsed from newsletter) | | Link | Sponsor's link URL | | Description | Sponsor description from the newsletter | ### Dead Links Sheet Single sheet named "Dead Links" for links that returned errors during enrichment: | Column | Description | |---| | URL | The clean canonical URL | | Status | HTTP status or error type (404, 403, timeout, etc.) | | Source | Newsletter name | | Date | Issue date | ### Cross-References - Duplicates across newsletters are kept in their respective sheets (all occurrences preserved) - The **Also In** column annotates each row with other newsletter issues that mentioned the same normalized URL, formatted as `Newsletter Name (YYYY-MM-DD)` and joined with `; ` - This enables finding cross-newsletter coverage without a separate consolidated sheet ### No "All Links" Master Sheet Only per-newsletter content sheets, plus the consolidated Sponsor and Dead Links sheets. No "All Links" aggregation sheet. ## Enrichment ### Two-Phase Approach 1. **Phase 1 (Store)**: Extract links from newsletters, categorize, and write to spreadsheet with all available in-newsletter metadata 2. **Phase 2 (Enrich)**: Separate pass to fetch each link's destination page for `<title>` and meta description Enrichment can be run independently from extraction and spreadsheet writing. ### Enrichment Details - Configurable concurrency with defaults of 3 parallel requests and 1500 ms delay between batches - Retries on transient failures - Dead links (4xx/5xx) are written to the Dead Links sheet and removed from content sheets when they were already written by an earlier phase or run - Skip pages that redirect to a URL whose path or query contains `login`, `signin`, `subscribe`, or `paywall` — mark with "paywall" status - Progress bar updates after each completed enrichment request ### Link Liveness - Dead links are **not included** in content sheets — they go to the Dead Links sheet - Paywalled links are included in content sheets and the Page Title + Meta column is set to `[paywall]` - Timeout, DNS, TLS, and network failures are included in content sheets and the Page Title + Meta column is set to `[unreachable: error_type]` ## Processing Model ### Incremental Processing - Local state file (JSON) tracks processed Message-IDs and enrichment status - On subsequent runs, only new/unprocessed emails are fetched - `--full` flag forces reprocessing of all emails that match the configured label and any date filters - State file location: `~/.nlc/state.json` (or configured path) ### Date Filtering - `--from YYYY-MM-DD` and `--to YYYY-MM-DD` — absolute date range - `--last N` (e.g., `--last 30d`, `--last 7d`) — relative date range - Date filters apply before the incremental processed-message check - If both `--last` and `--from`/`--to` are provided, the CLI exits with a config error ### Dry Run - `--dry-run` processes the most recent N emails (default: 5) without writing to the spreadsheet - Shows what would be extracted, categorized, and written - Dry run does not update the state file or call destination pages for enrichment unless `--dry-run` is combined with `--enrich-only` ### Error Handling - **Critical errors** (Gmail auth failure, spreadsheet write failure, config errors) → stop execution - **Individual errors** (one link fails to enrich, one email fails to parse) → log and continue - Summary at end includes error counts and details ### Progress & Logging - Progress bar during processing (emails fetched, links extracted, enrichment status) - Summary stats at the end: newsletters processed, links extracted, duplicates found, dead links, sponsors, errors ## CLI Interface ### Commands ``` nlc init # Interactive setup: OAuth, config file, connectivity test nlc run [flags] # Main processing command ``` ### `nlc run` Flags | Flag | Description | Default | |---|---|---| | `--full` | Reprocess all emails, not just new ones | false | | `--dry-run [N]` | Process most recent N emails without writing to sheet | 5 | | `--from YYYY-MM-DD` | Process emails from this date | (none) | | `--to YYYY-MM-DD` | Process emails up to this date | (none) | | `--last N` | Process emails from last N days (e.g., `--last 30d`) | (none) | | `--skip-enrich` | Skip the enrichment phase (only extract + categorize) | false | | `--enrich-only` | Only run enrichment on already-extracted links | false | | `--config PATH` | Path to config file | `./config.yaml` | | `--verbose` | Detailed per-email and per-link output | false | ## Configuration ### File Format: YAML Location: `./config.yaml` (overridable with `--config`) ### Sample Structure ```yaml # Gmail settings gmail: folder: "Newsletters" # Gmail label/folder to process credentials: "~/.nlc/gmail-credentials.json" token: "~/.nlc/gmail-token.json" # Output settings output: name: "Newsletter Link Catalog" # Spreadsheet name sheets_api: enabled: true credentials: "~/.nlc/sheets-credentials.json" token: "~/.nlc/sheets-token.json" excel: enabled: true path: "./output/newsletter-catalog.xlsx" # Newsletter identification newsletters: # Manual overrides for parsed display names "alex@bytebytego.com": display_name: "ByteByteGo" "dan@techtakesweekly.com": display_name: "Tech Takes Weekly" # Link processing links: unwrap_redirects: true strip_utm: true tracking_params: - "utm_*" - "fbclid" - "gclid" - "mc_cid" - "mc_eid" redirect_limit: 5 read_more_pattern: "(?i)^(read more|continue reading|learn more)$" share_patterns: - "(?i)share" - "(?i)forward to a friend" sponsor_markers: - "(?i)sponsor" - "(?i)sponsored" - "(?i)advertisement" - "(?i)partner" filter_unsubscribe: true filter_social_footer: true filter_share_links: true merge_read_more: true # Categorization categories: # Built-in taxonomy is used by default; extend here custom: - "AI/ML" - "Career" - "Rust" # LLM settings for category inference llm: provider: "anthropic" # anthropic | openai | local | openai-compatible model: "claude-sonnet-4-6" api_key_env: "ANTHROPIC_API_KEY" base_url: null # for local/openai-compatible failure_category: "Uncategorized" # Enrichment enrichment: enabled: true concurrency: 3 delay_ms: 1500 retries: 2 timeout_ms: 10000 # Rate limiting (applies to both Gmail API and enrichment) rate_limit: gmail_qps: 5 # queries per second to Gmail API link_concurrency: 3 # parallel link fetches # State state_file: "~/.nlc/state.json" # Parsing plugins plugins: substack: enabled: true ``` ### Issue Date Override For newsletters where the email arrival date doesn't match the issue date, overrides can be configured: ```yaml newsletters: "sender@domain.com": display_name: "Newsletter Name" date_override: "subject" # Parse date from subject line date_format: "%B %d, %Y" # Expected date format in subject ``` ## Data Flow ``` ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │ Gmail API │────▶│ Parse HTML │────▶│ Categorize │────▶│ Write Sheet │ │ (fetch) │ │ + Extract │ │ (hybrid) │ │ (Phase 1) │ └─────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │ │ ▼ ▼ ┌──────────────┐ ┌──────────────┐ │ State File │ │ Enrichment │ │ (processed │ │ (Phase 2) │ │ tracking) │ │ Page titles │ └──────────────┘ └──────────────┘ ``` ## Edge Cases | Scenario | Behavior | |---|---| | Email is a single image with no links | Skip with warning, log to state | | "View in browser" link instead of content | Fetch the first matching mirror link, extract links from that HTML | | Same link in multiple newsletters | Keep all occurrences, cross-reference via "Also In" column | | Same link multiple times in one issue | Deduplicate per-issue; single row per unique URL | | Link returns 4xx/5xx during enrichment | Move to Dead Links sheet | | Link is paywalled/auth-required | Include in content sheet, mark Page Title + Meta as "[paywall]" | | Link times out or has a network error | Include in content sheet, mark Page Title + Meta as "[unreachable: error_type]" | | Newsletter name > 100 chars | Truncate for sheet name | | Sheet already exists for newsletter | Append new rows, don't overwrite existing data | | Gmail API rate limit | Retry with exponential backoff | | OAuth token expired | Auto-refresh, re-prompt if refresh fails | | Newsletter format changes | Parser falls back to generic HTML extraction | ## Setup & First Run 1. **`nlc init`** — Interactive walkthrough: - Authenticate with Gmail (OAuth browser flow) - Authenticate with Google Sheets (if using Sheets output) - Select the Gmail folder/label to process - Configure output location - Test connectivity - Generate `config.yaml` 2. **`nlc run --dry-run`** — Test with 5 most recent emails 3. **`nlc run`** — Full processing run 4. **`nlc run --enrich-only`** — Enrich previously extracted links with page titles