Files
2026-05-16 14:02:49 -05:00

15 KiB

Newsletter Link Catalog — Specification

Overview

A CLI tool that extracts links from newsletters in a designated Gmail folder, categorizes them, enriches them with metadata, and compiles them into a spreadsheet. Each newsletter gets its own sheet, links are organized by issue date and category, and sponsor links are tracked separately.

Architecture

Language & Runtime

  • TypeScript/Node.js — compiled to a standalone binary via pkg or tsx-bundle
  • CLI tool invoked as nlc run [flags]

Distribution

  • Standalone binary — no Node runtime required on the host machine
  • Built and packaged via CI or build script

Run Modes

  • Manual: Run nlc run on demand with optional date filters
  • Scheduled: Can be run via cron/Task Scheduler for recurring processing
  • Designed for both; no daemon mode required

Gmail Integration

Authentication

  • OAuth2 browser flow — user authorizes via browser, tokens persisted locally
  • nlc init command walks through OAuth setup interactively

Scope

  • Processes emails from a single designated Gmail folder/label (configured in config.yaml)
  • Does not scan the entire inbox or search by sender patterns

Email Processing

  • HTML only — plain-text parts are ignored
  • Image-only emails (single image, no extractable links) are skipped with a warning logged
  • "View in browser" emails — fetches the web version's HTML and extracts links from that instead
  • Incremental by default: tracks processed Message-IDs in a local state file, only processes new emails
  • --full flag forces reprocessing of all emails

Extraction Pipeline

  1. Fetch emails from the configured Gmail folder (incremental or full)
  2. Parse HTML to extract links, section headers, and surrounding text
  3. Filter out noise links: unsubscribe, social footer icons, "share this newsletter" links
  4. Unwrap tracking redirects and strip UTM parameters — store only the clean canonical URL
  5. Merge "Read more" links with their preceding content (detected by: same URL + "read more" anchor text)
  6. Categorize each link (see Categorization section)
  7. Write to spreadsheet (see Output section)

Noise Filtering

The following link types are excluded from content sheets:

  • Unsubscribe links
  • Social media footer links (Twitter, LinkedIn, etc.)
  • "Share this newsletter" / "Forward to a friend" links
  • "View in browser" mirror links (content is extracted from the web version instead)

Sponsor/ad links are not filtered — they go to a separate sheet.

URL Handling

  • Unwrap all tracking redirects (Mailchimp, Substack, etc.)
  • Strip UTM parameters and other tracking query params
  • Store only the clean canonical URL
  • Dead/broken links (4xx/5xx during enrichment) are moved to a separate "Dead Links" sheet

"Read More" Merging

When two consecutive elements point to the same URL and one has "read more" (or similar) anchor text, they are merged into a single entry combining the preceding description text and the link.

Categorization

Strategy: Hybrid

  1. Primary: Use the newsletter's own section headers (e.g., "Python", "DevOps", "Career") as categories
  2. Fallback: When section headers aren't available or don't cover a link, use rule-based classification (URL patterns + keywords)
  3. Final fallback: LLM-based categorization when rules don't match

Category Taxonomy

  • LLM-generated by default — the model assigns categories based on link content
  • Built-in base taxonomy shipped with the tool for common dev categories (Python, JavaScript, DevOps, Security, etc.)
  • User can extend via config with custom categories
  • LLM is instructed to prefer existing categories and only create new ones when nothing fits

LLM Provider Support (BYOK)

All providers supported, configurable in config.yaml:

  • Claude/Anthropic — Anthropic API
  • OpenAI/GPT — OpenAI API
  • Local models — Ollama, LM Studio
  • OpenAI-compatible endpoints — Mistral, Groq, Together, etc.

Provider config includes: API key, base URL, model name, and optional parameters.

Newsletter Parsing: Plugin System

  • Generic HTML parser as the default
  • Platform-specific parsers loaded as plugins (detected by URL patterns or email headers)
  • Substack shipped as the first plugin — uses Substack's predictable HTML structure for more reliable extraction
  • Additional parsers can be added as plugins without modifying core logic

Output: Spreadsheet

Supported Formats

  • Google Sheets — via Google Sheets API (live, shareable, auto-updated)
  • Local Excel (.xlsx) — written to disk, can be uploaded manually

Config selects which output(s) to use; both can be active simultaneously.

Spreadsheet Name

  • Fixed name set in config.yaml (e.g., "Newsletter Link Catalog")

Sheet Naming

  • Each newsletter gets its own sheet named after the parsed display name from the email's From header
  • Names truncated to fit Google Sheets' 100-character limit
  • Special characters replaced as needed for sheet name validity

Content Sheet Columns

Every row is fully populated (flat table — no blank cells for grouping):

Column Description
Issue Date Date from email's Date header (overridable per-newsletter)
Category Assigned category (from newsletter sections, rules, or LLM)
Link URL Clean canonical URL after unwrapping and UTM removal
Title Anchor text / headline from the newsletter
Description 1-2 sentence description from the newsletter (if present)
Page Title + Meta <title> and meta description from the destination page (enrichment phase)
Source Newsletter Name of the newsletter this link came from
Also In Cross-reference: other newsletters that also mentioned this link

Sponsor Sheet (Consolidated)

Single sheet named "Sponsored Links" containing sponsor/ad links from all newsletters:

| Column | Description | |---| | Newsletter | Which newsletter this sponsor link appeared in | | Sponsor | Sponsor name (parsed from newsletter) | | Link | Sponsor's link URL | | Description | Sponsor description from the newsletter |

Single sheet named "Dead Links" for links that returned errors during enrichment:

| Column | Description | |---| | URL | The clean canonical URL | | Status | HTTP status or error type (404, 403, timeout, etc.) | | Source | Newsletter name | | Date | Issue date |

Cross-References

  • Duplicates across newsletters are kept in their respective sheets (all occurrences preserved)
  • The Also In column annotates each row with which other newsletters mentioned the same link and when (e.g., "TLDR Web Dev (Mar 5)")
  • This enables finding cross-newsletter coverage without a separate consolidated sheet

Only per-newsletter content sheets, plus the consolidated Sponsor and Dead Links sheets. No "All Links" aggregation sheet.

Enrichment

Two-Phase Approach

  1. Phase 1 (Store): Extract links from newsletters, categorize, and write to spreadsheet with all available in-newsletter metadata
  2. Phase 2 (Enrich): Separate pass to fetch each link's destination page for <title> and meta description

This keeps the initial run fast and allows enrichment to be run independently.

Enrichment Details

  • Configurable concurrency (safe defaults: 3-5 parallel, 1-2s delay between batches)
  • Retries on transient failures
  • Dead links (4xx/5xx) moved to Dead Links sheet
  • Skip paywalled/auth-required pages (detected by login redirects) — mark with "paywall" status
  • Progress bar shows enrichment status in real-time
  • Dead links are not included in content sheets — they go to the Dead Links sheet
  • Paywalled/unreachable links are included in content sheets but flagged in the Page Title + Meta column

Processing Model

Incremental Processing

  • Local state file (JSON) tracks processed Message-IDs and enrichment status
  • On subsequent runs, only new/unprocessed emails are fetched
  • --full flag forces reprocessing of all emails
  • State file location: ~/.nlc/state.json (or configured path)

Date Filtering

  • --from YYYY-MM-DD and --to YYYY-MM-DD — absolute date range
  • --last N (e.g., --last 30d, --last 7d) — relative date range
  • Can be combined with incremental processing

Dry Run

  • --dry-run processes the most recent X emails (default: 5) without writing to the spreadsheet
  • Shows what would be extracted, categorized, and written
  • Useful for testing config changes and parser tweaks

Error Handling

  • Critical errors (Gmail auth failure, spreadsheet write failure, config errors) → stop execution
  • Individual errors (one link fails to enrich, one email fails to parse) → log and continue
  • Summary at end includes error counts and details

Progress & Logging

  • Progress bar during processing (emails fetched, links extracted, enrichment status)
  • Summary stats at the end: newsletters processed, links extracted, duplicates found, dead links, sponsors, errors

CLI Interface

Commands

nlc init          # Interactive setup: OAuth, config file, connectivity test
nlc run [flags]   # Main processing command

nlc run Flags

Flag Description Default
--full Reprocess all emails, not just new ones false
--dry-run [N] Process most recent N emails without writing to sheet 5
--from YYYY-MM-DD Process emails from this date (none)
--to YYYY-MM-DD Process emails up to this date (none)
--last N Process emails from last N days (e.g., --last 30d) (none)
--skip-enrich Skip the enrichment phase (only extract + categorize) false
--enrich-only Only run enrichment on already-extracted links false
--config PATH Path to config file ./config.yaml
--verbose Detailed per-email and per-link output false

Configuration

File Format: YAML

Location: ./config.yaml (overridable with --config)

Sample Structure

# Gmail settings
gmail:
  folder: "Newsletters"           # Gmail label/folder to process
  credentials: "~/.nlc/gmail-credentials.json"
  token: "~/.nlc/gmail-token.json"

# Output settings
output:
  name: "Newsletter Link Catalog"  # Spreadsheet name
  sheets_api:
    enabled: true
    credentials: "~/.nlc/sheets-credentials.json"
    token: "~/.nlc/sheets-token.json"
  excel:
    enabled: true
    path: "./output/newsletter-catalog.xlsx"

# Newsletter identification
newsletters:
  # Manual overrides for parsed display names
  # sender_pattern: "display_name"
  "alex@bytebytego.com": "ByteByteGo"
  "dan@techtakesweekly.com": "Tech Takes Weekly"

# Link processing
links:
  unwrap_redirects: true
  strip_utm: true
  filter_unsubscribe: true
  filter_social_footer: true
  filter_share_links: true
  merge_read_more: true

# Categorization
categories:
  # Built-in taxonomy is used by default; extend here
  custom:
    - "AI/ML"
    - "Career"
    - "Rust"
  # LLM settings for category inference
  llm:
    provider: "anthropic"         # anthropic | openai | local | openai-compatible
    model: "claude-sonnet-4-6"
    api_key_env: "ANTHROPIC_API_KEY"  # or set in env
    base_url: null                # for local/openai-compatible
    fallback_to_rules: true       # if LLM fails, use rule-based

# Enrichment
enrichment:
  enabled: true
  concurrency: 3
  delay_ms: 1500
  retries: 2
  timeout_ms: 10000

# Rate limiting (applies to both Gmail API and enrichment)
rate_limit:
  gmail_qps: 5                    # queries per second to Gmail API
  link_concurrency: 3             # parallel link fetches

# State
state_file: "~/.nlc/state.json"

# Parsing plugins
plugins:
  substack:
    enabled: true

Issue Date Override

For newsletters where the email arrival date doesn't match the issue date, overrides can be configured:

newsletters:
  "sender@domain.com":
    display_name: "Newsletter Name"
    date_override: "subject"      # Parse date from subject line
    date_format: "%B %d, %Y"      # Expected date format in subject

Data Flow

┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Gmail API  │────▶│ Parse HTML   │────▶│  Categorize  │────▶│ Write Sheet  │
│  (fetch)    │     │ + Extract    │     │  (hybrid)    │     │ (Phase 1)    │
└─────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
                           │                                          │
                           ▼                                          ▼
                    ┌──────────────┐                           ┌──────────────┐
                    │  State File  │                           │  Enrichment  │
                    │  (processed  │                           │  (Phase 2)   │
                    │   tracking)  │                           │  Page titles  │
                    └──────────────┘                           └──────────────┘

Edge Cases

Scenario Behavior
Email is a single image with no links Skip with warning, log to state
"View in browser" link instead of content Fetch web version HTML, extract links from that
Same link in multiple newsletters Keep all occurrences, cross-reference via "Also In" column
Same link multiple times in one issue Deduplicate per-issue; single row per unique URL
Link returns 4xx/5xx during enrichment Move to Dead Links sheet
Link is paywalled/auth-required Include in content sheet, mark Page Title as "[paywall]"
Newsletter name > 100 chars Truncate for sheet name
Sheet already exists for newsletter Append new rows, don't overwrite existing data
Gmail API rate limit Retry with exponential backoff
OAuth token expired Auto-refresh, re-prompt if refresh fails
Newsletter format changes Parser falls back to generic HTML extraction

Setup & First Run

  1. nlc init — Interactive walkthrough:

    • Authenticate with Gmail (OAuth browser flow)
    • Authenticate with Google Sheets (if using Sheets output)
    • Select the Gmail folder/label to process
    • Configure output location
    • Test connectivity
    • Generate config.yaml
  2. nlc run --dry-run — Test with 5 most recent emails

  3. nlc run — Full processing run

  4. nlc run --enrich-only — Enrich previously extracted links with page titles

Future Considerations

These are not in scope for v1 but noted for potential future work:

  • Search/filter functionality within the spreadsheet
  • Web UI for browsing the catalog
  • Email forwarding as an alternative to Gmail API access
  • Automatic category taxonomy refinement based on accumulated data
  • Plugin system for additional newsletter platforms beyond Substack
  • Notification on new newsletter processing