Files
2026-05-16 14:02:49 -05:00

17 KiB

Newsletter Link Catalog — Specification

Overview

A CLI tool that extracts links from newsletters in a designated Gmail folder, categorizes them, enriches them with metadata, and compiles them into a spreadsheet. Each newsletter gets its own sheet, links are organized by issue date and category, and sponsor links are tracked separately.

Architecture

Language & Runtime

  • TypeScript/Node.js — compiled to a standalone binary by the project build script
  • CLI tool invoked as nlc run [flags]

Distribution

  • Standalone binary — no Node runtime required on the host machine
  • Built and packaged via CI or build script
  • The build script must document the selected bundling tool and produce the binary from a clean checkout

Run Modes

  • Manual: Run nlc run on demand with optional date filters
  • Scheduled: Can be run via cron/Task Scheduler for recurring processing
  • Designed for both; no daemon mode required

Gmail Integration

Authentication

  • OAuth2 browser flow — user authorizes via browser, tokens persisted locally
  • nlc init command walks through OAuth setup interactively

Scope

  • Processes emails from a single designated Gmail folder/label (configured in config.yaml)
  • Does not scan the entire inbox or search by sender patterns

Email Processing

  • HTML only — plain-text parts are ignored
  • Image-only emails (single image, no extractable links) are skipped with a warning logged
  • "View in browser" emails — if the email contains no content links after noise filtering and contains a mirror link with anchor text matching view in browser, view online, or read online, fetch that mirror URL and extract links from the fetched HTML instead
  • Incremental by default: tracks processed Message-IDs in a local state file, only processes new emails
  • --full flag forces reprocessing of all emails that match the configured label and any date filters

Extraction Pipeline

  1. Fetch emails from the configured Gmail folder (incremental or full)
  2. Parse HTML to extract links, section headers, and surrounding text. A section header is the nearest preceding heading-like element (h1-h6, table row header, or bold standalone line) within the same content block.
  3. Filter out noise links: unsubscribe, social footer icons, "share this newsletter" links
  4. Unwrap supported tracking redirects and strip configured tracking query parameters — store the normalized destination URL
  5. Merge "Read more" links with their preceding content (detected by: consecutive links with the same normalized URL and anchor text matching the configured read-more pattern)
  6. Categorize each link (see Categorization section)
  7. Write to spreadsheet (see Output section)

Noise Filtering

The following link types are excluded from content sheets:

  • Unsubscribe links
  • Social media links in footer or sharing blocks
  • Links whose anchor text or accessible label matches configured share/forward patterns
  • "View in browser" mirror links (content is extracted from the web version instead)

Sponsor/ad links are not filtered — they go to a separate sheet when the link is inside a block labeled with configured sponsor markers such as "sponsor", "sponsored", "ad", "advertisement", or "partner".

URL Handling

  • Unwrap HTTP redirects and supported provider redirect URLs up to the configured redirect limit
  • Strip configured tracking query parameters, including utm_*, fbclid, gclid, mc_cid, mc_eid, and provider-specific tracking parameters listed in config
  • Store the normalized destination URL after redirect unwrapping and query cleanup
  • Dead/broken links (4xx/5xx during enrichment) are written to the "Dead Links" sheet and removed from content sheets when they were already written by an earlier phase or run

"Read More" Merging

When two consecutive extracted links point to the same normalized URL and one anchor text matches the configured read-more pattern, they are merged into a single entry combining the preceding link title/description with the read-more link URL.

Categorization

Strategy: Hybrid

  1. Primary: Use the newsletter's own section headers (e.g., "Python", "DevOps", "Career") as categories
  2. Fallback: When section headers aren't available or don't cover a link, use rule-based classification (URL patterns + keywords)
  3. Final fallback: LLM-based categorization when rules don't match

Category Taxonomy

  • Built-in base taxonomy shipped with the tool for common dev categories (Python, JavaScript, DevOps, Security, etc.)
  • User can extend via config with custom categories
  • For fallback categorization, the LLM is instructed to prefer configured categories and may create a new category only when no existing category fits

LLM Provider Support (BYOK)

The tool supports a provider adapter interface and ships adapters for:

  • Claude/Anthropic — Anthropic API
  • OpenAI/GPT — OpenAI API
  • Local models — Ollama, LM Studio
  • OpenAI-compatible endpoints — Mistral, Groq, Together, etc.

Provider config includes: API key environment variable, base URL when required, model name, and optional provider parameters.

Newsletter Parsing: Plugin System

  • Generic HTML parser as the default
  • Platform-specific parsers loaded as plugins (detected by URL patterns or email headers)
  • Substack shipped as the first plugin — maps Substack-specific HTML structures to the common extracted-link format
  • Additional parsers can be added as plugins without modifying core logic

Output: Spreadsheet

Supported Formats

  • Google Sheets — via Google Sheets API (live, shareable, updated by each write run)
  • Local Excel (.xlsx) — written to disk, can be uploaded manually

Config selects which output(s) to use; both can be active simultaneously.

Spreadsheet Name

  • Fixed name set in config.yaml (e.g., "Newsletter Link Catalog")

Sheet Naming

  • Each newsletter gets its own sheet named after the parsed display name from the email's From header
  • Names truncated to fit Google Sheets' 100-character limit
  • Characters invalid for Google Sheets or Excel sheet names are replaced with spaces, then repeated whitespace is collapsed

Content Sheet Columns

Every link occurrence is written as a flat row; blank grouping rows are not used. Fields unavailable from the source are written as empty cells.

Column Description
Issue Date Date from email's Date header (overridable per-newsletter)
Category Assigned category (from newsletter sections, rules, or LLM)
Link URL Clean canonical URL after unwrapping and UTM removal
Title Anchor text / headline from the newsletter
Description 1-2 sentence description from the newsletter (if present)
Page Title + Meta <title> and meta description from the destination page (enrichment phase)
Source Newsletter Name of the newsletter this link came from
Also In Cross-reference: other newsletters that also mentioned this link

Sponsor Sheet (Consolidated)

Single sheet named "Sponsored Links" containing sponsor/ad links from all newsletters:

| Column | Description | |---| | Newsletter | Which newsletter this sponsor link appeared in | | Sponsor | Sponsor name (parsed from newsletter) | | Link | Sponsor's link URL | | Description | Sponsor description from the newsletter |

Single sheet named "Dead Links" for links that returned errors during enrichment:

| Column | Description | |---| | URL | The clean canonical URL | | Status | HTTP status or error type (404, 403, timeout, etc.) | | Source | Newsletter name | | Date | Issue date |

Cross-References

  • Duplicates across newsletters are kept in their respective sheets (all occurrences preserved)
  • The Also In column annotates each row with other newsletter issues that mentioned the same normalized URL, formatted as Newsletter Name (YYYY-MM-DD) and joined with ;
  • This enables finding cross-newsletter coverage without a separate consolidated sheet

Only per-newsletter content sheets, plus the consolidated Sponsor and Dead Links sheets. No "All Links" aggregation sheet.

Enrichment

Two-Phase Approach

  1. Phase 1 (Store): Extract links from newsletters, categorize, and write to spreadsheet with all available in-newsletter metadata
  2. Phase 2 (Enrich): Separate pass to fetch each link's destination page for <title> and meta description

Enrichment can be run independently from extraction and spreadsheet writing.

Enrichment Details

  • Configurable concurrency with defaults of 3 parallel requests and 1500 ms delay between batches
  • Retries on transient failures
  • Dead links (4xx/5xx) are written to the Dead Links sheet and removed from content sheets when they were already written by an earlier phase or run
  • Skip pages that redirect to a URL whose path or query contains login, signin, subscribe, or paywall — mark with "paywall" status
  • Progress bar updates after each completed enrichment request
  • Dead links are not included in content sheets — they go to the Dead Links sheet
  • Paywalled links are included in content sheets and the Page Title + Meta column is set to [paywall]
  • Timeout, DNS, TLS, and network failures are included in content sheets and the Page Title + Meta column is set to [unreachable: error_type]

Processing Model

Incremental Processing

  • Local state file (JSON) tracks processed Message-IDs and enrichment status
  • On subsequent runs, only new/unprocessed emails are fetched
  • --full flag forces reprocessing of all emails that match the configured label and any date filters
  • State file location: ~/.nlc/state.json (or configured path)

Date Filtering

  • --from YYYY-MM-DD and --to YYYY-MM-DD — absolute date range
  • --last N (e.g., --last 30d, --last 7d) — relative date range
  • Date filters apply before the incremental processed-message check
  • If both --last and --from/--to are provided, the CLI exits with a config error

Dry Run

  • --dry-run processes the most recent N emails (default: 5) without writing to the spreadsheet
  • Shows what would be extracted, categorized, and written
  • Dry run does not update the state file or call destination pages for enrichment unless --dry-run is combined with --enrich-only

Error Handling

  • Critical errors (Gmail auth failure, spreadsheet write failure, config errors) → stop execution
  • Individual errors (one link fails to enrich, one email fails to parse) → log and continue
  • Summary at end includes error counts and details

Progress & Logging

  • Progress bar during processing (emails fetched, links extracted, enrichment status)
  • Summary stats at the end: newsletters processed, links extracted, duplicates found, dead links, sponsors, errors

CLI Interface

Commands

nlc init          # Interactive setup: OAuth, config file, connectivity test
nlc run [flags]   # Main processing command

nlc run Flags

Flag Description Default
--full Reprocess all emails, not just new ones false
--dry-run [N] Process most recent N emails without writing to sheet 5
--from YYYY-MM-DD Process emails from this date (none)
--to YYYY-MM-DD Process emails up to this date (none)
--last N Process emails from last N days (e.g., --last 30d) (none)
--skip-enrich Skip the enrichment phase (only extract + categorize) false
--enrich-only Only run enrichment on already-extracted links false
--config PATH Path to config file ./config.yaml
--verbose Detailed per-email and per-link output false

Configuration

File Format: YAML

Location: ./config.yaml (overridable with --config)

Sample Structure

# Gmail settings
gmail:
  folder: "Newsletters"           # Gmail label/folder to process
  credentials: "~/.nlc/gmail-credentials.json"
  token: "~/.nlc/gmail-token.json"

# Output settings
output:
  name: "Newsletter Link Catalog"  # Spreadsheet name
  sheets_api:
    enabled: true
    credentials: "~/.nlc/sheets-credentials.json"
    token: "~/.nlc/sheets-token.json"
  excel:
    enabled: true
    path: "./output/newsletter-catalog.xlsx"

# Newsletter identification
newsletters:
  # Manual overrides for parsed display names
  "alex@bytebytego.com":
    display_name: "ByteByteGo"
  "dan@techtakesweekly.com":
    display_name: "Tech Takes Weekly"

# Link processing
links:
  unwrap_redirects: true
  strip_utm: true
  tracking_params:
    - "utm_*"
    - "fbclid"
    - "gclid"
    - "mc_cid"
    - "mc_eid"
  redirect_limit: 5
  read_more_pattern: "(?i)^(read more|continue reading|learn more)$"
  share_patterns:
    - "(?i)share"
    - "(?i)forward to a friend"
  sponsor_markers:
    - "(?i)sponsor"
    - "(?i)sponsored"
    - "(?i)advertisement"
    - "(?i)partner"
  filter_unsubscribe: true
  filter_social_footer: true
  filter_share_links: true
  merge_read_more: true

# Categorization
categories:
  # Built-in taxonomy is used by default; extend here
  custom:
    - "AI/ML"
    - "Career"
    - "Rust"
  # LLM settings for category inference
  llm:
    provider: "anthropic"         # anthropic | openai | local | openai-compatible
    model: "claude-sonnet-4-6"
    api_key_env: "ANTHROPIC_API_KEY"
    base_url: null                # for local/openai-compatible
    failure_category: "Uncategorized"

# Enrichment
enrichment:
  enabled: true
  concurrency: 3
  delay_ms: 1500
  retries: 2
  timeout_ms: 10000

# Rate limiting (applies to both Gmail API and enrichment)
rate_limit:
  gmail_qps: 5                    # queries per second to Gmail API
  link_concurrency: 3             # parallel link fetches

# State
state_file: "~/.nlc/state.json"

# Parsing plugins
plugins:
  substack:
    enabled: true

Issue Date Override

For newsletters where the email arrival date doesn't match the issue date, overrides can be configured:

newsletters:
  "sender@domain.com":
    display_name: "Newsletter Name"
    date_override: "subject"      # Parse date from subject line
    date_format: "%B %d, %Y"      # Expected date format in subject

Data Flow

┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Gmail API  │────▶│ Parse HTML   │────▶│  Categorize  │────▶│ Write Sheet  │
│  (fetch)    │     │ + Extract    │     │  (hybrid)    │     │ (Phase 1)    │
└─────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
                           │                                          │
                           ▼                                          ▼
                    ┌──────────────┐                           ┌──────────────┐
                    │  State File  │                           │  Enrichment  │
                    │  (processed  │                           │  (Phase 2)   │
                    │   tracking)  │                           │  Page titles  │
                    └──────────────┘                           └──────────────┘

Edge Cases

Scenario Behavior
Email is a single image with no links Skip with warning, log to state
"View in browser" link instead of content Fetch the first matching mirror link, extract links from that HTML
Same link in multiple newsletters Keep all occurrences, cross-reference via "Also In" column
Same link multiple times in one issue Deduplicate per-issue; single row per unique URL
Link returns 4xx/5xx during enrichment Move to Dead Links sheet
Link is paywalled/auth-required Include in content sheet, mark Page Title + Meta as "[paywall]"
Link times out or has a network error Include in content sheet, mark Page Title + Meta as "[unreachable: error_type]"
Newsletter name > 100 chars Truncate for sheet name
Sheet already exists for newsletter Append new rows, don't overwrite existing data
Gmail API rate limit Retry with exponential backoff
OAuth token expired Auto-refresh, re-prompt if refresh fails
Newsletter format changes Parser falls back to generic HTML extraction

Setup & First Run

  1. nlc init — Interactive walkthrough:

    • Authenticate with Gmail (OAuth browser flow)
    • Authenticate with Google Sheets (if using Sheets output)
    • Select the Gmail folder/label to process
    • Configure output location
    • Test connectivity
    • Generate config.yaml
  2. nlc run --dry-run — Test with 5 most recent emails

  3. nlc run — Full processing run

  4. nlc run --enrich-only — Enrich previously extracted links with page titles