Newsletter-Link-Catalog/notes/SPEC.md at main

keith/Newsletter-Link-Catalog

Fork 0

Files

T

Keith Solomon 265f69d95a ✨feature: First push to git

2026-05-16 14:02:49 -05:00

17 KiB

Raw Permalink Blame History

Overview

A CLI tool that extracts links from newsletters in a designated Gmail folder, categorizes them, enriches them with metadata, and compiles them into a spreadsheet. Each newsletter gets its own sheet, links are organized by issue date and category, and sponsor links are tracked separately.

Architecture

Language & Runtime

TypeScript/Node.js — compiled to a standalone binary by the project build script
CLI tool invoked as nlc run [flags]

Distribution

Standalone binary — no Node runtime required on the host machine
Built and packaged via CI or build script
The build script must document the selected bundling tool and produce the binary from a clean checkout

Run Modes

Manual: Run nlc run on demand with optional date filters
Scheduled: Can be run via cron/Task Scheduler for recurring processing
Designed for both; no daemon mode required

Gmail Integration

Authentication

OAuth2 browser flow — user authorizes via browser, tokens persisted locally
nlc init command walks through OAuth setup interactively

Scope

Processes emails from a single designated Gmail folder/label (configured in config.yaml)
Does not scan the entire inbox or search by sender patterns

Email Processing

HTML only — plain-text parts are ignored
Image-only emails (single image, no extractable links) are skipped with a warning logged
"View in browser" emails — if the email contains no content links after noise filtering and contains a mirror link with anchor text matching view in browser, view online, or read online, fetch that mirror URL and extract links from the fetched HTML instead
Incremental by default: tracks processed Message-IDs in a local state file, only processes new emails
--full flag forces reprocessing of all emails that match the configured label and any date filters

Link Extraction & Processing

Extraction Pipeline

Fetch emails from the configured Gmail folder (incremental or full)
Parse HTML to extract links, section headers, and surrounding text. A section header is the nearest preceding heading-like element (h1-h6, table row header, or bold standalone line) within the same content block.
Filter out noise links: unsubscribe, social footer icons, "share this newsletter" links
Unwrap supported tracking redirects and strip configured tracking query parameters — store the normalized destination URL
Merge "Read more" links with their preceding content (detected by: consecutive links with the same normalized URL and anchor text matching the configured read-more pattern)
Categorize each link (see Categorization section)
Write to spreadsheet (see Output section)

Noise Filtering

The following link types are excluded from content sheets:

Unsubscribe links
Social media links in footer or sharing blocks
Links whose anchor text or accessible label matches configured share/forward patterns
"View in browser" mirror links (content is extracted from the web version instead)

Sponsor/ad links are not filtered — they go to a separate sheet when the link is inside a block labeled with configured sponsor markers such as "sponsor", "sponsored", "ad", "advertisement", or "partner".

URL Handling

Unwrap HTTP redirects and supported provider redirect URLs up to the configured redirect limit
Strip configured tracking query parameters, including utm_*, fbclid, gclid, mc_cid, mc_eid, and provider-specific tracking parameters listed in config
Store the normalized destination URL after redirect unwrapping and query cleanup
Dead/broken links (4xx/5xx during enrichment) are written to the "Dead Links" sheet and removed from content sheets when they were already written by an earlier phase or run

"Read More" Merging

When two consecutive extracted links point to the same normalized URL and one anchor text matches the configured read-more pattern, they are merged into a single entry combining the preceding link title/description with the read-more link URL.

Categorization

Strategy: Hybrid

Primary: Use the newsletter's own section headers (e.g., "Python", "DevOps", "Career") as categories
Fallback: When section headers aren't available or don't cover a link, use rule-based classification (URL patterns + keywords)
Final fallback: LLM-based categorization when rules don't match

Category Taxonomy

Built-in base taxonomy shipped with the tool for common dev categories (Python, JavaScript, DevOps, Security, etc.)
User can extend via config with custom categories
For fallback categorization, the LLM is instructed to prefer configured categories and may create a new category only when no existing category fits

LLM Provider Support (BYOK)

The tool supports a provider adapter interface and ships adapters for:

Claude/Anthropic — Anthropic API
OpenAI/GPT — OpenAI API
Local models — Ollama, LM Studio
OpenAI-compatible endpoints — Mistral, Groq, Together, etc.

Provider config includes: API key environment variable, base URL when required, model name, and optional provider parameters.

Generic HTML parser as the default
Platform-specific parsers loaded as plugins (detected by URL patterns or email headers)
Substack shipped as the first plugin — maps Substack-specific HTML structures to the common extracted-link format
Additional parsers can be added as plugins without modifying core logic

Output: Spreadsheet

Supported Formats

Google Sheets — via Google Sheets API (live, shareable, updated by each write run)
Local Excel (.xlsx) — written to disk, can be uploaded manually

Config selects which output(s) to use; both can be active simultaneously.

Spreadsheet Name

Fixed name set in config.yaml (e.g., "Newsletter Link Catalog")

Sheet Naming

Each newsletter gets its own sheet named after the parsed display name from the email's From header
Names truncated to fit Google Sheets' 100-character limit
Characters invalid for Google Sheets or Excel sheet names are replaced with spaces, then repeated whitespace is collapsed

Content Sheet Columns

Every link occurrence is written as a flat row; blank grouping rows are not used. Fields unavailable from the source are written as empty cells.

Column	Description
Issue Date	Date from email's Date header (overridable per-newsletter)
Category	Assigned category (from newsletter sections, rules, or LLM)
Link URL	Clean canonical URL after unwrapping and UTM removal
Title	Anchor text / headline from the newsletter
Description	1-2 sentence description from the newsletter (if present)
Page Title + Meta	`<title>` and meta description from the destination page (enrichment phase)
Source Newsletter	Name of the newsletter this link came from
Also In	Cross-reference: other newsletters that also mentioned this link

Single sheet named "Sponsored Links" containing sponsor/ad links from all newsletters:

| Column | Description | |---| | Newsletter | Which newsletter this sponsor link appeared in | | Sponsor | Sponsor name (parsed from newsletter) | | Link | Sponsor's link URL | | Description | Sponsor description from the newsletter |

Dead Links Sheet

Single sheet named "Dead Links" for links that returned errors during enrichment:

| Column | Description | |---| | URL | The clean canonical URL | | Status | HTTP status or error type (404, 403, timeout, etc.) | | Source | Newsletter name | | Date | Issue date |

Cross-References

Duplicates across newsletters are kept in their respective sheets (all occurrences preserved)
The Also In column annotates each row with other newsletter issues that mentioned the same normalized URL, formatted as Newsletter Name (YYYY-MM-DD) and joined with ;
This enables finding cross-newsletter coverage without a separate consolidated sheet

No "All Links" Master Sheet

Only per-newsletter content sheets, plus the consolidated Sponsor and Dead Links sheets. No "All Links" aggregation sheet.

Enrichment

Two-Phase Approach

Phase 1 (Store): Extract links from newsletters, categorize, and write to spreadsheet with all available in-newsletter metadata
Phase 2 (Enrich): Separate pass to fetch each link's destination page for <title> and meta description

Enrichment can be run independently from extraction and spreadsheet writing.

Enrichment Details

Configurable concurrency with defaults of 3 parallel requests and 1500 ms delay between batches
Retries on transient failures
Dead links (4xx/5xx) are written to the Dead Links sheet and removed from content sheets when they were already written by an earlier phase or run
Skip pages that redirect to a URL whose path or query contains login, signin, subscribe, or paywall — mark with "paywall" status
Progress bar updates after each completed enrichment request

Link Liveness

Dead links are not included in content sheets — they go to the Dead Links sheet
Paywalled links are included in content sheets and the Page Title + Meta column is set to [paywall]
Timeout, DNS, TLS, and network failures are included in content sheets and the Page Title + Meta column is set to [unreachable: error_type]

Processing Model

Incremental Processing

Local state file (JSON) tracks processed Message-IDs and enrichment status
On subsequent runs, only new/unprocessed emails are fetched
--full flag forces reprocessing of all emails that match the configured label and any date filters
State file location: ~/.nlc/state.json (or configured path)

Date Filtering

--from YYYY-MM-DD and --to YYYY-MM-DD — absolute date range
--last N (e.g., --last 30d, --last 7d) — relative date range
Date filters apply before the incremental processed-message check
If both --last and --from/--to are provided, the CLI exits with a config error

Dry Run

--dry-run processes the most recent N emails (default: 5) without writing to the spreadsheet
Shows what would be extracted, categorized, and written
Dry run does not update the state file or call destination pages for enrichment unless --dry-run is combined with --enrich-only

Error Handling

Critical errors (Gmail auth failure, spreadsheet write failure, config errors) → stop execution
Individual errors (one link fails to enrich, one email fails to parse) → log and continue
Summary at end includes error counts and details

Progress & Logging

Progress bar during processing (emails fetched, links extracted, enrichment status)
Summary stats at the end: newsletters processed, links extracted, duplicates found, dead links, sponsors, errors

CLI Interface

Commands

nlc init          # Interactive setup: OAuth, config file, connectivity test
nlc run [flags]   # Main processing command

`nlc run` Flags

Flag	Description	Default
`--full`	Reprocess all emails, not just new ones	false
`--dry-run [N]`	Process most recent N emails without writing to sheet	5
`--from YYYY-MM-DD`	Process emails from this date	(none)
`--to YYYY-MM-DD`	Process emails up to this date	(none)
`--last N`	Process emails from last N days (e.g., `--last 30d`)	(none)
`--skip-enrich`	Skip the enrichment phase (only extract + categorize)	false
`--enrich-only`	Only run enrichment on already-extracted links	false
`--config PATH`	Path to config file	`./config.yaml`
`--verbose`	Detailed per-email and per-link output	false

Configuration

File Format: YAML

Location: ./config.yaml (overridable with --config)

Sample Structure

# Gmail settings
gmail:
  folder: "Newsletters"           # Gmail label/folder to process
  credentials: "~/.nlc/gmail-credentials.json"
  token: "~/.nlc/gmail-token.json"

# Output settings
output:
  name: "Newsletter Link Catalog"  # Spreadsheet name
  sheets_api:
    enabled: true
    credentials: "~/.nlc/sheets-credentials.json"
    token: "~/.nlc/sheets-token.json"
  excel:
    enabled: true
    path: "./output/newsletter-catalog.xlsx"

# Newsletter identification
newsletters:
  # Manual overrides for parsed display names
  "alex@bytebytego.com":
    display_name: "ByteByteGo"
  "dan@techtakesweekly.com":
    display_name: "Tech Takes Weekly"

# Link processing
links:
  unwrap_redirects: true
  strip_utm: true
  tracking_params:
    - "utm_*"
    - "fbclid"
    - "gclid"
    - "mc_cid"
    - "mc_eid"
  redirect_limit: 5
  read_more_pattern: "(?i)^(read more|continue reading|learn more)$"
  share_patterns:
    - "(?i)share"
    - "(?i)forward to a friend"
  sponsor_markers:
    - "(?i)sponsor"
    - "(?i)sponsored"
    - "(?i)advertisement"
    - "(?i)partner"
  filter_unsubscribe: true
  filter_social_footer: true
  filter_share_links: true
  merge_read_more: true

# Categorization
categories:
  # Built-in taxonomy is used by default; extend here
  custom:
    - "AI/ML"
    - "Career"
    - "Rust"
  # LLM settings for category inference
  llm:
    provider: "anthropic"         # anthropic | openai | local | openai-compatible
    model: "claude-sonnet-4-6"
    api_key_env: "ANTHROPIC_API_KEY"
    base_url: null                # for local/openai-compatible
    failure_category: "Uncategorized"

# Enrichment
enrichment:
  enabled: true
  concurrency: 3
  delay_ms: 1500
  retries: 2
  timeout_ms: 10000

# Rate limiting (applies to both Gmail API and enrichment)
rate_limit:
  gmail_qps: 5                    # queries per second to Gmail API
  link_concurrency: 3             # parallel link fetches

# State
state_file: "~/.nlc/state.json"

# Parsing plugins
plugins:
  substack:
    enabled: true

Issue Date Override

For newsletters where the email arrival date doesn't match the issue date, overrides can be configured:

newsletters:
  "sender@domain.com":
    display_name: "Newsletter Name"
    date_override: "subject"      # Parse date from subject line
    date_format: "%B %d, %Y"      # Expected date format in subject

Data Flow

┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Gmail API  │────▶│ Parse HTML   │────▶│  Categorize  │────▶│ Write Sheet  │
│  (fetch)    │     │ + Extract    │     │  (hybrid)    │     │ (Phase 1)    │
└─────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
                           │                                          │
                           ▼                                          ▼
                    ┌──────────────┐                           ┌──────────────┐
                    │  State File  │                           │  Enrichment  │
                    │  (processed  │                           │  (Phase 2)   │
                    │   tracking)  │                           │  Page titles  │
                    └──────────────┘                           └──────────────┘

Edge Cases

Scenario	Behavior
Email is a single image with no links	Skip with warning, log to state
"View in browser" link instead of content	Fetch the first matching mirror link, extract links from that HTML
Same link in multiple newsletters	Keep all occurrences, cross-reference via "Also In" column
Same link multiple times in one issue	Deduplicate per-issue; single row per unique URL
Link returns 4xx/5xx during enrichment	Move to Dead Links sheet
Link is paywalled/auth-required	Include in content sheet, mark Page Title + Meta as "[paywall]"
Link times out or has a network error	Include in content sheet, mark Page Title + Meta as "[unreachable: error_type]"
Newsletter name > 100 chars	Truncate for sheet name
Sheet already exists for newsletter	Append new rows, don't overwrite existing data
Gmail API rate limit	Retry with exponential backoff
OAuth token expired	Auto-refresh, re-prompt if refresh fails
Newsletter format changes	Parser falls back to generic HTML extraction

Setup & First Run

nlc init — Interactive walkthrough:
- Authenticate with Gmail (OAuth browser flow)
- Authenticate with Google Sheets (if using Sheets output)
- Select the Gmail folder/label to process
- Configure output location
- Test connectivity
- Generate config.yaml
nlc run --dry-run — Test with 5 most recent emails
nlc run — Full processing run
nlc run --enrich-only — Enrich previously extracted links with page titles

17 KiB Raw Permalink Blame History

Newsletter Link Catalog — Specification