15 KiB
Newsletter Link Catalog — Specification
Overview
A CLI tool that extracts links from newsletters in a designated Gmail folder, categorizes them, enriches them with metadata, and compiles them into a spreadsheet. Each newsletter gets its own sheet, links are organized by issue date and category, and sponsor links are tracked separately.
Architecture
Language & Runtime
- TypeScript/Node.js — compiled to a standalone binary via
pkgortsx-bundle - CLI tool invoked as
nlc run [flags]
Distribution
- Standalone binary — no Node runtime required on the host machine
- Built and packaged via CI or build script
Run Modes
- Manual: Run
nlc runon demand with optional date filters - Scheduled: Can be run via cron/Task Scheduler for recurring processing
- Designed for both; no daemon mode required
Gmail Integration
Authentication
- OAuth2 browser flow — user authorizes via browser, tokens persisted locally
nlc initcommand walks through OAuth setup interactively
Scope
- Processes emails from a single designated Gmail folder/label (configured in
config.yaml) - Does not scan the entire inbox or search by sender patterns
Email Processing
- HTML only — plain-text parts are ignored
- Image-only emails (single image, no extractable links) are skipped with a warning logged
- "View in browser" emails — fetches the web version's HTML and extracts links from that instead
- Incremental by default: tracks processed Message-IDs in a local state file, only processes new emails
--fullflag forces reprocessing of all emails
Link Extraction & Processing
Extraction Pipeline
- Fetch emails from the configured Gmail folder (incremental or full)
- Parse HTML to extract links, section headers, and surrounding text
- Filter out noise links: unsubscribe, social footer icons, "share this newsletter" links
- Unwrap tracking redirects and strip UTM parameters — store only the clean canonical URL
- Merge "Read more" links with their preceding content (detected by: same URL + "read more" anchor text)
- Categorize each link (see Categorization section)
- Write to spreadsheet (see Output section)
Noise Filtering
The following link types are excluded from content sheets:
- Unsubscribe links
- Social media footer links (Twitter, LinkedIn, etc.)
- "Share this newsletter" / "Forward to a friend" links
- "View in browser" mirror links (content is extracted from the web version instead)
Sponsor/ad links are not filtered — they go to a separate sheet.
URL Handling
- Unwrap all tracking redirects (Mailchimp, Substack, etc.)
- Strip UTM parameters and other tracking query params
- Store only the clean canonical URL
- Dead/broken links (4xx/5xx during enrichment) are moved to a separate "Dead Links" sheet
"Read More" Merging
When two consecutive elements point to the same URL and one has "read more" (or similar) anchor text, they are merged into a single entry combining the preceding description text and the link.
Categorization
Strategy: Hybrid
- Primary: Use the newsletter's own section headers (e.g., "Python", "DevOps", "Career") as categories
- Fallback: When section headers aren't available or don't cover a link, use rule-based classification (URL patterns + keywords)
- Final fallback: LLM-based categorization when rules don't match
Category Taxonomy
- LLM-generated by default — the model assigns categories based on link content
- Built-in base taxonomy shipped with the tool for common dev categories (Python, JavaScript, DevOps, Security, etc.)
- User can extend via config with custom categories
- LLM is instructed to prefer existing categories and only create new ones when nothing fits
LLM Provider Support (BYOK)
All providers supported, configurable in config.yaml:
- Claude/Anthropic — Anthropic API
- OpenAI/GPT — OpenAI API
- Local models — Ollama, LM Studio
- OpenAI-compatible endpoints — Mistral, Groq, Together, etc.
Provider config includes: API key, base URL, model name, and optional parameters.
Newsletter Parsing: Plugin System
- Generic HTML parser as the default
- Platform-specific parsers loaded as plugins (detected by URL patterns or email headers)
- Substack shipped as the first plugin — uses Substack's predictable HTML structure for more reliable extraction
- Additional parsers can be added as plugins without modifying core logic
Output: Spreadsheet
Supported Formats
- Google Sheets — via Google Sheets API (live, shareable, auto-updated)
- Local Excel (.xlsx) — written to disk, can be uploaded manually
Config selects which output(s) to use; both can be active simultaneously.
Spreadsheet Name
- Fixed name set in
config.yaml(e.g., "Newsletter Link Catalog")
Sheet Naming
- Each newsletter gets its own sheet named after the parsed display name from the email's From header
- Names truncated to fit Google Sheets' 100-character limit
- Special characters replaced as needed for sheet name validity
Content Sheet Columns
Every row is fully populated (flat table — no blank cells for grouping):
| Column | Description |
|---|---|
| Issue Date | Date from email's Date header (overridable per-newsletter) |
| Category | Assigned category (from newsletter sections, rules, or LLM) |
| Link URL | Clean canonical URL after unwrapping and UTM removal |
| Title | Anchor text / headline from the newsletter |
| Description | 1-2 sentence description from the newsletter (if present) |
| Page Title + Meta | <title> and meta description from the destination page (enrichment phase) |
| Source Newsletter | Name of the newsletter this link came from |
| Also In | Cross-reference: other newsletters that also mentioned this link |
Sponsor Sheet (Consolidated)
Single sheet named "Sponsored Links" containing sponsor/ad links from all newsletters:
| Column | Description | |---| | Newsletter | Which newsletter this sponsor link appeared in | | Sponsor | Sponsor name (parsed from newsletter) | | Link | Sponsor's link URL | | Description | Sponsor description from the newsletter |
Dead Links Sheet
Single sheet named "Dead Links" for links that returned errors during enrichment:
| Column | Description | |---| | URL | The clean canonical URL | | Status | HTTP status or error type (404, 403, timeout, etc.) | | Source | Newsletter name | | Date | Issue date |
Cross-References
- Duplicates across newsletters are kept in their respective sheets (all occurrences preserved)
- The Also In column annotates each row with which other newsletters mentioned the same link and when (e.g., "TLDR Web Dev (Mar 5)")
- This enables finding cross-newsletter coverage without a separate consolidated sheet
No "All Links" Master Sheet
Only per-newsletter content sheets, plus the consolidated Sponsor and Dead Links sheets. No "All Links" aggregation sheet.
Enrichment
Two-Phase Approach
- Phase 1 (Store): Extract links from newsletters, categorize, and write to spreadsheet with all available in-newsletter metadata
- Phase 2 (Enrich): Separate pass to fetch each link's destination page for
<title>and meta description
This keeps the initial run fast and allows enrichment to be run independently.
Enrichment Details
- Configurable concurrency (safe defaults: 3-5 parallel, 1-2s delay between batches)
- Retries on transient failures
- Dead links (4xx/5xx) moved to Dead Links sheet
- Skip paywalled/auth-required pages (detected by login redirects) — mark with "paywall" status
- Progress bar shows enrichment status in real-time
Link Liveness
- Dead links are not included in content sheets — they go to the Dead Links sheet
- Paywalled/unreachable links are included in content sheets but flagged in the Page Title + Meta column
Processing Model
Incremental Processing
- Local state file (JSON) tracks processed Message-IDs and enrichment status
- On subsequent runs, only new/unprocessed emails are fetched
--fullflag forces reprocessing of all emails- State file location:
~/.nlc/state.json(or configured path)
Date Filtering
--from YYYY-MM-DDand--to YYYY-MM-DD— absolute date range--last N(e.g.,--last 30d,--last 7d) — relative date range- Can be combined with incremental processing
Dry Run
--dry-runprocesses the most recent X emails (default: 5) without writing to the spreadsheet- Shows what would be extracted, categorized, and written
- Useful for testing config changes and parser tweaks
Error Handling
- Critical errors (Gmail auth failure, spreadsheet write failure, config errors) → stop execution
- Individual errors (one link fails to enrich, one email fails to parse) → log and continue
- Summary at end includes error counts and details
Progress & Logging
- Progress bar during processing (emails fetched, links extracted, enrichment status)
- Summary stats at the end: newsletters processed, links extracted, duplicates found, dead links, sponsors, errors
CLI Interface
Commands
nlc init # Interactive setup: OAuth, config file, connectivity test
nlc run [flags] # Main processing command
nlc run Flags
| Flag | Description | Default |
|---|---|---|
--full |
Reprocess all emails, not just new ones | false |
--dry-run [N] |
Process most recent N emails without writing to sheet | 5 |
--from YYYY-MM-DD |
Process emails from this date | (none) |
--to YYYY-MM-DD |
Process emails up to this date | (none) |
--last N |
Process emails from last N days (e.g., --last 30d) |
(none) |
--skip-enrich |
Skip the enrichment phase (only extract + categorize) | false |
--enrich-only |
Only run enrichment on already-extracted links | false |
--config PATH |
Path to config file | ./config.yaml |
--verbose |
Detailed per-email and per-link output | false |
Configuration
File Format: YAML
Location: ./config.yaml (overridable with --config)
Sample Structure
# Gmail settings
gmail:
folder: "Newsletters" # Gmail label/folder to process
credentials: "~/.nlc/gmail-credentials.json"
token: "~/.nlc/gmail-token.json"
# Output settings
output:
name: "Newsletter Link Catalog" # Spreadsheet name
sheets_api:
enabled: true
credentials: "~/.nlc/sheets-credentials.json"
token: "~/.nlc/sheets-token.json"
excel:
enabled: true
path: "./output/newsletter-catalog.xlsx"
# Newsletter identification
newsletters:
# Manual overrides for parsed display names
# sender_pattern: "display_name"
"alex@bytebytego.com": "ByteByteGo"
"dan@techtakesweekly.com": "Tech Takes Weekly"
# Link processing
links:
unwrap_redirects: true
strip_utm: true
filter_unsubscribe: true
filter_social_footer: true
filter_share_links: true
merge_read_more: true
# Categorization
categories:
# Built-in taxonomy is used by default; extend here
custom:
- "AI/ML"
- "Career"
- "Rust"
# LLM settings for category inference
llm:
provider: "anthropic" # anthropic | openai | local | openai-compatible
model: "claude-sonnet-4-6"
api_key_env: "ANTHROPIC_API_KEY" # or set in env
base_url: null # for local/openai-compatible
fallback_to_rules: true # if LLM fails, use rule-based
# Enrichment
enrichment:
enabled: true
concurrency: 3
delay_ms: 1500
retries: 2
timeout_ms: 10000
# Rate limiting (applies to both Gmail API and enrichment)
rate_limit:
gmail_qps: 5 # queries per second to Gmail API
link_concurrency: 3 # parallel link fetches
# State
state_file: "~/.nlc/state.json"
# Parsing plugins
plugins:
substack:
enabled: true
Issue Date Override
For newsletters where the email arrival date doesn't match the issue date, overrides can be configured:
newsletters:
"sender@domain.com":
display_name: "Newsletter Name"
date_override: "subject" # Parse date from subject line
date_format: "%B %d, %Y" # Expected date format in subject
Data Flow
┌─────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Gmail API │────▶│ Parse HTML │────▶│ Categorize │────▶│ Write Sheet │
│ (fetch) │ │ + Extract │ │ (hybrid) │ │ (Phase 1) │
└─────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ State File │ │ Enrichment │
│ (processed │ │ (Phase 2) │
│ tracking) │ │ Page titles │
└──────────────┘ └──────────────┘
Edge Cases
| Scenario | Behavior |
|---|---|
| Email is a single image with no links | Skip with warning, log to state |
| "View in browser" link instead of content | Fetch web version HTML, extract links from that |
| Same link in multiple newsletters | Keep all occurrences, cross-reference via "Also In" column |
| Same link multiple times in one issue | Deduplicate per-issue; single row per unique URL |
| Link returns 4xx/5xx during enrichment | Move to Dead Links sheet |
| Link is paywalled/auth-required | Include in content sheet, mark Page Title as "[paywall]" |
| Newsletter name > 100 chars | Truncate for sheet name |
| Sheet already exists for newsletter | Append new rows, don't overwrite existing data |
| Gmail API rate limit | Retry with exponential backoff |
| OAuth token expired | Auto-refresh, re-prompt if refresh fails |
| Newsletter format changes | Parser falls back to generic HTML extraction |
Setup & First Run
-
nlc init— Interactive walkthrough:- Authenticate with Gmail (OAuth browser flow)
- Authenticate with Google Sheets (if using Sheets output)
- Select the Gmail folder/label to process
- Configure output location
- Test connectivity
- Generate
config.yaml
-
nlc run --dry-run— Test with 5 most recent emails -
nlc run— Full processing run -
nlc run --enrich-only— Enrich previously extracted links with page titles
Future Considerations
These are not in scope for v1 but noted for potential future work:
- Search/filter functionality within the spreadsheet
- Web UI for browsing the catalog
- Email forwarding as an alternative to Gmail API access
- Automatic category taxonomy refinement based on accumulated data
- Plugin system for additional newsletter platforms beyond Substack
- Notification on new newsletter processing