17 KiB
Newsletter Link Catalog — Specification
Overview
A CLI tool that extracts links from newsletters in a designated Gmail folder, categorizes them, enriches them with metadata, and compiles them into a spreadsheet. Each newsletter gets its own sheet, links are organized by issue date and category, and sponsor links are tracked separately.
Architecture
Language & Runtime
- TypeScript/Node.js — compiled to a standalone binary by the project build script
- CLI tool invoked as
nlc run [flags]
Distribution
- Standalone binary — no Node runtime required on the host machine
- Built and packaged via CI or build script
- The build script must document the selected bundling tool and produce the binary from a clean checkout
Run Modes
- Manual: Run
nlc runon demand with optional date filters - Scheduled: Can be run via cron/Task Scheduler for recurring processing
- Designed for both; no daemon mode required
Gmail Integration
Authentication
- OAuth2 browser flow — user authorizes via browser, tokens persisted locally
nlc initcommand walks through OAuth setup interactively
Scope
- Processes emails from a single designated Gmail folder/label (configured in
config.yaml) - Does not scan the entire inbox or search by sender patterns
Email Processing
- HTML only — plain-text parts are ignored
- Image-only emails (single image, no extractable links) are skipped with a warning logged
- "View in browser" emails — if the email contains no content links after noise filtering and contains a mirror link with anchor text matching
view in browser,view online, orread online, fetch that mirror URL and extract links from the fetched HTML instead - Incremental by default: tracks processed Message-IDs in a local state file, only processes new emails
--fullflag forces reprocessing of all emails that match the configured label and any date filters
Link Extraction & Processing
Extraction Pipeline
- Fetch emails from the configured Gmail folder (incremental or full)
- Parse HTML to extract links, section headers, and surrounding text. A section header is the nearest preceding heading-like element (
h1-h6, table row header, or bold standalone line) within the same content block. - Filter out noise links: unsubscribe, social footer icons, "share this newsletter" links
- Unwrap supported tracking redirects and strip configured tracking query parameters — store the normalized destination URL
- Merge "Read more" links with their preceding content (detected by: consecutive links with the same normalized URL and anchor text matching the configured read-more pattern)
- Categorize each link (see Categorization section)
- Write to spreadsheet (see Output section)
Noise Filtering
The following link types are excluded from content sheets:
- Unsubscribe links
- Social media links in footer or sharing blocks
- Links whose anchor text or accessible label matches configured share/forward patterns
- "View in browser" mirror links (content is extracted from the web version instead)
Sponsor/ad links are not filtered — they go to a separate sheet when the link is inside a block labeled with configured sponsor markers such as "sponsor", "sponsored", "ad", "advertisement", or "partner".
URL Handling
- Unwrap HTTP redirects and supported provider redirect URLs up to the configured redirect limit
- Strip configured tracking query parameters, including
utm_*,fbclid,gclid,mc_cid,mc_eid, and provider-specific tracking parameters listed in config - Store the normalized destination URL after redirect unwrapping and query cleanup
- Dead/broken links (4xx/5xx during enrichment) are written to the "Dead Links" sheet and removed from content sheets when they were already written by an earlier phase or run
"Read More" Merging
When two consecutive extracted links point to the same normalized URL and one anchor text matches the configured read-more pattern, they are merged into a single entry combining the preceding link title/description with the read-more link URL.
Categorization
Strategy: Hybrid
- Primary: Use the newsletter's own section headers (e.g., "Python", "DevOps", "Career") as categories
- Fallback: When section headers aren't available or don't cover a link, use rule-based classification (URL patterns + keywords)
- Final fallback: LLM-based categorization when rules don't match
Category Taxonomy
- Built-in base taxonomy shipped with the tool for common dev categories (Python, JavaScript, DevOps, Security, etc.)
- User can extend via config with custom categories
- For fallback categorization, the LLM is instructed to prefer configured categories and may create a new category only when no existing category fits
LLM Provider Support (BYOK)
The tool supports a provider adapter interface and ships adapters for:
- Claude/Anthropic — Anthropic API
- OpenAI/GPT — OpenAI API
- Local models — Ollama, LM Studio
- OpenAI-compatible endpoints — Mistral, Groq, Together, etc.
Provider config includes: API key environment variable, base URL when required, model name, and optional provider parameters.
Newsletter Parsing: Plugin System
- Generic HTML parser as the default
- Platform-specific parsers loaded as plugins (detected by URL patterns or email headers)
- Substack shipped as the first plugin — maps Substack-specific HTML structures to the common extracted-link format
- Additional parsers can be added as plugins without modifying core logic
Output: Spreadsheet
Supported Formats
- Google Sheets — via Google Sheets API (live, shareable, updated by each write run)
- Local Excel (.xlsx) — written to disk, can be uploaded manually
Config selects which output(s) to use; both can be active simultaneously.
Spreadsheet Name
- Fixed name set in
config.yaml(e.g., "Newsletter Link Catalog")
Sheet Naming
- Each newsletter gets its own sheet named after the parsed display name from the email's From header
- Names truncated to fit Google Sheets' 100-character limit
- Characters invalid for Google Sheets or Excel sheet names are replaced with spaces, then repeated whitespace is collapsed
Content Sheet Columns
Every link occurrence is written as a flat row; blank grouping rows are not used. Fields unavailable from the source are written as empty cells.
| Column | Description |
|---|---|
| Issue Date | Date from email's Date header (overridable per-newsletter) |
| Category | Assigned category (from newsletter sections, rules, or LLM) |
| Link URL | Clean canonical URL after unwrapping and UTM removal |
| Title | Anchor text / headline from the newsletter |
| Description | 1-2 sentence description from the newsletter (if present) |
| Page Title + Meta | <title> and meta description from the destination page (enrichment phase) |
| Source Newsletter | Name of the newsletter this link came from |
| Also In | Cross-reference: other newsletters that also mentioned this link |
Sponsor Sheet (Consolidated)
Single sheet named "Sponsored Links" containing sponsor/ad links from all newsletters:
| Column | Description | |---| | Newsletter | Which newsletter this sponsor link appeared in | | Sponsor | Sponsor name (parsed from newsletter) | | Link | Sponsor's link URL | | Description | Sponsor description from the newsletter |
Dead Links Sheet
Single sheet named "Dead Links" for links that returned errors during enrichment:
| Column | Description | |---| | URL | The clean canonical URL | | Status | HTTP status or error type (404, 403, timeout, etc.) | | Source | Newsletter name | | Date | Issue date |
Cross-References
- Duplicates across newsletters are kept in their respective sheets (all occurrences preserved)
- The Also In column annotates each row with other newsletter issues that mentioned the same normalized URL, formatted as
Newsletter Name (YYYY-MM-DD)and joined with; - This enables finding cross-newsletter coverage without a separate consolidated sheet
No "All Links" Master Sheet
Only per-newsletter content sheets, plus the consolidated Sponsor and Dead Links sheets. No "All Links" aggregation sheet.
Enrichment
Two-Phase Approach
- Phase 1 (Store): Extract links from newsletters, categorize, and write to spreadsheet with all available in-newsletter metadata
- Phase 2 (Enrich): Separate pass to fetch each link's destination page for
<title>and meta description
Enrichment can be run independently from extraction and spreadsheet writing.
Enrichment Details
- Configurable concurrency with defaults of 3 parallel requests and 1500 ms delay between batches
- Retries on transient failures
- Dead links (4xx/5xx) are written to the Dead Links sheet and removed from content sheets when they were already written by an earlier phase or run
- Skip pages that redirect to a URL whose path or query contains
login,signin,subscribe, orpaywall— mark with "paywall" status - Progress bar updates after each completed enrichment request
Link Liveness
- Dead links are not included in content sheets — they go to the Dead Links sheet
- Paywalled links are included in content sheets and the Page Title + Meta column is set to
[paywall] - Timeout, DNS, TLS, and network failures are included in content sheets and the Page Title + Meta column is set to
[unreachable: error_type]
Processing Model
Incremental Processing
- Local state file (JSON) tracks processed Message-IDs and enrichment status
- On subsequent runs, only new/unprocessed emails are fetched
--fullflag forces reprocessing of all emails that match the configured label and any date filters- State file location:
~/.nlc/state.json(or configured path)
Date Filtering
--from YYYY-MM-DDand--to YYYY-MM-DD— absolute date range--last N(e.g.,--last 30d,--last 7d) — relative date range- Date filters apply before the incremental processed-message check
- If both
--lastand--from/--toare provided, the CLI exits with a config error
Dry Run
--dry-runprocesses the most recent N emails (default: 5) without writing to the spreadsheet- Shows what would be extracted, categorized, and written
- Dry run does not update the state file or call destination pages for enrichment unless
--dry-runis combined with--enrich-only
Error Handling
- Critical errors (Gmail auth failure, spreadsheet write failure, config errors) → stop execution
- Individual errors (one link fails to enrich, one email fails to parse) → log and continue
- Summary at end includes error counts and details
Progress & Logging
- Progress bar during processing (emails fetched, links extracted, enrichment status)
- Summary stats at the end: newsletters processed, links extracted, duplicates found, dead links, sponsors, errors
CLI Interface
Commands
nlc init # Interactive setup: OAuth, config file, connectivity test
nlc run [flags] # Main processing command
nlc run Flags
| Flag | Description | Default |
|---|---|---|
--full |
Reprocess all emails, not just new ones | false |
--dry-run [N] |
Process most recent N emails without writing to sheet | 5 |
--from YYYY-MM-DD |
Process emails from this date | (none) |
--to YYYY-MM-DD |
Process emails up to this date | (none) |
--last N |
Process emails from last N days (e.g., --last 30d) |
(none) |
--skip-enrich |
Skip the enrichment phase (only extract + categorize) | false |
--enrich-only |
Only run enrichment on already-extracted links | false |
--config PATH |
Path to config file | ./config.yaml |
--verbose |
Detailed per-email and per-link output | false |
Configuration
File Format: YAML
Location: ./config.yaml (overridable with --config)
Sample Structure
# Gmail settings
gmail:
folder: "Newsletters" # Gmail label/folder to process
credentials: "~/.nlc/gmail-credentials.json"
token: "~/.nlc/gmail-token.json"
# Output settings
output:
name: "Newsletter Link Catalog" # Spreadsheet name
sheets_api:
enabled: true
credentials: "~/.nlc/sheets-credentials.json"
token: "~/.nlc/sheets-token.json"
excel:
enabled: true
path: "./output/newsletter-catalog.xlsx"
# Newsletter identification
newsletters:
# Manual overrides for parsed display names
"alex@bytebytego.com":
display_name: "ByteByteGo"
"dan@techtakesweekly.com":
display_name: "Tech Takes Weekly"
# Link processing
links:
unwrap_redirects: true
strip_utm: true
tracking_params:
- "utm_*"
- "fbclid"
- "gclid"
- "mc_cid"
- "mc_eid"
redirect_limit: 5
read_more_pattern: "(?i)^(read more|continue reading|learn more)$"
share_patterns:
- "(?i)share"
- "(?i)forward to a friend"
sponsor_markers:
- "(?i)sponsor"
- "(?i)sponsored"
- "(?i)advertisement"
- "(?i)partner"
filter_unsubscribe: true
filter_social_footer: true
filter_share_links: true
merge_read_more: true
# Categorization
categories:
# Built-in taxonomy is used by default; extend here
custom:
- "AI/ML"
- "Career"
- "Rust"
# LLM settings for category inference
llm:
provider: "anthropic" # anthropic | openai | local | openai-compatible
model: "claude-sonnet-4-6"
api_key_env: "ANTHROPIC_API_KEY"
base_url: null # for local/openai-compatible
failure_category: "Uncategorized"
# Enrichment
enrichment:
enabled: true
concurrency: 3
delay_ms: 1500
retries: 2
timeout_ms: 10000
# Rate limiting (applies to both Gmail API and enrichment)
rate_limit:
gmail_qps: 5 # queries per second to Gmail API
link_concurrency: 3 # parallel link fetches
# State
state_file: "~/.nlc/state.json"
# Parsing plugins
plugins:
substack:
enabled: true
Issue Date Override
For newsletters where the email arrival date doesn't match the issue date, overrides can be configured:
newsletters:
"sender@domain.com":
display_name: "Newsletter Name"
date_override: "subject" # Parse date from subject line
date_format: "%B %d, %Y" # Expected date format in subject
Data Flow
┌─────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Gmail API │────▶│ Parse HTML │────▶│ Categorize │────▶│ Write Sheet │
│ (fetch) │ │ + Extract │ │ (hybrid) │ │ (Phase 1) │
└─────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ State File │ │ Enrichment │
│ (processed │ │ (Phase 2) │
│ tracking) │ │ Page titles │
└──────────────┘ └──────────────┘
Edge Cases
| Scenario | Behavior |
|---|---|
| Email is a single image with no links | Skip with warning, log to state |
| "View in browser" link instead of content | Fetch the first matching mirror link, extract links from that HTML |
| Same link in multiple newsletters | Keep all occurrences, cross-reference via "Also In" column |
| Same link multiple times in one issue | Deduplicate per-issue; single row per unique URL |
| Link returns 4xx/5xx during enrichment | Move to Dead Links sheet |
| Link is paywalled/auth-required | Include in content sheet, mark Page Title + Meta as "[paywall]" |
| Link times out or has a network error | Include in content sheet, mark Page Title + Meta as "[unreachable: error_type]" |
| Newsletter name > 100 chars | Truncate for sheet name |
| Sheet already exists for newsletter | Append new rows, don't overwrite existing data |
| Gmail API rate limit | Retry with exponential backoff |
| OAuth token expired | Auto-refresh, re-prompt if refresh fails |
| Newsletter format changes | Parser falls back to generic HTML extraction |
Setup & First Run
-
nlc init— Interactive walkthrough:- Authenticate with Gmail (OAuth browser flow)
- Authenticate with Google Sheets (if using Sheets output)
- Select the Gmail folder/label to process
- Configure output location
- Test connectivity
- Generate
config.yaml
-
nlc run --dry-run— Test with 5 most recent emails -
nlc run— Full processing run -
nlc run --enrich-only— Enrich previously extracted links with page titles