Newsletter-Link-Catalog/notes/original SPEC.md at main

keith/Newsletter-Link-Catalog

Fork 0

Files

T

Keith Solomon 265f69d95a ✨feature: First push to git

2026-05-16 14:02:49 -05:00

15 KiB

Raw Permalink Blame History

Overview

A CLI tool that extracts links from newsletters in a designated Gmail folder, categorizes them, enriches them with metadata, and compiles them into a spreadsheet. Each newsletter gets its own sheet, links are organized by issue date and category, and sponsor links are tracked separately.

Architecture

Language & Runtime

TypeScript/Node.js — compiled to a standalone binary via pkg or tsx-bundle
CLI tool invoked as nlc run [flags]

Distribution

Standalone binary — no Node runtime required on the host machine
Built and packaged via CI or build script

Run Modes

Manual: Run nlc run on demand with optional date filters
Scheduled: Can be run via cron/Task Scheduler for recurring processing
Designed for both; no daemon mode required

Gmail Integration

Authentication

OAuth2 browser flow — user authorizes via browser, tokens persisted locally
nlc init command walks through OAuth setup interactively

Scope

Processes emails from a single designated Gmail folder/label (configured in config.yaml)
Does not scan the entire inbox or search by sender patterns

Email Processing

HTML only — plain-text parts are ignored
Image-only emails (single image, no extractable links) are skipped with a warning logged
"View in browser" emails — fetches the web version's HTML and extracts links from that instead
Incremental by default: tracks processed Message-IDs in a local state file, only processes new emails
--full flag forces reprocessing of all emails

Link Extraction & Processing

Extraction Pipeline

Fetch emails from the configured Gmail folder (incremental or full)
Parse HTML to extract links, section headers, and surrounding text
Filter out noise links: unsubscribe, social footer icons, "share this newsletter" links
Unwrap tracking redirects and strip UTM parameters — store only the clean canonical URL
Merge "Read more" links with their preceding content (detected by: same URL + "read more" anchor text)
Categorize each link (see Categorization section)
Write to spreadsheet (see Output section)

Noise Filtering

The following link types are excluded from content sheets:

Unsubscribe links
Social media footer links (Twitter, LinkedIn, etc.)
"Share this newsletter" / "Forward to a friend" links
"View in browser" mirror links (content is extracted from the web version instead)

Sponsor/ad links are not filtered — they go to a separate sheet.

URL Handling

Unwrap all tracking redirects (Mailchimp, Substack, etc.)
Strip UTM parameters and other tracking query params
Store only the clean canonical URL
Dead/broken links (4xx/5xx during enrichment) are moved to a separate "Dead Links" sheet

"Read More" Merging

When two consecutive elements point to the same URL and one has "read more" (or similar) anchor text, they are merged into a single entry combining the preceding description text and the link.

Categorization

Strategy: Hybrid

Primary: Use the newsletter's own section headers (e.g., "Python", "DevOps", "Career") as categories
Fallback: When section headers aren't available or don't cover a link, use rule-based classification (URL patterns + keywords)
Final fallback: LLM-based categorization when rules don't match

Category Taxonomy

LLM-generated by default — the model assigns categories based on link content
Built-in base taxonomy shipped with the tool for common dev categories (Python, JavaScript, DevOps, Security, etc.)
User can extend via config with custom categories
LLM is instructed to prefer existing categories and only create new ones when nothing fits

LLM Provider Support (BYOK)

All providers supported, configurable in config.yaml:

Claude/Anthropic — Anthropic API
OpenAI/GPT — OpenAI API
Local models — Ollama, LM Studio
OpenAI-compatible endpoints — Mistral, Groq, Together, etc.

Provider config includes: API key, base URL, model name, and optional parameters.

Generic HTML parser as the default
Platform-specific parsers loaded as plugins (detected by URL patterns or email headers)
Substack shipped as the first plugin — uses Substack's predictable HTML structure for more reliable extraction
Additional parsers can be added as plugins without modifying core logic

Output: Spreadsheet

Supported Formats

Google Sheets — via Google Sheets API (live, shareable, auto-updated)
Local Excel (.xlsx) — written to disk, can be uploaded manually

Config selects which output(s) to use; both can be active simultaneously.

Spreadsheet Name

Fixed name set in config.yaml (e.g., "Newsletter Link Catalog")

Sheet Naming

Each newsletter gets its own sheet named after the parsed display name from the email's From header
Names truncated to fit Google Sheets' 100-character limit
Special characters replaced as needed for sheet name validity

Content Sheet Columns

Every row is fully populated (flat table — no blank cells for grouping):

Column	Description
Issue Date	Date from email's Date header (overridable per-newsletter)
Category	Assigned category (from newsletter sections, rules, or LLM)
Link URL	Clean canonical URL after unwrapping and UTM removal
Title	Anchor text / headline from the newsletter
Description	1-2 sentence description from the newsletter (if present)
Page Title + Meta	`<title>` and meta description from the destination page (enrichment phase)
Source Newsletter	Name of the newsletter this link came from
Also In	Cross-reference: other newsletters that also mentioned this link

Single sheet named "Sponsored Links" containing sponsor/ad links from all newsletters:

| Column | Description | |---| | Newsletter | Which newsletter this sponsor link appeared in | | Sponsor | Sponsor name (parsed from newsletter) | | Link | Sponsor's link URL | | Description | Sponsor description from the newsletter |

Dead Links Sheet

Single sheet named "Dead Links" for links that returned errors during enrichment:

| Column | Description | |---| | URL | The clean canonical URL | | Status | HTTP status or error type (404, 403, timeout, etc.) | | Source | Newsletter name | | Date | Issue date |

Cross-References

Duplicates across newsletters are kept in their respective sheets (all occurrences preserved)
The Also In column annotates each row with which other newsletters mentioned the same link and when (e.g., "TLDR Web Dev (Mar 5)")
This enables finding cross-newsletter coverage without a separate consolidated sheet

No "All Links" Master Sheet

Only per-newsletter content sheets, plus the consolidated Sponsor and Dead Links sheets. No "All Links" aggregation sheet.

Enrichment

Two-Phase Approach

Phase 1 (Store): Extract links from newsletters, categorize, and write to spreadsheet with all available in-newsletter metadata
Phase 2 (Enrich): Separate pass to fetch each link's destination page for <title> and meta description

This keeps the initial run fast and allows enrichment to be run independently.

Enrichment Details

Configurable concurrency (safe defaults: 3-5 parallel, 1-2s delay between batches)
Retries on transient failures
Dead links (4xx/5xx) moved to Dead Links sheet
Skip paywalled/auth-required pages (detected by login redirects) — mark with "paywall" status
Progress bar shows enrichment status in real-time

Link Liveness

Dead links are not included in content sheets — they go to the Dead Links sheet
Paywalled/unreachable links are included in content sheets but flagged in the Page Title + Meta column

Processing Model

Incremental Processing

Local state file (JSON) tracks processed Message-IDs and enrichment status
On subsequent runs, only new/unprocessed emails are fetched
--full flag forces reprocessing of all emails
State file location: ~/.nlc/state.json (or configured path)

Date Filtering

--from YYYY-MM-DD and --to YYYY-MM-DD — absolute date range
--last N (e.g., --last 30d, --last 7d) — relative date range
Can be combined with incremental processing

Dry Run

--dry-run processes the most recent X emails (default: 5) without writing to the spreadsheet
Shows what would be extracted, categorized, and written
Useful for testing config changes and parser tweaks

Error Handling

Critical errors (Gmail auth failure, spreadsheet write failure, config errors) → stop execution
Individual errors (one link fails to enrich, one email fails to parse) → log and continue
Summary at end includes error counts and details

Progress & Logging

Progress bar during processing (emails fetched, links extracted, enrichment status)
Summary stats at the end: newsletters processed, links extracted, duplicates found, dead links, sponsors, errors

CLI Interface

Commands

nlc init          # Interactive setup: OAuth, config file, connectivity test
nlc run [flags]   # Main processing command

`nlc run` Flags

Flag	Description	Default
`--full`	Reprocess all emails, not just new ones	false
`--dry-run [N]`	Process most recent N emails without writing to sheet	5
`--from YYYY-MM-DD`	Process emails from this date	(none)
`--to YYYY-MM-DD`	Process emails up to this date	(none)
`--last N`	Process emails from last N days (e.g., `--last 30d`)	(none)
`--skip-enrich`	Skip the enrichment phase (only extract + categorize)	false
`--enrich-only`	Only run enrichment on already-extracted links	false
`--config PATH`	Path to config file	`./config.yaml`
`--verbose`	Detailed per-email and per-link output	false

Configuration

File Format: YAML

Location: ./config.yaml (overridable with --config)

Sample Structure

# Gmail settings
gmail:
  folder: "Newsletters"           # Gmail label/folder to process
  credentials: "~/.nlc/gmail-credentials.json"
  token: "~/.nlc/gmail-token.json"

# Output settings
output:
  name: "Newsletter Link Catalog"  # Spreadsheet name
  sheets_api:
    enabled: true
    credentials: "~/.nlc/sheets-credentials.json"
    token: "~/.nlc/sheets-token.json"
  excel:
    enabled: true
    path: "./output/newsletter-catalog.xlsx"

# Newsletter identification
newsletters:
  # Manual overrides for parsed display names
  # sender_pattern: "display_name"
  "alex@bytebytego.com": "ByteByteGo"
  "dan@techtakesweekly.com": "Tech Takes Weekly"

# Link processing
links:
  unwrap_redirects: true
  strip_utm: true
  filter_unsubscribe: true
  filter_social_footer: true
  filter_share_links: true
  merge_read_more: true

# Categorization
categories:
  # Built-in taxonomy is used by default; extend here
  custom:
    - "AI/ML"
    - "Career"
    - "Rust"
  # LLM settings for category inference
  llm:
    provider: "anthropic"         # anthropic | openai | local | openai-compatible
    model: "claude-sonnet-4-6"
    api_key_env: "ANTHROPIC_API_KEY"  # or set in env
    base_url: null                # for local/openai-compatible
    fallback_to_rules: true       # if LLM fails, use rule-based

# Enrichment
enrichment:
  enabled: true
  concurrency: 3
  delay_ms: 1500
  retries: 2
  timeout_ms: 10000

# Rate limiting (applies to both Gmail API and enrichment)
rate_limit:
  gmail_qps: 5                    # queries per second to Gmail API
  link_concurrency: 3             # parallel link fetches

# State
state_file: "~/.nlc/state.json"

# Parsing plugins
plugins:
  substack:
    enabled: true

Issue Date Override

For newsletters where the email arrival date doesn't match the issue date, overrides can be configured:

newsletters:
  "sender@domain.com":
    display_name: "Newsletter Name"
    date_override: "subject"      # Parse date from subject line
    date_format: "%B %d, %Y"      # Expected date format in subject

Data Flow

┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Gmail API  │────▶│ Parse HTML   │────▶│  Categorize  │────▶│ Write Sheet  │
│  (fetch)    │     │ + Extract    │     │  (hybrid)    │     │ (Phase 1)    │
└─────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
                           │                                          │
                           ▼                                          ▼
                    ┌──────────────┐                           ┌──────────────┐
                    │  State File  │                           │  Enrichment  │
                    │  (processed  │                           │  (Phase 2)   │
                    │   tracking)  │                           │  Page titles  │
                    └──────────────┘                           └──────────────┘

Edge Cases

Scenario	Behavior
Email is a single image with no links	Skip with warning, log to state
"View in browser" link instead of content	Fetch web version HTML, extract links from that
Same link in multiple newsletters	Keep all occurrences, cross-reference via "Also In" column
Same link multiple times in one issue	Deduplicate per-issue; single row per unique URL
Link returns 4xx/5xx during enrichment	Move to Dead Links sheet
Link is paywalled/auth-required	Include in content sheet, mark Page Title as "[paywall]"
Newsletter name > 100 chars	Truncate for sheet name
Sheet already exists for newsletter	Append new rows, don't overwrite existing data
Gmail API rate limit	Retry with exponential backoff
OAuth token expired	Auto-refresh, re-prompt if refresh fails
Newsletter format changes	Parser falls back to generic HTML extraction

Setup & First Run

nlc init — Interactive walkthrough:
- Authenticate with Gmail (OAuth browser flow)
- Authenticate with Google Sheets (if using Sheets output)
- Select the Gmail folder/label to process
- Configure output location
- Test connectivity
- Generate config.yaml
nlc run --dry-run — Test with 5 most recent emails
nlc run — Full processing run
nlc run --enrich-only — Enrich previously extracted links with page titles

Future Considerations

These are not in scope for v1 but noted for potential future work:

Search/filter functionality within the spreadsheet
Web UI for browsing the catalog
Email forwarding as an alternative to Gmail API access
Automatic category taxonomy refinement based on accumulated data
Plugin system for additional newsletter platforms beyond Substack
Notification on new newsletter processing

15 KiB Raw Permalink Blame History

Newsletter Link Catalog — Specification