✨feature: First push to git
# Newsletter Link Catalog — Specification

## Overview

A CLI tool that extracts links from newsletters in a designated Gmail folder, categorizes them, enriches them with metadata, and compiles them into a spreadsheet. Each newsletter gets its own sheet, links are organized by issue date and category, and sponsor links are tracked separately.

## Architecture

### Language & Runtime

- **TypeScript/Node.js**: compiled to a standalone binary by the project build script
- CLI tool invoked as `nlc run [flags]`

### Distribution

- Standalone binary; no Node runtime required on the host machine
- Built and packaged via CI or build script
- The build script must document the selected bundling tool and produce the binary from a clean checkout

### Run Modes

- **Manual**: Run `nlc run` on demand with optional date filters
- **Scheduled**: Can be run via cron/Task Scheduler for recurring processing
- Designed for both; no daemon mode required

## Gmail Integration

### Authentication

- **OAuth2 browser flow**: the user authorizes via browser, and tokens are persisted locally
- The `nlc init` command walks through OAuth setup interactively

### Scope

- Processes emails from a **single designated Gmail folder/label** (configured in `config.yaml`)
- Does not scan the entire inbox or search by sender patterns

### Email Processing

- **HTML only**: plain-text parts are ignored
- **Image-only emails** (single image, no extractable links) are skipped and a warning is logged
- **"View in browser" emails**: if the email contains no content links after noise filtering but does contain a mirror link whose anchor text matches `view in browser`, `view online`, or `read online`, fetch that mirror URL and extract links from the fetched HTML instead
- Incremental by default: tracks processed Message-IDs in a local state file and only processes new emails
- `--full` flag forces reprocessing of all emails that match the configured label and any date filters
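
The "view in browser" fallback could be decided with a small pure function. This is a minimal sketch, not the tool's implementation: the `RawLink` shape, the `isNoise` field, and the name `pickMirrorUrl` are hypothetical, and case-insensitive matching is an assumption.

```typescript
// Hypothetical shape of a link extracted from an email's HTML part.
interface RawLink {
  href: string;
  anchorText: string;
  isNoise: boolean; // assumed to be set by the noise-filtering step
}

// Anchor texts named in the spec; case-insensitive matching is an assumption.
const MIRROR_TEXT = /view in browser|view online|read online/i;

// Returns the mirror URL to fetch, or null when the email's own links
// should be used (or when no mirror link exists).
function pickMirrorUrl(links: RawLink[]): string | null {
  const hasContentLinks = links.some(
    (l) => !l.isNoise && !MIRROR_TEXT.test(l.anchorText),
  );
  if (hasContentLinks) return null;
  // The first matching mirror link wins.
  const mirror = links.find((l) => MIRROR_TEXT.test(l.anchorText));
  return mirror ? mirror.href : null;
}
```

A caller would fetch the returned URL and run the normal extraction pipeline on that HTML; a `null` result means the email is processed as-is.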

## Link Extraction & Processing

### Extraction Pipeline

1. Fetch emails from the configured Gmail folder (incremental or full)
2. Parse HTML to extract links, section headers, and surrounding text. A section header is the nearest preceding heading-like element (`h1`-`h6`, table row header, or bold standalone line) within the same content block.
3. Filter out noise links: unsubscribe, social footer icons, "share this newsletter" links
4. Unwrap supported tracking redirects and strip configured tracking query parameters, then store the normalized destination URL
5. Merge "Read more" links with their preceding content (detected by: consecutive links with the same normalized URL and anchor text matching the configured read-more pattern)
6. Categorize each link (see Categorization section)
7. Write to spreadsheet (see Output section)

### Noise Filtering

The following link types are **excluded** from content sheets:

- Unsubscribe links
- Social media links in footer or sharing blocks
- Links whose anchor text or accessible label matches configured share/forward patterns
- "View in browser" mirror links (when an email has no other content links, content is extracted from the web version instead)

Sponsor/ad links are **not filtered**. They go to a separate sheet when the link is inside a block labeled with configured sponsor markers such as "sponsor", "sponsored", "ad", "advertisement", or "partner".
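
One way to express the noise/sponsor split is a classifier over per-link context. The sketch below is illustrative only: `LinkContext`, `compilePattern`, and `classifyLink` are hypothetical names, and the unsubscribe check is an assumed heuristic. Note that the sample config writes patterns with a leading `(?i)`, which JavaScript's `RegExp` does not accept, so the sketch converts that prefix into the `i` flag.

```typescript
type LinkKind = "noise" | "sponsor" | "content";

// Hypothetical context captured for each link during HTML parsing.
interface LinkContext {
  anchorText: string;
  blockLabel: string | null; // label of the enclosing block, if any
  isSocialFooterIcon: boolean;
}

// Strips a leading "(?i)" (rejected by JavaScript's RegExp) and sets the
// case-insensitive flag instead.
function compilePattern(source: string): RegExp {
  return source.startsWith("(?i)")
    ? new RegExp(source.slice(4), "i")
    : new RegExp(source);
}

function classifyLink(
  ctx: LinkContext,
  sharePatterns: string[],
  sponsorMarkers: string[],
): LinkKind {
  const label = ctx.blockLabel;
  // Sponsor blocks are checked first because sponsor links are never filtered.
  if (label !== null && sponsorMarkers.some((m) => compilePattern(m).test(label))) {
    return "sponsor";
  }
  const text = ctx.anchorText;
  const isShare = sharePatterns.some((p) => compilePattern(p).test(text));
  const isUnsubscribe = /unsubscribe/i.test(text); // assumed heuristic
  if (isShare || isUnsubscribe || ctx.isSocialFooterIcon) return "noise";
  return "content";
}
```

Checking sponsor markers before the noise patterns keeps a sponsor link with share-like anchor text on the sponsor sheet rather than dropping it; whether the real tool orders the checks this way is not stated in the spec.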

### URL Handling

- Unwrap HTTP redirects and supported provider redirect URLs up to the configured redirect limit
- Strip configured tracking query parameters, including `utm_*`, `fbclid`, `gclid`, `mc_cid`, `mc_eid`, and provider-specific tracking parameters listed in config
- Store the normalized destination URL after redirect unwrapping and query cleanup
- Dead/broken links (4xx/5xx during enrichment) are written to the "Dead Links" sheet and removed from content sheets when they were already written by an earlier phase or run
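
The query-cleanup step can be sketched with the standard `URL` and `URLSearchParams` APIs. The function name `stripTrackingParams` is hypothetical, and treating a trailing `*` as a prefix wildcard is an assumption based on the `utm_*` entry in the sample config.

```typescript
// Returns the URL with configured tracking parameters removed.
// A trailing "*" in a configured name is treated as a prefix wildcard
// (an assumption about how "utm_*" is meant to be read).
function stripTrackingParams(rawUrl: string, trackingParams: string[]): string {
  const url = new URL(rawUrl);
  const isTracked = (name: string): boolean =>
    trackingParams.some((p) =>
      p.endsWith("*") ? name.startsWith(p.slice(0, -1)) : name === p,
    );
  // Copy the keys first so deleting entries does not disturb iteration.
  for (const key of Array.from(url.searchParams.keys())) {
    if (isTracked(key)) url.searchParams.delete(key);
  }
  return url.toString();
}
```

Redirect unwrapping is not shown here; it would run before this step so that parameters are stripped from the final destination rather than from the tracking wrapper.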

### "Read More" Merging

When two consecutive extracted links point to the same normalized URL and one anchor text matches the configured read-more pattern, they are merged into a single entry that combines the preceding link's title and description with the read-more link's URL.
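
A minimal sketch of the merge rule follows. It handles only the case where the read-more link comes second, which is one reading of the rule above; `ExtractedLink` and `mergeReadMore` are hypothetical names.

```typescript
// Hypothetical common format for an extracted link.
interface ExtractedLink {
  url: string; // normalized destination URL
  title: string;
  description: string;
}

// Folds a "read more" link into the entry directly before it when both point
// to the same normalized URL. The pattern must not use the "g" flag, because
// RegExp.test() is stateful for global patterns.
function mergeReadMore(links: ExtractedLink[], readMore: RegExp): ExtractedLink[] {
  const merged: ExtractedLink[] = [];
  for (const link of links) {
    if (merged.length > 0) {
      const prev = merged[merged.length - 1];
      const sameUrl = prev.url === link.url;
      const prevIsReadMore = readMore.test(prev.title.trim());
      const curIsReadMore = readMore.test(link.title.trim());
      if (sameUrl && curIsReadMore && !prevIsReadMore) {
        // Keep the preceding title and description; the URLs are identical.
        continue;
      }
    }
    merged.push(link);
  }
  return merged;
}
```

Because both links already share a normalized URL, "combining" reduces to keeping the descriptive entry and dropping the read-more one.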

## Categorization

### Strategy: Hybrid

1. **Primary**: Use the newsletter's own section headers (e.g., "Python", "DevOps", "Career") as categories
2. **Fallback**: When section headers aren't available or don't cover a link, use rule-based classification (URL patterns + keywords)
3. **Final fallback**: LLM-based categorization when rules don't match
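
The first two stages of the chain can be sketched as a synchronous function that returns `null` when neither applies, leaving the LLM call to the caller. All names here (`CategoryRule`, `LinkInput`, `categorizeLocally`) are hypothetical, and the rule format is an assumption.

```typescript
// Hypothetical rule format: a category plus an optional URL pattern and keywords.
interface CategoryRule {
  category: string;
  urlPattern?: RegExp;
  keywords?: string[];
}

interface LinkInput {
  url: string;
  title: string;
  sectionHeader: string | null;
}

type Categorized = { category: string; source: "section" | "rule" };

// Stage 1: the newsletter's own section header. Stage 2: rules.
// A null result is the point where the LLM fallback would be invoked.
function categorizeLocally(link: LinkInput, rules: CategoryRule[]): Categorized | null {
  const header = link.sectionHeader?.trim();
  if (header) return { category: header, source: "section" };

  const title = link.title.toLowerCase();
  for (const rule of rules) {
    const urlHit = rule.urlPattern !== undefined && rule.urlPattern.test(link.url);
    const keywordHit = (rule.keywords ?? []).some((k) => title.includes(k.toLowerCase()));
    if (urlHit || keywordHit) return { category: rule.category, source: "rule" };
  }
  return null;
}
```

Recording the `source` alongside the category makes it easy to report how many links needed each stage in the end-of-run summary.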

### Category Taxonomy

- Built-in base taxonomy shipped with the tool for common dev categories (Python, JavaScript, DevOps, Security, etc.)
- Users can extend it via config with custom categories
- For fallback categorization, the LLM is instructed to prefer configured categories and may create a new category only when no existing category fits

### LLM Provider Support (BYOK)

The tool supports a provider adapter interface and ships adapters for:

- **Claude/Anthropic**: Anthropic API
- **OpenAI/GPT**: OpenAI API
- **Local models**: Ollama, LM Studio
- **OpenAI-compatible endpoints**: Mistral, Groq, Together, etc.

Provider config includes: API key environment variable, base URL when required, model name, and optional provider parameters.

### Newsletter Parsing: Plugin System

- Generic HTML parser as the default
- Platform-specific parsers loaded as plugins (detected by URL patterns or email headers)
- **Substack** shipped as the first plugin; it maps Substack-specific HTML structures to the common extracted-link format
- Additional parsers can be added as plugins without modifying core logic

## Output: Spreadsheet

### Supported Formats

- **Google Sheets**: via Google Sheets API (live, shareable, updated by each write run)
- **Local Excel (.xlsx)**: written to disk, can be uploaded manually

Config selects which output(s) to use; both can be active simultaneously.

### Spreadsheet Name

- Fixed name set in `config.yaml` (e.g., "Newsletter Link Catalog")

### Sheet Naming

- Each newsletter gets its own sheet named after the parsed display name from the email's From header
- Names are truncated to fit Google Sheets' 100-character limit. Excel sheet names are limited to 31 characters, so `.xlsx` output requires truncation to that shorter limit
- Characters invalid for Google Sheets or Excel sheet names are replaced with spaces, then repeated whitespace is collapsed
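
The naming rules could be implemented roughly as follows. This is a sketch under assumptions: `toSheetName` is a hypothetical name, and the invalid-character set is the one Excel rejects in sheet names (`\ / ? * [ ] :`).

```typescript
// Characters Excel rejects in sheet names: \ / ? * [ ] :
const INVALID_SHEET_CHARS = /[\\\/?*\[\]:]/g;

// Replaces invalid characters with spaces, collapses repeated whitespace,
// then truncates to the limit of the target format.
function toSheetName(displayName: string, maxLength: number = 100): string {
  const cleaned = displayName
    .replace(INVALID_SHEET_CHARS, " ")
    .replace(/\s+/g, " ")
    .trim();
  // Trim again in case truncation leaves a trailing space.
  return cleaned.slice(0, maxLength).trim();
}
```

For example, `toSheetName("Dev/Ops: Weekly [Beta]")` yields `Dev Ops Weekly Beta`, and passing `31` as the second argument applies the Excel limit. Two long names that differ only after the cut-off would collide; the spec does not say how that case is resolved.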

### Content Sheet Columns

Each link occurrence that remains after per-issue deduplication is written as a flat row; blank grouping rows are not used. Fields unavailable from the source are written as empty cells.

| Column | Description |
|---|---|
| Issue Date | Date from email's Date header (overridable per-newsletter) |
| Category | Assigned category (from newsletter sections, rules, or LLM) |
| Link URL | Normalized destination URL after redirect unwrapping and tracking-parameter removal |
| Title | Anchor text / headline from the newsletter |
| Description | 1-2 sentence description from the newsletter (if present) |
| Page Title + Meta | `<title>` and meta description from the destination page (enrichment phase) |
| Source Newsletter | Name of the newsletter this link came from |
| Also In | Cross-reference: other newsletters that also mentioned this link |

### Sponsor Sheet (Consolidated)

Single sheet named "Sponsored Links" containing sponsor/ad links from all newsletters:

| Column | Description |
|---|---|
| Newsletter | Which newsletter this sponsor link appeared in |
| Sponsor | Sponsor name (parsed from newsletter) |
| Link | Sponsor's link URL |
| Description | Sponsor description from the newsletter |

### Dead Links Sheet

Single sheet named "Dead Links" for links that returned errors during enrichment:

| Column | Description |
|---|---|
| URL | The normalized destination URL |
| Status | HTTP status code (404, 403, 500, etc.); timeouts and network errors stay in content sheets (see Link Liveness) |
| Source | Newsletter name |
| Date | Issue date |

### Cross-References

- Duplicates across newsletters are kept in their respective sheets (all occurrences preserved)
- The **Also In** column annotates each row with other newsletter issues that mentioned the same normalized URL, formatted as `Newsletter Name (YYYY-MM-DD)` and joined with `; `
- This enables finding cross-newsletter coverage without a separate consolidated sheet
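
The **Also In** value for a row can be computed from the full list of link occurrences. In this sketch, `Occurrence` and `alsoIn` are hypothetical names, and excluding other issues of the same newsletter is an assumption (the spec speaks of "other newsletters").

```typescript
// Hypothetical record for one link occurrence in one issue.
interface Occurrence {
  url: string; // normalized URL
  newsletter: string;
  issueDate: string; // YYYY-MM-DD
}

// Builds the "Also In" cell: every other newsletter issue that mentioned
// the same normalized URL, formatted and joined as the spec describes.
function alsoIn(row: Occurrence, all: Occurrence[]): string {
  const labels = all
    .filter((o) => o.url === row.url && o.newsletter !== row.newsletter)
    .map((o) => `${o.newsletter} (${o.issueDate})`);
  // A Set removes repeats while preserving first-seen order.
  return Array.from(new Set(labels)).join("; ");
}
```

Rows with no cross-references get an empty string, which matches the rule that unavailable fields are written as empty cells.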

### No "All Links" Master Sheet

The spreadsheet contains only per-newsletter content sheets plus the consolidated "Sponsored Links" and "Dead Links" sheets. There is no "All Links" aggregation sheet.

## Enrichment

### Two-Phase Approach

1. **Phase 1 (Store)**: Extract links from newsletters, categorize, and write to spreadsheet with all available in-newsletter metadata
2. **Phase 2 (Enrich)**: Separate pass to fetch each link's destination page for `<title>` and meta description

Enrichment can be run independently from extraction and spreadsheet writing.

### Enrichment Details

- Configurable concurrency with defaults of 3 parallel requests and 1500 ms delay between batches
- Retries on transient failures (the sample config uses 2 retries and a 10000 ms timeout)
- Dead links (4xx/5xx) are written to the Dead Links sheet and removed from content sheets when they were already written by an earlier phase or run
- Skip pages that redirect to a URL whose path or query contains `login`, `signin`, `subscribe`, or `paywall`, and mark them with "paywall" status
- Progress bar updates after each completed enrichment request
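
The paywall rule looks only at the path and query of the final URL after redirects. A minimal sketch, assuming case-insensitive substring matching and a hypothetical function name `isGatedRedirect`:

```typescript
// Markers listed in the spec for login/paywall redirects.
const GATED_MARKERS = ["login", "signin", "subscribe", "paywall"];

// True when the path or query of the final URL contains one of the markers.
// The host name is deliberately ignored, as the spec names only path and query.
function isGatedRedirect(finalUrl: string): boolean {
  const url = new URL(finalUrl);
  const haystack = (url.pathname + url.search).toLowerCase();
  return GATED_MARKERS.some((marker) => haystack.includes(marker));
}
```

Substring matching is simple but can misfire on ordinary articles whose path happens to contain a marker (for example a post about login flows), so a real implementation might restrict the check to URLs that were actually reached via a redirect.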

### Link Liveness

- Dead links are **not included** in content sheets; they go to the Dead Links sheet
- Paywalled links are included in content sheets and the Page Title + Meta column is set to `[paywall]`
- Timeout, DNS, TLS, and network failures are included in content sheets and the Page Title + Meta column is set to `[unreachable: error_type]`
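
These placement rules map naturally onto a discriminated union. The types and the function `placeLink` below are hypothetical, and joining the page title and meta description with ` | ` is an assumption, since the spec does not define how the two values share one column.

```typescript
// Hypothetical result of one enrichment fetch.
type FetchOutcome =
  | { kind: "ok"; pageTitle: string; metaDescription: string }
  | { kind: "http_error"; status: number } // 4xx/5xx
  | { kind: "paywall" }
  | { kind: "network_error"; errorType: string }; // timeout, dns, tls, ...

// Where the link ends up and what is written there.
type Placement =
  | { sheet: "content"; pageTitleMeta: string }
  | { sheet: "dead_links"; status: string };

function placeLink(outcome: FetchOutcome): Placement {
  switch (outcome.kind) {
    case "ok":
      // Separator is an assumption; the spec only names the column.
      return { sheet: "content", pageTitleMeta: `${outcome.pageTitle} | ${outcome.metaDescription}` };
    case "http_error":
      return { sheet: "dead_links", status: String(outcome.status) };
    case "paywall":
      return { sheet: "content", pageTitleMeta: "[paywall]" };
    case "network_error":
      return { sheet: "content", pageTitleMeta: `[unreachable: ${outcome.errorType}]` };
  }
}
```

Keeping the decision in one function means the Dead Links sheet and the content sheets cannot disagree about where a given outcome belongs.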

## Processing Model

### Incremental Processing

- Local state file (JSON) tracks processed Message-IDs and enrichment status
- On subsequent runs, only new/unprocessed emails are fetched
- `--full` flag forces reprocessing of all emails that match the configured label and any date filters
- State file location: `~/.nlc/state.json` (or configured path)
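
The incremental check amounts to a set difference against the state file. The `State` shape below is hypothetical (the spec says only that the file is JSON and tracks Message-IDs and enrichment status), as is the name `selectUnprocessed`.

```typescript
// Hypothetical subset of the state file's contents.
interface State {
  processedMessageIds: string[];
}

// Returns the Message-IDs that still need processing.
// With --full, every candidate is returned regardless of state.
function selectUnprocessed(candidateIds: string[], state: State, full: boolean): string[] {
  if (full) return candidateIds.slice();
  const seen = new Set(state.processedMessageIds);
  return candidateIds.filter((id) => !seen.has(id));
}
```

The candidates passed in are assumed to have already been narrowed by the configured label and any date filters, so this function is the last gate before fetching.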

### Date Filtering

- `--from YYYY-MM-DD` and `--to YYYY-MM-DD`: absolute date range
- `--last N` (e.g., `--last 30d`, `--last 7d`): relative date range, where the value is a number of days with a `d` suffix
- Date filters apply before the incremental processed-message check
- If both `--last` and `--from`/`--to` are provided, the CLI exits with a config error
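
Flag validation and range resolution might look like the sketch below. `DateFlags`, `DateRange`, and `resolveDateRange` are hypothetical names; interpreting dates as UTC midnight and leaving the end of a `--last` range open are assumptions, since the spec does not define time zones or whether `--to` is inclusive.

```typescript
interface DateFlags {
  from?: string; // YYYY-MM-DD
  to?: string; // YYYY-MM-DD
  last?: string; // e.g. "30d"
}

interface DateRange {
  from: Date | null;
  to: Date | null;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Resolves CLI date flags into a concrete range, rejecting the
// combination the spec treats as a config error.
function resolveDateRange(flags: DateFlags, now: Date): DateRange {
  const hasAbsolute = flags.from !== undefined || flags.to !== undefined;
  if (flags.last !== undefined && hasAbsolute) {
    throw new Error("--last cannot be combined with --from/--to");
  }
  if (flags.last !== undefined) {
    const match = /^(\d+)d$/.exec(flags.last);
    if (match === null) throw new Error(`invalid --last value: ${flags.last}`);
    const days = Number(match[1]);
    return { from: new Date(now.getTime() - days * MS_PER_DAY), to: null };
  }
  // Dates are interpreted as UTC midnight (assumption).
  const parse = (s: string | undefined): Date | null =>
    s === undefined ? null : new Date(`${s}T00:00:00Z`);
  return { from: parse(flags.from), to: parse(flags.to) };
}
```

Passing `now` as a parameter instead of calling `new Date()` inside keeps the function deterministic, which helps when testing scheduled runs.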

### Dry Run

- `--dry-run` processes the most recent N emails (default: 5) without writing to the spreadsheet
- Shows what would be extracted, categorized, and written
- Dry run does not update the state file or fetch destination pages for enrichment unless `--dry-run` is combined with `--enrich-only`

### Error Handling

- **Critical errors** (Gmail auth failure, spreadsheet write failure, config errors) → stop execution
- **Individual errors** (one link fails to enrich, one email fails to parse) → log and continue
- Summary at end includes error counts and details

### Progress & Logging

- Progress bar during processing (emails fetched, links extracted, enrichment status)
- Summary stats at the end: newsletters processed, links extracted, duplicates found, dead links, sponsors, errors

## CLI Interface

### Commands

```
nlc init              # Interactive setup: OAuth, config file, connectivity test
nlc run [flags]       # Main processing command
```

### `nlc run` Flags

| Flag | Description | Default |
|---|---|---|
| `--full` | Reprocess all emails, not just new ones | false |
| `--dry-run [N]` | Process most recent N emails without writing to sheet | 5 |
| `--from YYYY-MM-DD` | Process emails from this date | (none) |
| `--to YYYY-MM-DD` | Process emails up to this date | (none) |
| `--last N` | Process emails from last N days (e.g., `--last 30d`) | (none) |
| `--skip-enrich` | Skip the enrichment phase (only extract + categorize) | false |
| `--enrich-only` | Only run enrichment on already-extracted links | false |
| `--config PATH` | Path to config file | `./config.yaml` |
| `--verbose` | Detailed per-email and per-link output | false |

## Configuration

### File Format: YAML

Location: `./config.yaml` (overridable with `--config`)

### Sample Structure

```yaml
# Gmail settings
gmail:
  folder: "Newsletters"  # Gmail label/folder to process
  credentials: "~/.nlc/gmail-credentials.json"
  token: "~/.nlc/gmail-token.json"

# Output settings
output:
  name: "Newsletter Link Catalog"  # Spreadsheet name
  sheets_api:
    enabled: true
    credentials: "~/.nlc/sheets-credentials.json"
    token: "~/.nlc/sheets-token.json"
  excel:
    enabled: true
    path: "./output/newsletter-catalog.xlsx"

# Newsletter identification
newsletters:
  # Manual overrides for parsed display names
  "alex@bytebytego.com":
    display_name: "ByteByteGo"
  "dan@techtakesweekly.com":
    display_name: "Tech Takes Weekly"

# Link processing
links:
  unwrap_redirects: true
  strip_utm: true
  tracking_params:
    - "utm_*"
    - "fbclid"
    - "gclid"
    - "mc_cid"
    - "mc_eid"
  redirect_limit: 5
  read_more_pattern: "(?i)^(read more|continue reading|learn more)$"
  share_patterns:
    - "(?i)share"
    - "(?i)forward to a friend"
  sponsor_markers:
    - "(?i)sponsor"
    - "(?i)sponsored"
    - "(?i)advertisement"
    - "(?i)partner"
  filter_unsubscribe: true
  filter_social_footer: true
  filter_share_links: true
  merge_read_more: true

# Categorization
categories:
  # Built-in taxonomy is used by default; extend here
  custom:
    - "AI/ML"
    - "Career"
    - "Rust"
  # LLM settings for category inference
  llm:
    provider: "anthropic"  # anthropic | openai | local | openai-compatible
    model: "claude-sonnet-4-6"
    api_key_env: "ANTHROPIC_API_KEY"
    base_url: null  # for local/openai-compatible
    failure_category: "Uncategorized"

# Enrichment
enrichment:
  enabled: true
  concurrency: 3
  delay_ms: 1500
  retries: 2
  timeout_ms: 10000

# Rate limiting (applies to both Gmail API and enrichment)
rate_limit:
  gmail_qps: 5         # queries per second to Gmail API
  link_concurrency: 3  # parallel link fetches

# State
state_file: "~/.nlc/state.json"

# Parsing plugins
plugins:
  substack:
    enabled: true
```

### Issue Date Override

For newsletters where the date in the email's Date header doesn't match the issue date, overrides can be configured:

```yaml
newsletters:
  "sender@domain.com":
    display_name: "Newsletter Name"
    date_override: "subject"   # Parse date from subject line
    date_format: "%B %d, %Y"   # Expected date format in subject
```

## Data Flow

```
┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Gmail API  │────▶│  Parse HTML  │────▶│  Categorize  │────▶│  Write Sheet │
│  (fetch)    │     │  + Extract   │     │  (hybrid)    │     │  (Phase 1)   │
└─────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
                           │                                         │
                           ▼                                         ▼
                    ┌──────────────┐                          ┌──────────────┐
                    │  State File  │                          │  Enrichment  │
                    │  (processed  │                          │  (Phase 2)   │
                    │   tracking)  │                          │  Page titles │
                    └──────────────┘                          └──────────────┘
```

## Edge Cases

| Scenario | Behavior |
|---|---|
| Email is a single image with no links | Skip with warning, log to state |
| "View in browser" link instead of content | Fetch the first matching mirror link, extract links from that HTML |
| Same link in multiple newsletters | Keep all occurrences, cross-reference via "Also In" column |
| Same link multiple times in one issue | Deduplicate per-issue; single row per unique URL |
| Link returns 4xx/5xx during enrichment | Move to Dead Links sheet |
| Link is paywalled/auth-required | Include in content sheet, mark Page Title + Meta as "[paywall]" |
| Link times out or has a network error | Include in content sheet, mark Page Title + Meta as "[unreachable: error_type]" |
| Newsletter name > 100 chars | Truncate for sheet name |
| Sheet already exists for newsletter | Append new rows, don't overwrite existing data |
| Gmail API rate limit | Retry with exponential backoff |
| OAuth token expired | Auto-refresh, re-prompt if refresh fails |
| Newsletter format changes | Parser falls back to generic HTML extraction |

## Setup & First Run

1. **`nlc init`** runs an interactive walkthrough:
   - Authenticate with Gmail (OAuth browser flow)
   - Authenticate with Google Sheets (if using Sheets output)
   - Select the Gmail folder/label to process
   - Configure output location
   - Test connectivity
   - Generate `config.yaml`

2. **`nlc run --dry-run`**: Test with the 5 most recent emails

3. **`nlc run`**: Full pipeline run (extract, categorize, write, and enrich)

4. **`nlc run --enrich-only`**: Enrich previously extracted links with page titles