✨feature: First push to git
+158
@@ -0,0 +1,158 @@
|
||||
/goal
|
||||
|
||||
<task>
|
||||
You are an autonomous senior engineer working in:
|
||||
C:\Users\ksolo\Projects\Misc Projects\Newletter Link Catalog
|
||||
|
||||
Implement the Newsletter Link Catalog CLI described in SPEC.md end-to-end.
|
||||
|
||||
The expected product is a TypeScript/Node.js CLI named `nlc` with:
|
||||
- `nlc init`
|
||||
- `nlc run [flags]`
|
||||
- Gmail OAuth browser auth and local token persistence
|
||||
- config-driven Gmail label/folder processing
|
||||
- HTML newsletter parsing and link extraction
|
||||
- noise filtering, tracking URL cleanup, redirect unwrapping, read-more merging
|
||||
- hybrid categorization using section headers, rules, and optional LLM providers
|
||||
- parser plugin architecture with a generic parser and Substack plugin
|
||||
- Google Sheets and local `.xlsx` outputs
|
||||
- incremental state tracking in JSON
|
||||
- enrichment pass for page title/meta, dead-link handling, paywall/unreachable markers
|
||||
- dry-run, date filtering, full reprocess, skip-enrich, enrich-only, config, and verbose flags
|
||||
- standalone binary build script and documentation for the selected bundling tool
|
||||
|
||||
Use SPEC.md as the source of truth. If existing code conflicts with SPEC.md, prefer SPEC.md unless repo-local instructions explicitly require otherwise.
|
||||
|
||||
The repository-level working agreements mention PHP tooling, but this project spec is TypeScript/Node.js. Apply the relevant JS quality rules: ESLint airbnb/base, Prettier, tests, secure input/output handling, and CI-style validation. Do not add PHP tooling unless PHP files already exist and require it.
|
||||
</task>
|
||||
|
||||
<goal>
|
||||
Build a production-quality CLI that meets the SPEC.md requirements, adheres to the working agreements, and can be used by the repository owner to catalog newsletter links from Gmail into Google Sheets with confidence in correctness, safety, and maintainability.
|
||||
</goal>
|
||||
|
||||
<default_follow_through_policy>
|
||||
Default to the most reasonable low-risk interpretation and keep going.
|
||||
Only stop to ask when a missing detail changes correctness, safety, external credentials, or an irreversible action.
|
||||
When external services or credentials are unavailable, implement the integration boundary, tests, mocks, and clear setup docs instead of blocking.
|
||||
</default_follow_through_policy>
|
||||
|
||||
<completeness_contract>
|
||||
Persist until the task is fully handled end-to-end within the current turn whenever feasible: do not stop at analysis or partial fixes.
|
||||
Treat the task as incomplete until every major SPEC.md behavior is implemented, tested, documented, or explicitly marked [blocked] with evidence.
|
||||
Before finishing, reconcile every plan item: Done, Blocked, or Cancelled. Never leave items in-progress.
|
||||
Do not claim completion until validation has run and failures are fixed or explained with concrete blocker evidence.
|
||||
</completeness_contract>
|
||||
|
||||
<missing_context_gating>
|
||||
Read SPEC.md and inspect the repository before planning implementation.
|
||||
Do not guess repository structure, package manager, test framework, or existing scripts. Retrieve them with tools.
|
||||
If the repo is empty or nearly empty, scaffold a TypeScript CLI project using npm unless an existing package manager is present.
|
||||
If credentials, live Gmail, Google Sheets, or LLM API keys are missing, use mocks/fakes for automated tests and document the required environment variables and setup.
|
||||
</missing_context_gating>
|
||||
|
||||
<tool_persistence_rules>
|
||||
Prefer dedicated tools over raw shell where available: rg, read_file/list_dir equivalents, apply_patch, and update_plan.
|
||||
Use rg or rg --files for search.
|
||||
Parallelize independent file reads; sequence dependent actions.
|
||||
Use apply_patch for manual edits.
|
||||
Keep using tools until you have enough evidence to finish confidently.
|
||||
</tool_persistence_rules>
|
||||
|
||||
<implementation_requirements>
|
||||
Implement a clean modular architecture, with separate modules for:
|
||||
- CLI command parsing
|
||||
- config loading and validation
|
||||
- Gmail OAuth/auth/client access
|
||||
- Gmail message fetching by configured label
|
||||
- HTML parsing and extraction
|
||||
- noise filtering
|
||||
- URL normalization, redirect unwrapping, and tracking parameter stripping
|
||||
- categorization
|
||||
- LLM provider adapters
|
||||
- parser plugins
|
||||
- spreadsheet writers
|
||||
- enrichment
|
||||
- state persistence
|
||||
- logging/progress reporting
|
||||
|
||||
Implement provider adapters for:
|
||||
- Anthropic
|
||||
- OpenAI
|
||||
- local/Ollama or LM Studio style endpoints
|
||||
- OpenAI-compatible endpoints
|
||||
|
||||
Implement output writers for:
|
||||
- Google Sheets API
|
||||
- local Excel `.xlsx`
|
||||
|
||||
Implement tests for core behavior without requiring live external services:
|
||||
- config validation
|
||||
- date filter conflict handling
|
||||
- sheet-name sanitization/truncation
|
||||
- URL cleanup and tracking parameter stripping
|
||||
- read-more link merging
|
||||
- noise filtering
|
||||
- sponsor detection
|
||||
- section-header categorization
|
||||
- fallback rule categorization
|
||||
- state-file incremental behavior
|
||||
- dead/paywall/unreachable enrichment handling
|
||||
- dry-run state/write suppression
|
||||
- parser plugin selection, including Substack
|
||||
</implementation_requirements>
|
||||
|
||||
<action_safety>
|
||||
Keep changes tightly scoped to building this CLI.
|
||||
Avoid unrelated refactors, renames, or cleanup.
|
||||
Do not run destructive git commands such as reset --hard or checkout -- without explicit approval.
|
||||
Never commit secrets, tokens, OAuth credentials, spreadsheet IDs, or user data.
|
||||
Persist tokens only in documented local paths such as ~/.nlc.
|
||||
Sanitize config and CLI inputs. Escape or safely encode spreadsheet cell values that could become formulas.
|
||||
Handle critical errors by stopping with a useful message. Handle individual email/link failures by logging and continuing.
|
||||
</action_safety>
|
||||
|
||||
<verification_loop>
|
||||
Required validations:
|
||||
- npm install
|
||||
- npm run lint
|
||||
- npm run format:check
|
||||
- npm run typecheck
|
||||
- npm test
|
||||
- npm run build
|
||||
- npm run smoke
|
||||
|
||||
If the package scripts do not exist yet, create them.
|
||||
`npm run build` must compile the TypeScript project and produce the standalone binary or packaged executable artifact described in the docs.
|
||||
`npm run smoke` must exercise the CLI without live credentials, at minimum:
|
||||
- `nlc --help`
|
||||
- `nlc init --help`
|
||||
- `nlc run --help`
|
||||
- a dry-run or fixture-backed run path that proves parsing/output orchestration works without mutating real Gmail or Sheets.
|
||||
|
||||
Before finalizing, run the required validations.
|
||||
If a check fails, fix the cause and rerun until green or until a real external blocker remains.
|
||||
Report any unavailable live-service validation separately from local automated validation.
|
||||
</verification_loop>
|
||||
|
||||
<progress_updates>
|
||||
For long work, give brief progress updates after meaningful milestones:
|
||||
- repo inspection complete
|
||||
- implementation plan formed
|
||||
- core modules scaffolded
|
||||
- tests added
|
||||
- validations running
|
||||
- final validation result
|
||||
Keep updates concise and outcome-based.
|
||||
</progress_updates>
|
||||
|
||||
<structured_output_contract>
|
||||
Final report exactly in this order:
|
||||
1. Summary: 2-4 bullets describing what was built.
|
||||
2. Changed files: one line per important file or directory.
|
||||
3. Validations: each command run and its result.
|
||||
4. Blockers or residual risks: include only real remaining issues.
|
||||
5. Next operational steps: credential/setup steps needed for live Gmail or Google Sheets use.
|
||||
|
||||
Keep the final report compact and highest-signal first.
|
||||
</structured_output_contract>
|
||||
+378
@@ -0,0 +1,378 @@
|
||||
# Newsletter Link Catalog — Specification
|
||||
|
||||
## Overview
|
||||
|
||||
A CLI tool that extracts links from newsletters in a designated Gmail folder, categorizes them, enriches them with metadata, and compiles them into a spreadsheet. Each newsletter gets its own sheet, links are organized by issue date and category, and sponsor links are tracked separately.
|
||||
|
||||
## Architecture
|
||||
|
||||
### Language & Runtime
|
||||
- **TypeScript/Node.js** — compiled to a standalone binary by the project build script
|
||||
- CLI tool invoked as `nlc run [flags]`
|
||||
|
||||
### Distribution
|
||||
- Standalone binary — no Node runtime required on the host machine
|
||||
- Built and packaged via CI or build script
|
||||
- The build script must document the selected bundling tool and produce the binary from a clean checkout
|
||||
|
||||
### Run Modes
|
||||
- **Manual**: Run `nlc run` on demand with optional date filters
|
||||
- **Scheduled**: Can be run via cron/Task Scheduler for recurring processing
|
||||
- Designed for both; no daemon mode required
|
||||
|
||||
## Gmail Integration
|
||||
|
||||
### Authentication
|
||||
- **OAuth2 browser flow** — user authorizes via browser, tokens persisted locally
|
||||
- `nlc init` command walks through OAuth setup interactively
|
||||
|
||||
### Scope
|
||||
- Processes emails from a **single designated Gmail folder/label** (configured in `config.yaml`)
|
||||
- Does not scan the entire inbox or search by sender patterns
|
||||
|
||||
### Email Processing
|
||||
- **HTML only** — plain-text parts are ignored
|
||||
- **Image-only emails** (single image, no extractable links) are skipped with a warning logged
|
||||
- **"View in browser" emails** — if the email contains no content links after noise filtering and contains a mirror link with anchor text matching `view in browser`, `view online`, or `read online`, fetch that mirror URL and extract links from the fetched HTML instead
|
||||
- Incremental by default: tracks processed Message-IDs in a local state file, only processes new emails
|
||||
- `--full` flag forces reprocessing of all emails that match the configured label and any date filters
|
||||
|
||||
## Link Extraction & Processing
|
||||
|
||||
### Extraction Pipeline
|
||||
1. Fetch emails from the configured Gmail folder (incremental or full)
|
||||
2. Parse HTML to extract links, section headers, and surrounding text. A section header is the nearest preceding heading-like element (`h1`-`h6`, table row header, or bold standalone line) within the same content block.
|
||||
3. Filter out noise links: unsubscribe, social footer icons, "share this newsletter" links
|
||||
4. Unwrap supported tracking redirects and strip configured tracking query parameters — store the normalized destination URL
|
||||
5. Merge "Read more" links with their preceding content (detected by: consecutive links with the same normalized URL and anchor text matching the configured read-more pattern)
|
||||
6. Categorize each link (see Categorization section)
|
||||
7. Write to spreadsheet (see Output section)
|
||||
|
||||
### Noise Filtering
|
||||
The following link types are **excluded** from content sheets:
|
||||
- Unsubscribe links
|
||||
- Social media links in footer or sharing blocks
|
||||
- Links whose anchor text or accessible label matches configured share/forward patterns
|
||||
- "View in browser" mirror links (content is extracted from the web version instead)
|
||||
|
||||
Sponsor/ad links are **not filtered** — they go to a separate sheet when the link is inside a block labeled with configured sponsor markers such as "sponsor", "sponsored", "ad", "advertisement", or "partner".
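The filtering and sponsor routing described above could be sketched roughly as follows. This is only an illustration: the `RawLink` shape, the `blockLabel` and `inFooter` fields, the `SOCIAL_HOSTS` list, and the `classifyLink` name are all assumptions, and the patterns are hard-coded here where the real tool would read them from config.

```typescript
type LinkKind = "content" | "sponsor" | "noise";

// Hypothetical shape of a link as it leaves the HTML parser.
interface RawLink {
  url: string;
  anchorText: string;
  blockLabel?: string; // assumed: label text of the enclosing block, if any
  inFooter?: boolean; // assumed: parser flags footer/sharing blocks
}

const MIRROR_PATTERN = /^(view in browser|view online|read online)$/i;
const SHARE_PATTERNS: RegExp[] = [/share/i, /forward to a friend/i];
const SPONSOR_MARKERS: RegExp[] = [/sponsor/i, /\bad\b/i, /advertisement/i, /partner/i];
const SOCIAL_HOSTS: string[] = ["twitter.com", "x.com", "linkedin.com", "facebook.com"];

function hostOf(url: string): string {
  try {
    return new URL(url).hostname.toLowerCase();
  } catch {
    return ""; // unparseable URLs simply never match a social host
  }
}

function classifyLink(link: RawLink): LinkKind {
  const text = link.anchorText.trim();
  const host = hostOf(link.url);
  if (/unsubscribe/i.test(text) || /unsubscribe/i.test(link.url)) return "noise";
  if (MIRROR_PATTERN.test(text)) return "noise";
  if (SHARE_PATTERNS.some((p) => p.test(text))) return "noise";
  const isSocial = SOCIAL_HOSTS.some((h) => host === h || host.endsWith(`.${h}`));
  if (link.inFooter === true && isSocial) return "noise";
  // Sponsor links are kept, but routed to the consolidated sponsor sheet.
  const label = link.blockLabel ?? "";
  if (SPONSOR_MARKERS.some((p) => p.test(label))) return "sponsor";
  return "content";
}
```

Note that a broad share pattern such as `share` also matches ordinary headlines that contain the word, so tighter configured patterns may be worth considering.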
|
||||
|
||||
### URL Handling
|
||||
- Unwrap HTTP redirects and supported provider redirect URLs up to the configured redirect limit
|
||||
- Strip configured tracking query parameters, including `utm_*`, `fbclid`, `gclid`, `mc_cid`, `mc_eid`, and provider-specific tracking parameters listed in config
|
||||
- Store the normalized destination URL after redirect unwrapping and query cleanup
|
||||
- Dead/broken links (4xx/5xx during enrichment) are written to the "Dead Links" sheet and removed from content sheets when they were already written by an earlier phase or run
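
As a minimal sketch of the query-cleanup step, assuming a trailing `*` in a configured entry means "prefix match" (the spec lists `utm_*` but does not define wildcard semantics), tracking parameters could be stripped like this. The function names are hypothetical, and redirect unwrapping is out of scope for the example.

```typescript
// Returns true when a query parameter name matches a configured entry.
// Assumption: "utm_*" style entries are prefix wildcards; others match exactly.
function paramMatches(name: string, pattern: string): boolean {
  const lowerName = name.toLowerCase();
  const lowerPattern = pattern.toLowerCase();
  return lowerPattern.endsWith("*")
    ? lowerName.startsWith(lowerPattern.slice(0, -1))
    : lowerName === lowerPattern;
}

// Removes configured tracking parameters and returns the cleaned URL string.
function stripTrackingParams(rawUrl: string, patterns: string[]): string {
  const url = new URL(rawUrl);
  const names: string[] = [];
  // Collect names first so deletion does not happen during iteration.
  url.searchParams.forEach((_value, name) => {
    names.push(name);
  });
  for (const name of names) {
    if (patterns.some((p) => paramMatches(name, p))) {
      url.searchParams.delete(name);
    }
  }
  return url.toString();
}
```

For example, `stripTrackingParams("https://example.com/post?id=7&utm_source=nl&fbclid=abc", ["utm_*", "fbclid"])` keeps only the `id` parameter.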
|
||||
|
||||
### "Read More" Merging
|
||||
When two consecutive extracted links point to the same normalized URL and one anchor text matches the configured read-more pattern, they are merged into a single entry combining the preceding link title/description with the read-more link URL.
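
One possible reading of this rule, sketched under the assumption that URLs have already been normalized and that the read-more pattern has been compiled to a `RegExp`; the `ExtractedLink` shape and `mergeReadMore` name are illustrative only:

```typescript
// Hypothetical common format for a link after extraction and URL cleanup.
interface ExtractedLink {
  url: string;
  title: string;
  description?: string;
}

const DEFAULT_READ_MORE = /^(read more|continue reading|learn more)$/i;

// Collapses consecutive links with the same URL when one of them is a
// read-more link, keeping the descriptive title and description.
function mergeReadMore(
  links: ExtractedLink[],
  pattern: RegExp = DEFAULT_READ_MORE,
): ExtractedLink[] {
  const merged: ExtractedLink[] = [];
  for (const link of links) {
    const prev = merged[merged.length - 1];
    const sameUrl = prev !== undefined && prev.url === link.url;
    if (sameUrl && pattern.test(link.title.trim())) {
      // The preceding entry already carries the real title; drop this one.
      continue;
    }
    if (prev !== undefined && sameUrl && pattern.test(prev.title.trim())) {
      // The read-more link came first: replace it with the descriptive one.
      merged[merged.length - 1] = {
        ...link,
        description: link.description ?? prev.description,
      };
      continue;
    }
    merged.push({ ...link });
  }
  return merged;
}
```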
|
||||
|
||||
## Categorization
|
||||
|
||||
### Strategy: Hybrid
|
||||
1. **Primary**: Use the newsletter's own section headers (e.g., "Python", "DevOps", "Career") as categories
|
||||
2. **Fallback**: When section headers aren't available or don't cover a link, use rule-based classification (URL patterns + keywords)
|
||||
3. **Final fallback**: LLM-based categorization when rules don't match
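
The first two tiers of this strategy can be sketched as a pure function that reports when the LLM fallback is still needed. The `CategoryRule` and `LinkContext` shapes are assumptions made for the example, and the LLM call itself is deliberately left out because it would be asynchronous and provider-specific.

```typescript
// Hypothetical rule shape: a rule fires on a URL pattern or a title keyword.
interface CategoryRule {
  category: string;
  urlPattern?: RegExp;
  keywords?: string[];
}

interface LinkContext {
  url: string;
  title: string;
  sectionHeader?: string;
}

interface CategoryDecision {
  category: string | null;
  source: "header" | "rule" | "needs-llm";
}

function categorize(link: LinkContext, rules: CategoryRule[]): CategoryDecision {
  // Tier 1: the newsletter's own section header wins when present.
  const header = (link.sectionHeader ?? "").trim();
  if (header.length > 0) return { category: header, source: "header" };

  // Tier 2: first matching rule (URL pattern or keyword in the title).
  const title = link.title.toLowerCase();
  for (const rule of rules) {
    const urlHit = rule.urlPattern !== undefined && rule.urlPattern.test(link.url);
    const keywordHit = (rule.keywords ?? []).some((k) => title.includes(k.toLowerCase()));
    if (urlHit || keywordHit) return { category: rule.category, source: "rule" };
  }

  // Tier 3: caller decides whether to ask the configured LLM provider.
  return { category: null, source: "needs-llm" };
}
```

Returning a `needs-llm` marker keeps this function synchronous and easy to test without live provider credentials.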
|
||||
|
||||
### Category Taxonomy
|
||||
- Built-in base taxonomy shipped with the tool for common dev categories (Python, JavaScript, DevOps, Security, etc.)
|
||||
- User can extend via config with custom categories
|
||||
- For fallback categorization, the LLM is instructed to prefer configured categories and may create a new category only when no existing category fits
|
||||
|
||||
### LLM Provider Support (BYOK)
|
||||
The tool supports a provider adapter interface and ships adapters for:
|
||||
- **Claude/Anthropic** — Anthropic API
|
||||
- **OpenAI/GPT** — OpenAI API
|
||||
- **Local models** — Ollama, LM Studio
|
||||
- **OpenAI-compatible endpoints** — Mistral, Groq, Together, etc.
|
||||
|
||||
Provider config includes: API key environment variable, base URL when required, model name, and optional provider parameters.
|
||||
|
||||
### Newsletter Parsing: Plugin System
|
||||
- Generic HTML parser as the default
|
||||
- Platform-specific parsers loaded as plugins (detected by URL patterns or email headers)
|
||||
- **Substack** shipped as the first plugin — maps Substack-specific HTML structures to the common extracted-link format
|
||||
- Additional parsers can be added as plugins without modifying core logic
|
||||
|
||||
## Output: Spreadsheet
|
||||
|
||||
### Supported Formats
|
||||
- **Google Sheets** — via Google Sheets API (live, shareable, updated by each write run)
|
||||
- **Local Excel (.xlsx)** — written to disk, can be uploaded manually
|
||||
|
||||
Config selects which output(s) to use; both can be active simultaneously.
|
||||
|
||||
### Spreadsheet Name
|
||||
- Fixed name set in `config.yaml` (e.g., "Newsletter Link Catalog")
|
||||
|
||||
### Sheet Naming
|
||||
- Each newsletter gets its own sheet named after the parsed display name from the email's From header
|
||||
- Names truncated to fit Google Sheets' 100-character limit
|
||||
- Characters invalid for Google Sheets or Excel sheet names are replaced with spaces, then repeated whitespace is collapsed
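
A small sketch of these naming rules follows. The invalid-character set used here is the one Excel rejects in worksheet names (`\ / ? * [ ] :`); whether Google Sheets needs additional characters is not stated in the spec. The `"Untitled"` fallback and the `maxLength` parameter are assumptions. Excel itself limits worksheet names to 31 characters, so the `.xlsx` writer would likely need to pass a smaller limit.

```typescript
// Characters Excel does not allow in worksheet names.
const INVALID_SHEET_CHARS = /[\\\/?*[\]:]/g;

function sanitizeSheetName(displayName: string, maxLength: number = 100): string {
  const cleaned = displayName
    .replace(INVALID_SHEET_CHARS, " ") // replace invalid characters with spaces
    .replace(/\s+/g, " ") // collapse repeated whitespace
    .trim();
  // Truncate by code point so surrogate pairs are not split in half.
  const truncated = Array.from(cleaned).slice(0, maxLength).join("").trim();
  // Assumption: fall back to a placeholder when nothing usable remains.
  return truncated.length > 0 ? truncated : "Untitled";
}
```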
|
||||
|
||||
### Content Sheet Columns
|
||||
Every link occurrence is written as a flat row; blank grouping rows are not used. Fields unavailable from the source are written as empty cells.
|
||||
|
||||
| Column | Description |
|---|---|
| Issue Date | Date from email's Date header (overridable per-newsletter) |
| Category | Assigned category (from newsletter sections, rules, or LLM) |
| Link URL | Clean canonical URL after unwrapping and UTM removal |
| Title | Anchor text / headline from the newsletter |
| Description | 1-2 sentence description from the newsletter (if present) |
| Page Title + Meta | `<title>` and meta description from the destination page (enrichment phase) |
| Source Newsletter | Name of the newsletter this link came from |
| Also In | Cross-reference: other newsletters that also mentioned this link |
|
||||
|
||||
### Sponsor Sheet (Consolidated)
|
||||
Single sheet named "Sponsored Links" containing sponsor/ad links from all newsletters:
|
||||
|
||||
| Column | Description |
|---|---|
| Newsletter | Which newsletter this sponsor link appeared in |
| Sponsor | Sponsor name (parsed from newsletter) |
| Link | Sponsor's link URL |
| Description | Sponsor description from the newsletter |
|
||||
|
||||
### Dead Links Sheet
|
||||
Single sheet named "Dead Links" for links that returned errors during enrichment:
|
||||
|
||||
| Column | Description |
|---|---|
| URL | The clean canonical URL |
| Status | HTTP error status returned during enrichment (404, 403, 500, etc.) |
| Source | Newsletter name |
| Date | Issue date |
|
||||
|
||||
### Cross-References
|
||||
- Duplicates across newsletters are kept in their respective sheets (all occurrences preserved)
|
||||
- The **Also In** column annotates each row with other newsletter issues that mentioned the same normalized URL, formatted as `Newsletter Name (YYYY-MM-DD)` and joined with `; `
|
||||
- This enables finding cross-newsletter coverage without a separate consolidated sheet
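
The **Also In** value could be computed along these lines. The `Occurrence` shape and function name are hypothetical, and the sketch assumes "other newsletter issues" means every occurrence except those from the same newsletter on the same issue date.

```typescript
// Hypothetical record of one link occurrence; issueDate is YYYY-MM-DD.
interface Occurrence {
  url: string; // normalized URL
  newsletter: string;
  issueDate: string;
}

// Returns one "Also In" cell value per input occurrence, in the same order.
function buildAlsoIn(occurrences: Occurrence[]): string[] {
  const byUrl = new Map<string, Occurrence[]>();
  for (const occ of occurrences) {
    const group = byUrl.get(occ.url) ?? [];
    group.push(occ);
    byUrl.set(occ.url, group);
  }
  return occurrences.map((occ) => {
    const others = (byUrl.get(occ.url) ?? []).filter(
      (o) => !(o.newsletter === occ.newsletter && o.issueDate === occ.issueDate),
    );
    return others.map((o) => `${o.newsletter} (${o.issueDate})`).join("; ");
  });
}
```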
|
||||
|
||||
### No "All Links" Master Sheet
|
||||
Only per-newsletter content sheets, plus the consolidated Sponsor and Dead Links sheets. No "All Links" aggregation sheet.
|
||||
|
||||
## Enrichment
|
||||
|
||||
### Two-Phase Approach
|
||||
1. **Phase 1 (Store)**: Extract links from newsletters, categorize, and write to spreadsheet with all available in-newsletter metadata
|
||||
2. **Phase 2 (Enrich)**: Separate pass to fetch each link's destination page for `<title>` and meta description
|
||||
|
||||
Enrichment can be run independently from extraction and spreadsheet writing.
|
||||
|
||||
### Enrichment Details
|
||||
- Configurable concurrency with defaults of 3 parallel requests and 1500 ms delay between batches
|
||||
- Retries on transient failures
|
||||
- Dead links (4xx/5xx) are written to the Dead Links sheet and removed from content sheets when they were already written by an earlier phase or run
|
||||
- Skip pages that redirect to a URL whose path or query contains `login`, `signin`, `subscribe`, or `paywall` — mark with "paywall" status
|
||||
- Progress bar updates after each completed enrichment request
|
||||
|
||||
### Link Liveness
|
||||
- Dead links are **not included** in content sheets — they go to the Dead Links sheet
|
||||
- Paywalled links are included in content sheets and the Page Title + Meta column is set to `[paywall]`
|
||||
- Timeout, DNS, TLS, and network failures are included in content sheets and the Page Title + Meta column is set to `[unreachable: error_type]`
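
The routing rules above can be expressed as a small classifier. This is a sketch under stated assumptions: the `FetchOutcome` shape is invented for the example, the fetch itself is not shown, and joining title and meta description with `" | "` is an arbitrary choice the spec does not prescribe.

```typescript
// Hypothetical result of fetching a destination page.
type FetchOutcome =
  | { kind: "response"; status: number; finalUrl: string; title?: string; metaDescription?: string }
  | { kind: "error"; errorType: "timeout" | "dns" | "tls" | "network" };

type EnrichmentResult =
  | { sheet: "content"; pageTitleMeta: string }
  | { sheet: "dead"; status: string };

const PAYWALL_HINT = /login|signin|subscribe|paywall/i;

function classifyOutcome(requestedUrl: string, outcome: FetchOutcome): EnrichmentResult {
  // Network-level failures stay in the content sheet with a marker.
  if (outcome.kind === "error") {
    return { sheet: "content", pageTitleMeta: `[unreachable: ${outcome.errorType}]` };
  }
  // A redirect whose path or query looks like a login or paywall page.
  if (outcome.finalUrl !== requestedUrl) {
    const target = new URL(outcome.finalUrl);
    if (PAYWALL_HINT.test(target.pathname + target.search)) {
      return { sheet: "content", pageTitleMeta: "[paywall]" };
    }
  }
  // 4xx/5xx responses go to the Dead Links sheet.
  if (outcome.status >= 400) {
    return { sheet: "dead", status: String(outcome.status) };
  }
  const parts = [outcome.title, outcome.metaDescription].filter(
    (part): part is string => typeof part === "string" && part.length > 0,
  );
  return { sheet: "content", pageTitleMeta: parts.join(" | ") };
}
```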
|
||||
|
||||
## Processing Model
|
||||
|
||||
### Incremental Processing
|
||||
- Local state file (JSON) tracks processed Message-IDs and enrichment status
|
||||
- On subsequent runs, only new/unprocessed emails are fetched
|
||||
- `--full` flag forces reprocessing of all emails that match the configured label and any date filters
|
||||
- State file location: `~/.nlc/state.json` (or configured path)
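
A minimal sketch of the incremental check, assuming the state file holds a flat list of processed Message-IDs (the actual state schema, including enrichment status, is not defined in this spec). Reading and writing the JSON file is left out so the logic stays testable without touching disk.

```typescript
// Hypothetical, simplified state shape.
interface RunState {
  processedMessageIds: string[];
}

// Picks the messages to process: everything with --full, otherwise only new IDs.
function selectUnprocessed(candidateIds: string[], state: RunState, full: boolean): string[] {
  if (full) return [...candidateIds];
  const seen = new Set(state.processedMessageIds);
  return candidateIds.filter((id) => !seen.has(id));
}

// Returns the next state; a dry run leaves the state untouched.
function recordProcessed(state: RunState, newIds: string[], dryRun: boolean): RunState {
  if (dryRun) return state;
  const merged = new Set([...state.processedMessageIds, ...newIds]);
  return { processedMessageIds: Array.from(merged) };
}
```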
|
||||
|
||||
### Date Filtering
|
||||
- `--from YYYY-MM-DD` and `--to YYYY-MM-DD` — absolute date range
|
||||
- `--last N` (e.g., `--last 30d`, `--last 7d`) — relative date range
|
||||
- Date filters apply before the incremental processed-message check
|
||||
- If both `--last` and `--from`/`--to` are provided, the CLI exits with a config error
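
The conflict rule and the relative range could be resolved as sketched below. The helper names are hypothetical, dates are interpreted as UTC midnight (an assumption; the spec does not name a time zone), and only the `Nd` form of `--last` is handled because that is the only form the spec shows.

```typescript
interface DateFlags {
  from?: string;
  to?: string;
  last?: string;
}

interface DateRange {
  from?: Date;
  to?: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

function parseDay(value: string, flag: string): Date {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) {
    throw new Error(`${flag} expects YYYY-MM-DD, got "${value}"`);
  }
  const parsed = new Date(`${value}T00:00:00Z`);
  if (Number.isNaN(parsed.getTime())) {
    throw new Error(`${flag} is not a valid date: "${value}"`);
  }
  return parsed;
}

function resolveDateRange(flags: DateFlags, now: Date = new Date()): DateRange {
  const hasAbsolute = flags.from !== undefined || flags.to !== undefined;
  if (flags.last !== undefined && hasAbsolute) {
    // Mirrors the rule above: mixing relative and absolute filters is an error.
    throw new Error("--last cannot be combined with --from/--to");
  }
  if (flags.last !== undefined) {
    const match = /^(\d+)d$/.exec(flags.last);
    if (match === null) {
      throw new Error(`--last expects a value like 30d, got "${flags.last}"`);
    }
    const days = Number(match[1]);
    return { from: new Date(now.getTime() - days * DAY_MS) };
  }
  const range: DateRange = {};
  if (flags.from !== undefined) range.from = parseDay(flags.from, "--from");
  if (flags.to !== undefined) range.to = parseDay(flags.to, "--to");
  return range;
}
```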
|
||||
|
||||
### Dry Run
|
||||
- `--dry-run` processes the most recent N emails (default: 5) without writing to the spreadsheet
|
||||
- Shows what would be extracted, categorized, and written
|
||||
- Dry run does not update the state file or call destination pages for enrichment unless `--dry-run` is combined with `--enrich-only`
|
||||
|
||||
### Error Handling
|
||||
- **Critical errors** (Gmail auth failure, spreadsheet write failure, config errors) → stop execution
|
||||
- **Individual errors** (one link fails to enrich, one email fails to parse) → log and continue
|
||||
- Summary at end includes error counts and details
|
||||
|
||||
### Progress & Logging
|
||||
- Progress bar during processing (emails fetched, links extracted, enrichment status)
|
||||
- Summary stats at the end: newsletters processed, links extracted, duplicates found, dead links, sponsors, errors
|
||||
|
||||
## CLI Interface
|
||||
|
||||
### Commands
|
||||
|
||||
```
nlc init              # Interactive setup: OAuth, config file, connectivity test
nlc run [flags]       # Main processing command
```
|
||||
|
||||
### `nlc run` Flags
|
||||
|
||||
| Flag | Description | Default |
|---|---|---|
| `--full` | Reprocess all emails, not just new ones | false |
| `--dry-run [N]` | Process most recent N emails without writing to sheet | 5 |
| `--from YYYY-MM-DD` | Process emails from this date | (none) |
| `--to YYYY-MM-DD` | Process emails up to this date | (none) |
| `--last N` | Process emails from last N days (e.g., `--last 30d`) | (none) |
| `--skip-enrich` | Skip the enrichment phase (only extract + categorize) | false |
| `--enrich-only` | Only run enrichment on already-extracted links | false |
| `--config PATH` | Path to config file | `./config.yaml` |
| `--verbose` | Detailed per-email and per-link output | false |
|
||||
|
||||
## Configuration
|
||||
|
||||
### File Format: YAML
|
||||
Location: `./config.yaml` (overridable with `--config`)
|
||||
|
||||
### Sample Structure
|
||||
|
||||
```yaml
# Gmail settings
gmail:
  folder: "Newsletters" # Gmail label/folder to process
  credentials: "~/.nlc/gmail-credentials.json"
  token: "~/.nlc/gmail-token.json"

# Output settings
output:
  name: "Newsletter Link Catalog" # Spreadsheet name
  sheets_api:
    enabled: true
    credentials: "~/.nlc/sheets-credentials.json"
    token: "~/.nlc/sheets-token.json"
  excel:
    enabled: true
    path: "./output/newsletter-catalog.xlsx"

# Newsletter identification
newsletters:
  # Manual overrides for parsed display names
  "alex@bytebytego.com":
    display_name: "ByteByteGo"
  "dan@techtakesweekly.com":
    display_name: "Tech Takes Weekly"

# Link processing
links:
  unwrap_redirects: true
  strip_utm: true
  tracking_params:
    - "utm_*"
    - "fbclid"
    - "gclid"
    - "mc_cid"
    - "mc_eid"
  redirect_limit: 5
  read_more_pattern: "(?i)^(read more|continue reading|learn more)$"
  share_patterns:
    - "(?i)share"
    - "(?i)forward to a friend"
  sponsor_markers:
    - "(?i)sponsor"
    - "(?i)sponsored"
    - "(?i)advertisement"
    - "(?i)partner"
  filter_unsubscribe: true
  filter_social_footer: true
  filter_share_links: true
  merge_read_more: true

# Categorization
categories:
  # Built-in taxonomy is used by default; extend here
  custom:
    - "AI/ML"
    - "Career"
    - "Rust"
  # LLM settings for category inference
  llm:
    provider: "anthropic" # anthropic | openai | local | openai-compatible
    model: "claude-sonnet-4-6"
    api_key_env: "ANTHROPIC_API_KEY"
    base_url: null # for local/openai-compatible
    failure_category: "Uncategorized"

# Enrichment
enrichment:
  enabled: true
  concurrency: 3
  delay_ms: 1500
  retries: 2
  timeout_ms: 10000

# Rate limiting (applies to both Gmail API and enrichment)
rate_limit:
  gmail_qps: 5 # queries per second to Gmail API
  link_concurrency: 3 # parallel link fetches

# State
state_file: "~/.nlc/state.json"

# Parsing plugins
plugins:
  substack:
    enabled: true
```
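
The sample patterns use a leading `(?i)` inline flag, which the JavaScript `RegExp` constructor rejects as a syntax error. One way a TypeScript implementation might accept such patterns is to translate the prefix into constructor flags, as in this sketch. The `compileConfigPattern` name and the decision to allow only the `i`, `m`, and `s` flags are assumptions.

```typescript
// Compiles a config pattern, translating a leading "(?flags)" prefix
// into RegExp constructor flags.
function compileConfigPattern(source: string): RegExp {
  const prefix = /^\(\?([a-z]+)\)/.exec(source);
  if (prefix === null) return new RegExp(source);
  const flags = prefix[1] ?? "";
  if (!/^[ims]+$/.test(flags)) {
    throw new Error(`Unsupported inline flags "${flags}" in pattern: ${source}`);
  }
  // Drop the prefix and pass its letters as real flags.
  return new RegExp(source.slice(prefix[0].length), flags);
}
```

With this helper, `compileConfigPattern("(?i)^(read more|continue reading|learn more)$")` matches `Read More` case-insensitively, while patterns without a prefix compile unchanged.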
|
||||
|
||||
### Issue Date Override
|
||||
For newsletters where the email arrival date doesn't match the issue date, overrides can be configured:
|
||||
|
||||
```yaml
newsletters:
  "sender@domain.com":
    display_name: "Newsletter Name"
    date_override: "subject" # Parse date from subject line
    date_format: "%B %d, %Y" # Expected date format in subject
```
|
||||
|
||||
## Data Flow
|
||||
|
||||
```
┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Gmail API  │────▶│  Parse HTML  │────▶│  Categorize  │────▶│ Write Sheet  │
│  (fetch)    │     │  + Extract   │     │  (hybrid)    │     │  (Phase 1)   │
└─────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
                           │                                         │
                           ▼                                         ▼
                    ┌──────────────┐                          ┌──────────────┐
                    │  State File  │                          │  Enrichment  │
                    │  (processed  │                          │  (Phase 2)   │
                    │   tracking)  │                          │ Page titles  │
                    └──────────────┘                          └──────────────┘
```
|
||||
|
||||
## Edge Cases
|
||||
|
||||
| Scenario | Behavior |
|---|---|
| Email is a single image with no links | Skip with warning, log to state |
| "View in browser" link instead of content | Fetch the first matching mirror link, extract links from that HTML |
| Same link in multiple newsletters | Keep all occurrences, cross-reference via "Also In" column |
| Same link multiple times in one issue | Deduplicate per-issue; single row per unique URL |
| Link returns 4xx/5xx during enrichment | Move to Dead Links sheet |
| Link is paywalled/auth-required | Include in content sheet, mark Page Title + Meta as "[paywall]" |
| Link times out or has a network error | Include in content sheet, mark Page Title + Meta as "[unreachable: error_type]" |
| Newsletter name > 100 chars | Truncate for sheet name |
| Sheet already exists for newsletter | Append new rows, don't overwrite existing data |
| Gmail API rate limit | Retry with exponential backoff |
| OAuth token expired | Auto-refresh, re-prompt if refresh fails |
| Newsletter format changes | Parser falls back to generic HTML extraction |
|
||||
|
||||
## Setup & First Run
|
||||
|
||||
1. **`nlc init`** — Interactive walkthrough:
|
||||
   - Authenticate with Gmail (OAuth browser flow)
   - Authenticate with Google Sheets (if using Sheets output)
   - Select the Gmail folder/label to process
   - Configure output location
   - Test connectivity
   - Generate `config.yaml`
|
||||
|
||||
2. **`nlc run --dry-run`** — Test with 5 most recent emails
|
||||
|
||||
3. **`nlc run`** — Full processing run
|
||||
|
||||
4. **`nlc run --enrich-only`** — Enrich previously extracted links with page titles
|
||||
@@ -0,0 +1,368 @@
|
||||
# Newsletter Link Catalog — Specification
|
||||
|
||||
## Overview
|
||||
|
||||
A CLI tool that extracts links from newsletters in a designated Gmail folder, categorizes them, enriches them with metadata, and compiles them into a spreadsheet. Each newsletter gets its own sheet, links are organized by issue date and category, and sponsor links are tracked separately.
|
||||
|
||||
## Architecture
|
||||
|
||||
### Language & Runtime
|
||||
- **TypeScript/Node.js** — compiled to a standalone binary via `pkg` or `tsx-bundle`
|
||||
- CLI tool invoked as `nlc run [flags]`
|
||||
|
||||
### Distribution
|
||||
- Standalone binary — no Node runtime required on the host machine
|
||||
- Built and packaged via CI or build script
|
||||
|
||||
### Run Modes
|
||||
- **Manual**: Run `nlc run` on demand with optional date filters
|
||||
- **Scheduled**: Can be run via cron/Task Scheduler for recurring processing
|
||||
- Designed for both; no daemon mode required
|
||||
|
||||
## Gmail Integration
|
||||
|
||||
### Authentication
|
||||
- **OAuth2 browser flow** — user authorizes via browser, tokens persisted locally
|
||||
- `nlc init` command walks through OAuth setup interactively
|
||||
|
||||
### Scope
|
||||
- Processes emails from a **single designated Gmail folder/label** (configured in `config.yaml`)
|
||||
- Does not scan the entire inbox or search by sender patterns
|
||||
|
||||
### Email Processing
|
||||
- **HTML only** — plain-text parts are ignored
|
||||
- **Image-only emails** (single image, no extractable links) are skipped with a warning logged
|
||||
- **"View in browser" emails** — fetches the web version's HTML and extracts links from that instead
|
||||
- Incremental by default: tracks processed Message-IDs in a local state file, only processes new emails
|
||||
- `--full` flag forces reprocessing of all emails
|
||||
|
||||
## Link Extraction & Processing
|
||||
|
||||
### Extraction Pipeline
|
||||
1. Fetch emails from the configured Gmail folder (incremental or full)
|
||||
2. Parse HTML to extract links, section headers, and surrounding text
|
||||
3. Filter out noise links: unsubscribe, social footer icons, "share this newsletter" links
|
||||
4. Unwrap tracking redirects and strip UTM parameters — store only the clean canonical URL
|
||||
5. Merge "Read more" links with their preceding content (detected by: same URL + "read more" anchor text)
|
||||
6. Categorize each link (see Categorization section)
|
||||
7. Write to spreadsheet (see Output section)
|
||||
|
||||
### Noise Filtering
|
||||
The following link types are **excluded** from content sheets:
|
||||
- Unsubscribe links
|
||||
- Social media footer links (Twitter, LinkedIn, etc.)
|
||||
- "Share this newsletter" / "Forward to a friend" links
|
||||
- "View in browser" mirror links (content is extracted from the web version instead)
|
||||
|
||||
Sponsor/ad links are **not filtered** — they go to a separate sheet.
|
||||
|
||||
### URL Handling
|
||||
- Unwrap all tracking redirects (Mailchimp, Substack, etc.)
|
||||
- Strip UTM parameters and other tracking query params
|
||||
- Store only the clean canonical URL
|
||||
- Dead/broken links (4xx/5xx during enrichment) are moved to a separate "Dead Links" sheet
|
||||
|
||||
### "Read More" Merging
|
||||
When two consecutive elements point to the same URL and one has "read more" (or similar) anchor text, they are merged into a single entry combining the preceding description text and the link.
|
||||
|
||||
## Categorization
|
||||
|
||||
### Strategy: Hybrid
|
||||
1. **Primary**: Use the newsletter's own section headers (e.g., "Python", "DevOps", "Career") as categories
|
||||
2. **Fallback**: When section headers aren't available or don't cover a link, use rule-based classification (URL patterns + keywords)
|
||||
3. **Final fallback**: LLM-based categorization when rules don't match
|
||||
|
||||
### Category Taxonomy
|
||||
- **LLM-generated** by default — the model assigns categories based on link content
|
||||
- Built-in base taxonomy shipped with the tool for common dev categories (Python, JavaScript, DevOps, Security, etc.)
|
||||
- User can extend via config with custom categories
|
||||
- LLM is instructed to prefer existing categories and only create new ones when nothing fits
|
||||
|
||||
### LLM Provider Support (BYOK)
|
||||
All providers supported, configurable in `config.yaml`:
|
||||
- **Claude/Anthropic** — Anthropic API
|
||||
- **OpenAI/GPT** — OpenAI API
|
||||
- **Local models** — Ollama, LM Studio
|
||||
- **OpenAI-compatible endpoints** — Mistral, Groq, Together, etc.
|
||||
|
||||
Provider config includes: API key, base URL, model name, and optional parameters.
|
||||
|
||||
### Newsletter Parsing: Plugin System

- Generic HTML parser as the default
- Platform-specific parsers loaded as plugins (detected by URL patterns or email headers)
- **Substack** shipped as the first plugin: uses Substack's predictable HTML structure for more reliable extraction
- Additional parsers can be added as plugins without modifying core logic

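A plugin contract along these lines would satisfy the bullets above. It is only a sketch: the `ParserPlugin` interface, the Substack detection heuristic, and the regex-based generic extraction are assumptions made for illustration, and a real implementation would use a proper HTML parser rather than a regular expression.

```typescript
interface EmailMessage {
  from: string;
  headers: Record<string, string>;
  html: string;
}

interface ParsedLink {
  url: string;
  title: string;
}

interface ParserPlugin {
  name: string;
  matches(email: EmailMessage): boolean;
  parse(email: EmailMessage): ParsedLink[];
}

// Default parser: a naive anchor scan standing in for real HTML parsing.
const genericParser: ParserPlugin = {
  name: "generic",
  matches: () => true,
  parse: (email) => {
    const links: ParsedLink[] = [];
    const anchor = /<a\s[^>]*href="([^"]+)"[^>]*>([\s\S]*?)<\/a>/gi;
    let match: RegExpExecArray | null = anchor.exec(email.html);
    while (match !== null) {
      links.push({ url: match[1], title: match[2].replace(/<[^>]+>/g, "").trim() });
      match = anchor.exec(email.html);
    }
    return links;
  },
};

// Hypothetical detection rule: sender domain or a List-Id header mentioning Substack.
const substackParser: ParserPlugin = {
  name: "substack",
  matches: (email) =>
    /@substack\.com>?\s*$/i.test(email.from) ||
    (email.headers["list-id"] ?? "").toLowerCase().includes("substack"),
  parse: (email) => genericParser.parse(email), // placeholder for Substack-specific logic
};

function selectParser(email: EmailMessage, plugins: ParserPlugin[]): ParserPlugin {
  return plugins.find((plugin) => plugin.matches(email)) ?? genericParser;
}
```

Because `selectParser` falls back to `genericParser`, adding a new platform only means appending another `ParserPlugin` to the list passed in; core logic stays unchanged.
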
## Output: Spreadsheet

### Supported Formats

- **Google Sheets**: via Google Sheets API (live, shareable, auto-updated)
- **Local Excel (.xlsx)**: written to disk, can be uploaded manually

Config selects which output(s) to use; both can be active simultaneously.

### Spreadsheet Name

- Fixed name set in `config.yaml` (e.g., "Newsletter Link Catalog")

### Sheet Naming

- Each newsletter gets its own sheet named after the parsed display name from the email's From header
- Names are truncated to fit the 100-character sheet-name limit in Google Sheets
- Characters that are not valid in a sheet name are replaced

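A sheet-name helper could be sketched as below. The function name `toSheetName`, the `"Untitled"` fallback, and the replacement of invalid characters with spaces are assumptions. The character set follows Excel's worksheet naming rule (`\ / ? * [ ] :` are not allowed); applying the same set to Google Sheets is an assumption made so that one name works for both outputs. Excel also caps worksheet names at 31 characters, so the `.xlsx` writer would likely need a smaller `maxLength` than the 100 used for Sheets.

```typescript
function toSheetName(displayName: string, maxLength = 100): string {
  const cleaned = displayName
    .replace(/[\[\]:*?\/\\]/g, " ") // characters Excel rejects in worksheet names
    .replace(/\s+/g, " ")
    .trim();
  const name = cleaned.length > 0 ? cleaned : "Untitled"; // hypothetical fallback
  return name.slice(0, maxLength).trim();
}
```

For example, `toSheetName("Dev/Ops: Weekly [beta]")` yields `Dev Ops Weekly beta`, and passing `31` as the second argument produces a name that also fits Excel's limit.
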
### Content Sheet Columns

Every row is fully populated (flat table, no blank cells for grouping):

| Column | Description |
|---|---|
| Issue Date | Date from email's Date header (overridable per-newsletter) |
| Category | Assigned category (from newsletter sections, rules, or LLM) |
| Link URL | Clean canonical URL after unwrapping and UTM removal |
| Title | Anchor text / headline from the newsletter |
| Description | 1-2 sentence description from the newsletter (if present) |
| Page Title + Meta | `<title>` and meta description from the destination page (enrichment phase) |
| Source Newsletter | Name of the newsletter this link came from |
| Also In | Cross-reference: other newsletters that also mentioned this link |

### Sponsor Sheet (Consolidated)

Single sheet named "Sponsored Links" containing sponsor/ad links from all newsletters:

| Column | Description |
|---|---|
| Newsletter | Which newsletter this sponsor link appeared in |
| Sponsor | Sponsor name (parsed from newsletter) |
| Link | Sponsor's link URL |
| Description | Sponsor description from the newsletter |

### Dead Links Sheet

Single sheet named "Dead Links" for links that returned errors during enrichment:

| Column | Description |
|---|---|
| URL | The clean canonical URL |
| Status | HTTP status or error type (404, 403, timeout, etc.) |
| Source | Newsletter name |
| Date | Issue date |

### Cross-References

- Duplicates across newsletters are kept in their respective sheets (all occurrences preserved)
- The **Also In** column annotates each row with which other newsletters mentioned the same link and when (e.g., "TLDR Web Dev (Mar 5)")
- This enables finding cross-newsletter coverage without a separate consolidated sheet

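The **Also In** annotation can be derived by grouping occurrences by clean URL, as in this sketch. The `Occurrence` shape, the `buildAlsoIn` name, the ISO `YYYY-MM-DD` input format, and the `"; "` separator between multiple entries are assumptions; only the `Newsletter (Mon D)` entry format comes from the example above.

```typescript
interface Occurrence {
  url: string; // clean canonical URL
  newsletter: string;
  issueDate: string; // assumed ISO format, YYYY-MM-DD
}

const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];

function formatShort(isoDate: string): string {
  const [, month, day] = isoDate.split("-").map(Number);
  return `${MONTHS[month - 1]} ${day}`;
}

// Returns one "Also In" cell per input row, in the same order.
function buildAlsoIn(rows: Occurrence[]): string[] {
  const byUrl = new Map<string, Occurrence[]>();
  for (const row of rows) {
    const list = byUrl.get(row.url) ?? [];
    list.push(row);
    byUrl.set(row.url, list);
  }
  return rows.map((row) =>
    (byUrl.get(row.url) ?? [])
      .filter((other) => other.newsletter !== row.newsletter)
      .map((other) => `${other.newsletter} (${formatShort(other.issueDate)})`)
      .join("; "),
  );
}
```

Rows whose link appears in no other newsletter get an empty string, which still counts as a populated cell for the flat-table rule.
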
### No "All Links" Master Sheet

Only per-newsletter content sheets, plus the consolidated Sponsor and Dead Links sheets. No "All Links" aggregation sheet.

## Enrichment

### Two-Phase Approach

1. **Phase 1 (Store)**: Extract links from newsletters, categorize, and write to spreadsheet with all available in-newsletter metadata
2. **Phase 2 (Enrich)**: Separate pass to fetch each link's destination page for `<title>` and meta description

This keeps the initial run fast and allows enrichment to be run independently.

### Enrichment Details

- Configurable concurrency (safe defaults: 3-5 parallel, 1-2s delay between batches)
- Retries on transient failures
- Dead links (4xx/5xx) moved to Dead Links sheet
- Skip paywalled/auth-required pages (detected by login redirects) and mark them with "paywall" status
- Progress bar shows enrichment status in real-time

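The "N parallel, delay between batches" behaviour could be sketched as below. The names `enrichInBatches` and `sleep` are hypothetical, and the `worker` callback stands in for whatever fetch-and-parse routine the tool uses; retries are left out to keep the example short.

```typescript
function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

async function enrichInBatches<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency = 3,
  delayMs = 1500,
): Promise<PromiseSettledResult<R>[]> {
  const size = Math.max(1, Math.floor(concurrency)); // guard against 0 or fractions
  const results: PromiseSettledResult<R>[] = [];
  for (let i = 0; i < items.length; i += size) {
    const batch = items.slice(i, i + size);
    // allSettled: one failing link must not abort the rest of the batch.
    const settled = await Promise.allSettled(batch.map((item) => worker(item)));
    results.push(...settled);
    if (i + size < items.length) {
      await sleep(delayMs); // pause between batches, not after the last one
    }
  }
  return results;
}
```

Using `Promise.allSettled` lines up with the error-handling rule that an individual link failure is logged and processing continues.
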
### Link Liveness

- Dead links are **not included** in content sheets; they go to the Dead Links sheet
- Paywalled/unreachable links are included in content sheets but flagged in the Page Title + Meta column

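The routing rule above might be expressed as a small mapping from liveness status to destination sheet. Everything named here is hypothetical: the `Liveness` values, the `placeLink` function, the `[unreachable]` marker (the spec only shows `[paywall]`), and the `" - "` separator between page title and meta description.

```typescript
type Liveness = "ok" | "dead" | "paywall" | "unreachable";

interface EnrichedLink {
  url: string;
  liveness: Liveness;
  pageTitle?: string;
  metaDescription?: string;
}

interface Placement {
  sheet: "content" | "dead-links";
  pageTitleMeta: string; // value for the "Page Title + Meta" column
}

function placeLink(link: EnrichedLink): Placement {
  switch (link.liveness) {
    case "dead":
      return { sheet: "dead-links", pageTitleMeta: "" };
    case "paywall":
      return { sheet: "content", pageTitleMeta: "[paywall]" };
    case "unreachable":
      return { sheet: "content", pageTitleMeta: "[unreachable]" }; // assumed marker
    default: {
      const parts = [link.pageTitle, link.metaDescription].filter(
        (part): part is string => typeof part === "string" && part.length > 0,
      );
      return { sheet: "content", pageTitleMeta: parts.join(" - ") };
    }
  }
}
```

How a fetch result is classified into one of these statuses (for example, which redirects count as a login wall) is a separate decision that this sketch does not cover.
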
## Processing Model

### Incremental Processing

- Local state file (JSON) tracks processed Message-IDs and enrichment status
- On subsequent runs, only new/unprocessed emails are fetched
- `--full` flag forces reprocessing of all emails
- State file location: `~/.nlc/state.json` (or configured path)

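As an illustration of the incremental logic, the state file could carry a shape like the one below. The field names (`processedMessageIds`, `enrichment`) and both helper functions are assumptions; the spec only says the file is JSON and tracks Message-IDs and enrichment status.

```typescript
type EnrichmentStatus = "pending" | "done" | "dead" | "paywall" | "unreachable";

interface NlcState {
  processedMessageIds: string[];
  enrichment: Record<string, EnrichmentStatus>; // keyed by clean URL
}

// Returns the Message-IDs that still need processing; `full` mirrors the --full flag.
function selectUnprocessed(messageIds: string[], state: NlcState, full = false): string[] {
  if (full) return [...messageIds];
  const seen = new Set(state.processedMessageIds);
  return messageIds.filter((id) => !seen.has(id));
}

// Returns a new state object with the given IDs recorded, without duplicates.
function markProcessed(state: NlcState, messageIds: string[]): NlcState {
  const merged = new Set([...state.processedMessageIds, ...messageIds]);
  return { ...state, processedMessageIds: Array.from(merged) };
}
```

Both helpers are pure, so the state can be serialized with `JSON.stringify` after a run succeeds, and a dry run can simply skip the write.
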
### Date Filtering

- `--from YYYY-MM-DD` and `--to YYYY-MM-DD`: absolute date range
- `--last N`: relative date range, where N is a day count with a `d` suffix (e.g., `--last 30d`, `--last 7d`)
- Can be combined with incremental processing

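Resolving these flags into a concrete range might look like the following sketch. The names `resolveDateRange` and `parseDay` are hypothetical. Two behaviours are assumptions the spec does not settle: dates are interpreted as UTC midnight, and `--last` is rejected when combined with `--from` or `--to`.

```typescript
interface DateRange {
  from?: Date;
  to?: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

function parseDay(value: string): Date {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) {
    throw new Error(`Expected YYYY-MM-DD, got: ${value}`);
  }
  const date = new Date(`${value}T00:00:00Z`); // assumption: UTC midnight
  if (Number.isNaN(date.getTime())) {
    throw new Error(`Invalid date: ${value}`);
  }
  return date;
}

function resolveDateRange(
  opts: { from?: string; to?: string; last?: string },
  now: Date = new Date(),
): DateRange {
  if (opts.last !== undefined) {
    if (opts.from !== undefined || opts.to !== undefined) {
      throw new Error("--last cannot be combined with --from/--to"); // assumed rule
    }
    const match = /^(\d+)d$/.exec(opts.last);
    if (match === null) {
      throw new Error(`Invalid --last value: ${opts.last}`);
    }
    return { from: new Date(now.getTime() - Number(match[1]) * DAY_MS), to: now };
  }
  return {
    from: opts.from !== undefined ? parseDay(opts.from) : undefined,
    to: opts.to !== undefined ? parseDay(opts.to) : undefined,
  };
}
```

Passing `now` as a parameter keeps the function deterministic in tests.
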
### Dry Run

- `--dry-run [N]` processes the most recent N emails (default: 5) without writing to the spreadsheet
- Shows what would be extracted, categorized, and written
- Useful for testing config changes and parser tweaks

### Error Handling

- **Critical errors** (Gmail auth failure, spreadsheet write failure, config errors) → stop execution
- **Individual errors** (one link fails to enrich, one email fails to parse) → log and continue
- Summary at end includes error counts and details

### Progress & Logging

- Progress bar during processing (emails fetched, links extracted, enrichment status)
- Summary stats at the end: newsletters processed, links extracted, duplicates found, dead links, sponsors, errors

## CLI Interface

### Commands

```
nlc init              # Interactive setup: OAuth, config file, connectivity test
nlc run [flags]       # Main processing command
```

### `nlc run` Flags

| Flag | Description | Default |
|---|---|---|
| `--full` | Reprocess all emails, not just new ones | false |
| `--dry-run [N]` | Process most recent N emails without writing to sheet | 5 |
| `--from YYYY-MM-DD` | Process emails from this date | (none) |
| `--to YYYY-MM-DD` | Process emails up to this date | (none) |
| `--last N` | Process emails from the last N days, written with a `d` suffix (e.g., `--last 30d`) | (none) |
| `--skip-enrich` | Skip the enrichment phase (only extract + categorize) | false |
| `--enrich-only` | Only run enrichment on already-extracted links | false |
| `--config PATH` | Path to config file | `./config.yaml` |
| `--verbose` | Detailed per-email and per-link output | false |

## Configuration

### File Format: YAML

Location: `./config.yaml` (overridable with `--config`)

### Sample Structure

```yaml
# Gmail settings
gmail:
  folder: "Newsletters"  # Gmail label/folder to process
  credentials: "~/.nlc/gmail-credentials.json"
  token: "~/.nlc/gmail-token.json"

# Output settings
output:
  name: "Newsletter Link Catalog"  # Spreadsheet name
  sheets_api:
    enabled: true
    credentials: "~/.nlc/sheets-credentials.json"
    token: "~/.nlc/sheets-token.json"
  excel:
    enabled: true
    path: "./output/newsletter-catalog.xlsx"

# Newsletter identification
newsletters:
  # Manual overrides for parsed display names
  # sender_pattern: "display_name"
  "alex@bytebytego.com": "ByteByteGo"
  "dan@techtakesweekly.com": "Tech Takes Weekly"

# Link processing
links:
  unwrap_redirects: true
  strip_utm: true
  filter_unsubscribe: true
  filter_social_footer: true
  filter_share_links: true
  merge_read_more: true

# Categorization
categories:
  # Built-in taxonomy is used by default; extend here
  custom:
    - "AI/ML"
    - "Career"
    - "Rust"
  # LLM settings for category inference
  llm:
    provider: "anthropic"  # anthropic | openai | local | openai-compatible
    model: "claude-sonnet-4-6"
    api_key_env: "ANTHROPIC_API_KEY"  # or set in env
    base_url: null  # for local/openai-compatible
    fallback_to_rules: true  # if LLM fails, use rule-based

# Enrichment
enrichment:
  enabled: true
  concurrency: 3
  delay_ms: 1500
  retries: 2
  timeout_ms: 10000

# Rate limiting (applies to both Gmail API and enrichment)
rate_limit:
  gmail_qps: 5  # queries per second to Gmail API
  link_concurrency: 3  # parallel link fetches

# State
state_file: "~/.nlc/state.json"

# Parsing plugins
plugins:
  substack:
    enabled: true
```

### Issue Date Override

For newsletters where the email arrival date doesn't match the issue date, overrides can be configured:

```yaml
newsletters:
  "sender@domain.com":
    display_name: "Newsletter Name"
    date_override: "subject"  # Parse date from subject line
    date_format: "%B %d, %Y"  # Expected date format in subject
```

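Since JavaScript has no built-in parser for `strftime`-style formats such as `%B %d, %Y`, the subject-line override needs a small translator. The sketch below is one possible approach under stated assumptions: the `parseSubjectDate` name is hypothetical, only the `%B`, `%d`, `%m`, and `%Y` directives are handled, month names are English, and the result is a UTC date.

```typescript
const MONTH_NAMES = [
  "january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december",
];

function parseSubjectDate(subject: string, format: string): Date | null {
  const fields: string[] = [];
  // Escape regex metacharacters in the literal parts, then turn each
  // supported directive into a capture group.
  const pattern = format
    .replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
    .replace(/%([BdmY])/g, (_match: string, directive: string) => {
      fields.push(directive);
      if (directive === "B") return "([A-Za-z]+)";
      if (directive === "Y") return "(\\d{4})";
      return "(\\d{1,2})";
    });
  const match = new RegExp(pattern).exec(subject);
  if (match === null) return null;

  let year: number | undefined;
  let month: number | undefined; // zero-based
  let day: number | undefined;
  fields.forEach((directive, index) => {
    const value = match[index + 1];
    if (directive === "Y") year = Number(value);
    else if (directive === "d") day = Number(value);
    else if (directive === "m") month = Number(value) - 1;
    else month = MONTH_NAMES.indexOf(value.toLowerCase());
  });
  if (year === undefined || month === undefined || day === undefined) return null;
  if (month < 0 || month > 11 || day < 1 || day > 31) return null;
  return new Date(Date.UTC(year, month, day));
}
```

With the sample format, a subject such as `Weekly Digest: March 5, 2025` resolves to 5 March 2025, while a subject with no matching date returns `null` so the caller can fall back to the email's Date header.
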
## Data Flow

```
┌─────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Gmail API  │────▶│  Parse HTML  │────▶│  Categorize  │────▶│  Write Sheet │
│  (fetch)    │     │  + Extract   │     │  (hybrid)    │     │  (Phase 1)   │
└─────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
                           │                                         │
                           ▼                                         ▼
                    ┌──────────────┐                          ┌──────────────┐
                    │  State File  │                          │  Enrichment  │
                    │  (processed  │                          │  (Phase 2)   │
                    │   tracking)  │                          │  Page titles │
                    └──────────────┘                          └──────────────┘
```

## Edge Cases

| Scenario | Behavior |
|---|---|
| Email is a single image with no links | Skip with warning, log to state |
| "View in browser" link instead of content | Fetch web version HTML, extract links from that |
| Same link in multiple newsletters | Keep all occurrences, cross-reference via "Also In" column |
| Same link multiple times in one issue | Deduplicate per-issue; single row per unique URL |
| Link returns 4xx/5xx during enrichment | Move to Dead Links sheet |
| Link is paywalled/auth-required | Include in content sheet, mark Page Title as "[paywall]" |
| Newsletter name > 100 chars | Truncate for sheet name |
| Sheet already exists for newsletter | Append new rows, don't overwrite existing data |
| Gmail API rate limit | Retry with exponential backoff |
| OAuth token expired | Auto-refresh, re-prompt if refresh fails |
| Newsletter format changes | Parser falls back to generic HTML extraction |

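The "retry with exponential backoff" behaviour from the table can be sketched generically. The `withBackoff` name, the default retry count, and the base delay are assumptions; the `isRetryable` predicate is a placeholder for whatever check identifies a rate-limit response from the Gmail client in use.

```typescript
async function withBackoff<T>(
  operation: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  maxRetries = 5,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation();
    } catch (error) {
      // Give up on non-retryable errors or once the retry budget is spent.
      if (attempt >= maxRetries || !isRetryable(error)) throw error;
      const delayMs = baseDelayMs * 2 ** attempt; // 1s, 2s, 4s, ...
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A production version would typically add random jitter to the delay so that parallel callers do not retry in lockstep.
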
## Setup & First Run

1. **`nlc init`**: Interactive walkthrough:
   - Authenticate with Gmail (OAuth browser flow)
   - Authenticate with Google Sheets (if using Sheets output)
   - Select the Gmail folder/label to process
   - Configure output location
   - Test connectivity
   - Generate `config.yaml`

2. **`nlc run --dry-run`**: Test with the 5 most recent emails

3. **`nlc run`**: Full processing run

4. **`nlc run --enrich-only`**: Enrich previously extracted links with page titles

## Future Considerations

These are **not** in scope for v1 but noted for potential future work:
- Search/filter functionality within the spreadsheet
- Web UI for browsing the catalog
- Email forwarding as an alternative to Gmail API access
- Automatic category taxonomy refinement based on accumulated data
- Additional parser plugins for newsletter platforms beyond Substack
- Notification on new newsletter processing