Newsletter-Link-Catalog/notes/PROMPT.md

/goal

<task>
  You are an autonomous senior engineer working in:
  C:\Users\ksolo\Projects\Misc Projects\Newletter Link Catalog

  Implement the Newsletter Link Catalog CLI described in SPEC.md end-to-end.

  The expected product is a TypeScript/Node.js CLI named `nlc` with:
  - `nlc init`
  - `nlc run [flags]`
  - Gmail OAuth browser auth and local token persistence
  - config-driven Gmail label/folder processing
  - HTML newsletter parsing and link extraction
  - noise filtering, tracking URL cleanup, redirect unwrapping, read-more merging
  - hybrid categorization using section headers, rules, and optional LLM providers
  - parser plugin architecture with a generic parser and Substack plugin
  - Google Sheets and local `.xlsx` outputs
  - incremental state tracking in JSON
  - enrichment pass for page title/meta, dead-link handling, paywall/unreachable markers
  - dry-run, date filtering, full reprocess, skip-enrich, enrich-only, config, and verbose flags
  - standalone binary build script and documentation for the selected bundling tool
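
  The incremental-state and dry-run items above can be read as two small pure functions. This is a minimal sketch, not the SPEC.md design: `RunState`, `selectUnprocessed`, and `recordProcessed` are hypothetical names, and tracking by message ID is an assumption.

```typescript
// Hypothetical state shape; the real JSON layout should follow SPEC.md.
interface RunState {
  processedMessageIds: string[];
}

// Keep only the messages that no earlier run has recorded.
function selectUnprocessed(messageIds: string[], state: RunState): string[] {
  const seen = new Set(state.processedMessageIds);
  return messageIds.filter((id) => !seen.has(id));
}

// Return the next state. A dry run hands back the original state untouched,
// so nothing new would be persisted to disk.
function recordProcessed(state: RunState, messageIds: string[], dryRun: boolean): RunState {
  if (dryRun) return state;
  const merged = new Set([...state.processedMessageIds, ...messageIds]);
  return { processedMessageIds: Array.from(merged) };
}
```

  Keeping these functions free of file I/O lets the tests cover incremental behavior and dry-run suppression without touching the real state file.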

  Use SPEC.md as the source of truth. If existing code conflicts with SPEC.md, prefer SPEC.md unless repo-local instructions explicitly require otherwise.

  The repository-level working agreements mention PHP tooling, but this project spec is TypeScript/Node.js. Apply the relevant JS quality rules: ESLint with the airbnb-base config, Prettier, tests, secure input/output handling, and CI-style validation. Do not add PHP tooling unless PHP files already exist and require it.
</task>

<goal>
  Build a production-quality CLI that meets the SPEC.md requirements, adheres to the working agreements, and can be used by the repository owner to catalog newsletter links from Gmail into Google Sheets with confidence in correctness, safety, and maintainability.
</goal>

<default_follow_through_policy>
  Default to the most reasonable low-risk interpretation and keep going.
  Only stop to ask when a missing detail affects correctness, safety, or external credentials, or would trigger an irreversible action.
  When external services or credentials are unavailable, implement the integration boundary, tests, mocks, and clear setup docs instead of blocking.
</default_follow_through_policy>

<completeness_contract>
  Persist until the task is fully handled end-to-end within the current turn whenever feasible: do not stop at analysis or partial fixes.
  Treat the task as incomplete until every major SPEC.md behavior is either implemented, tested, and documented, or explicitly marked [blocked] with evidence.
  Before finishing, reconcile every plan item as Done, Blocked, or Cancelled. Never leave an item in progress.
  Do not claim completion until validation has run and failures are fixed or explained with concrete blocker evidence.
</completeness_contract>

<missing_context_gating>
  Read SPEC.md and inspect the repository before planning implementation.
  Do not guess repository structure, package manager, test framework, or existing scripts. Retrieve them with tools.
  If the repo is empty or nearly empty, scaffold a TypeScript CLI project using npm unless an existing package manager is present.
  If credentials, live Gmail, Google Sheets, or LLM API keys are missing, use mocks/fakes for automated tests and document the required environment variables and setup.
</missing_context_gating>

<tool_persistence_rules>
  Prefer dedicated tools over raw shell where available: rg, read_file/list_dir equivalents, apply_patch, and update_plan.
  Use rg or rg --files for search.
  Parallelize independent file reads; sequence dependent actions.
  Use apply_patch for manual edits.
  Keep using tools until you have enough evidence to finish confidently.
</tool_persistence_rules>

<implementation_requirements>
  Implement a clean modular architecture, with separate modules for:
  - CLI command parsing
  - config loading and validation
  - Gmail OAuth/auth/client access
  - Gmail message fetching by configured label
  - HTML parsing and extraction
  - noise filtering
  - URL normalization, redirect unwrapping, and tracking parameter stripping
  - categorization
  - LLM provider adapters
  - parser plugins
  - spreadsheet writers
  - enrichment
  - state persistence
  - logging/progress reporting
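
  As one illustration of these module boundaries, parser plugin selection could look roughly like the sketch below. Every name here (`EmailMessage`, `ParserPlugin`, `selectParser`) is hypothetical, the `parse` bodies are stubs, and matching Substack by sender domain is an assumed heuristic rather than something SPEC.md states.

```typescript
// Hypothetical shapes; none of these names come from SPEC.md.
interface EmailMessage {
  from: string;
  html: string;
}

interface ExtractedLink {
  url: string;
  text: string;
}

interface ParserPlugin {
  name: string;
  // True when this plugin knows how to handle the message.
  matches(message: EmailMessage): boolean;
  parse(message: EmailMessage): ExtractedLink[];
}

// Fallback parser: accepts every message. Extraction is stubbed out here.
const genericParser: ParserPlugin = {
  name: "generic",
  matches: () => true,
  parse: () => [],
};

// Assumed heuristic: the sender address ends in substack.com or a subdomain.
const substackParser: ParserPlugin = {
  name: "substack",
  matches: (message) => /@(?:[\w-]+\.)*substack\.com>?$/i.test(message.from.trim()),
  parse: () => [],
};

// The first specific plugin that matches wins; otherwise use the generic parser.
function selectParser(message: EmailMessage, plugins: ParserPlugin[]): ParserPlugin {
  return plugins.find((plugin) => plugin.matches(message)) ?? genericParser;
}
```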

  Implement provider adapters for:
  - Anthropic
  - OpenAI
  - local endpoints in the style of Ollama or LM Studio
  - OpenAI-compatible endpoints
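
  The four adapters above could share one small interface so that categorization never depends on a specific vendor. The sketch below is an assumption about how that might be shaped: `LlmProvider`, `FakeProvider`, and `pickLabel` are hypothetical names, and no real provider SDK is called.

```typescript
// Hypothetical request shape; SPEC.md may define a different contract.
interface CategorizationRequest {
  url: string;
  title: string;
  categories: string[];
}

// Each adapter (Anthropic, OpenAI, local, OpenAI-compatible) would implement this.
interface LlmProvider {
  readonly name: string;
  categorize(request: CategorizationRequest): Promise<string>;
}

// Stand-in provider for automated tests: no network access, fixed answer.
class FakeProvider implements LlmProvider {
  readonly name = "fake";

  constructor(private readonly answer: string) {}

  async categorize(_request: CategorizationRequest): Promise<string> {
    return this.answer;
  }
}

// Guard applied to any provider's raw answer: accept it only when it is one
// of the allowed categories (case-insensitive), otherwise use the fallback.
function pickLabel(raw: string, allowed: string[], fallback: string): string {
  const wanted = raw.trim().toLowerCase();
  return allowed.find((category) => category.toLowerCase() === wanted) ?? fallback;
}
```

  Validating the returned label against the configured categories keeps a misbehaving model from inventing new sheet names.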

  Implement output writers for:
  - Google Sheets API
  - local Excel `.xlsx`
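
  Both writers need worksheet names that the target accepts. The sketch below follows Excel's worksheet rules (at most 31 characters, none of `\ / ? * [ ] :`, no leading or trailing apostrophe); applying the same rule to Google Sheets is an assumption, and `sanitizeSheetName` and its `fallback` parameter are hypothetical.

```typescript
// Excel's limit on worksheet name length.
const MAX_SHEET_NAME_LENGTH = 31;
// Characters Excel rejects in worksheet names: \ / ? * [ ] :
const FORBIDDEN_SHEET_CHARS = /[\\\/?*\[\]:]/g;

function sanitizeSheetName(raw: string, fallback = "Sheet"): string {
  const cleaned = raw
    .replace(FORBIDDEN_SHEET_CHARS, " ") // swap forbidden characters for spaces
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim()
    .replace(/^'+|'+$/g, "") // names may not start or end with an apostrophe
    .trim();
  // Truncate last, then trim again in case the cut landed on a space.
  const truncated = cleaned.slice(0, MAX_SHEET_NAME_LENGTH).trim();
  return truncated.length > 0 ? truncated : fallback;
}
```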

  Implement tests for core behavior without requiring live external services:
  - config validation
  - date filter conflict handling
  - sheet-name sanitization/truncation
  - URL cleanup and tracking parameter stripping
  - read-more link merging
  - noise filtering
  - sponsor detection
  - section-header categorization
  - fallback rule categorization
  - state-file incremental behavior
  - dead/paywall/unreachable enrichment handling
  - dry-run state/write suppression
  - parser plugin selection, including Substack
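
  To make one of these test targets concrete, tracking parameter stripping can be written as a pure function over the standard `URL` API. This is a sketch under assumptions: the deny-list below is illustrative, and SPEC.md may name a different set of parameters.

```typescript
// Assumed deny-list for illustration; SPEC.md may specify other parameters.
const TRACKING_PARAM_PREFIXES = ["utm_"];
const TRACKING_PARAM_NAMES = new Set(["fbclid", "gclid", "mc_cid", "mc_eid"]);

function isTrackingParam(name: string): boolean {
  const lower = name.toLowerCase();
  return (
    TRACKING_PARAM_NAMES.has(lower) ||
    TRACKING_PARAM_PREFIXES.some((prefix) => lower.startsWith(prefix))
  );
}

function stripTrackingParams(rawUrl: string): string {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    // Leave unparseable input untouched rather than failing the whole email.
    return rawUrl;
  }
  // Collect names first so nothing is deleted while iterating.
  const names: string[] = [];
  url.searchParams.forEach((_value, name) => {
    names.push(name);
  });
  for (const name of names) {
    if (isTrackingParam(name)) url.searchParams.delete(name);
  }
  return url.toString();
}
```

  A matching unit test would assert that `https://example.com/post?utm_source=nl&id=7&fbclid=abc` becomes `https://example.com/post?id=7`.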
</implementation_requirements>

<action_safety>
  Keep changes tightly scoped to building this CLI.
  Avoid unrelated refactors, renames, or cleanup.
  Do not run destructive git commands such as `git reset --hard` or `git checkout --` without explicit approval.
  Never commit secrets, tokens, OAuth credentials, spreadsheet IDs, or user data.
  Persist tokens only in documented local paths such as ~/.nlc.
  Sanitize config and CLI inputs. Escape or safely encode spreadsheet cell values that could become formulas.
  Handle critical errors by stopping with a useful message. Handle individual email/link failures by logging and continuing.
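
  The formula-escaping rule above could be sketched as follows. It uses the common spreadsheet-injection mitigation of prefixing a single quote to values that begin with a formula trigger character; `escapeCellValue` is a hypothetical helper name, and the trigger list is an assumption that SPEC.md may refine.

```typescript
// Leading characters that spreadsheet applications may interpret as the
// start of a formula or command.
const FORMULA_TRIGGERS = ["=", "+", "-", "@", "\t", "\r"];

// Prefix risky values with a single quote so they are treated as plain text.
function escapeCellValue(value: string): string {
  const risky = FORMULA_TRIGGERS.some((trigger) => value.startsWith(trigger));
  return risky ? `'${value}` : value;
}
```

  For the Google Sheets writer, sending values with the API's `valueInputOption` set to `RAW` is another way to keep them from being parsed as formulas; which approach fits best depends on how SPEC.md wants links rendered.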
</action_safety>

<verification_loop>
  Required validations:
  - npm install
  - npm run lint
  - npm run format:check
  - npm run typecheck
  - npm test
  - npm run build
  - npm run smoke

  If the package scripts do not exist yet, create them.
  `npm run build` must compile the TypeScript project and produce the standalone binary or packaged executable artifact described in the docs.
  `npm run smoke` must exercise the CLI without live credentials, at minimum:
  - `nlc --help`
  - `nlc init --help`
  - `nlc run --help`
  - a dry-run or fixture-backed run path that proves parsing/output orchestration works without mutating real Gmail or Sheets.
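
  The help checks above could be driven by a tiny harness with an injectable runner, so the harness itself stays testable. This is only a sketch: `Runner`, `SMOKE_COMMANDS`, and `runSmoke` are hypothetical names, and the fixture-backed run is left out because its flags depend on SPEC.md.

```typescript
// A runner executes `nlc` with the given arguments and returns its exit code.
// In a real script this would wrap a child process; here it is injected.
type Runner = (args: string[]) => number;

const SMOKE_COMMANDS: string[][] = [["--help"], ["init", "--help"], ["run", "--help"]];

// Run every smoke command and return the ones that exited non-zero.
function runSmoke(run: Runner): string[] {
  return SMOKE_COMMANDS.filter((args) => run(args) !== 0).map(
    (args) => `nlc ${args.join(" ")}`,
  );
}
```

  An empty result means every help command exited cleanly; a non-empty result lists the failing invocations for the final report.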

  Before finalizing, run the required validations.
  If a check fails, fix the cause and rerun until green or until a real external blocker remains.
  Report any unavailable live-service validation separately from local automated validation.
</verification_loop>

<progress_updates>
  For long work, give brief progress updates after meaningful milestones:
  - repo inspection complete
  - implementation plan formed
  - core modules scaffolded
  - tests added
  - validations running
  - final validation result
  Keep updates concise and outcome-based.
</progress_updates>

<structured_output_contract>
  Structure the final report in exactly this order:
  1. Summary: 2-4 bullets describing what was built.
  2. Changed files: one line per important file or directory.
  3. Validations: each command run and its result.
  4. Blockers or residual risks: include only real remaining issues.
  5. Next operational steps: credential/setup steps needed for live Gmail or Google Sheets use.

  Keep the final report compact and highest-signal first.
</structured_output_contract>