vt-c-web-capture¶

Capture a web article as markdown and save it to the toolkit intake inbox for later qualification and evaluation.

Plugin: core-standards
Category: Other
Command: /vt-c-web-capture

/vt-c-web-capture — Web Article Capture¶

Fetch a web article by URL, extract metadata, preview for approval, and save as markdown to the toolkit intake inbox.

When to Use¶

You found an interesting article about Claude Code patterns, MCP servers, AI workflows, or related topics
You want to capture external knowledge for later qualification via /vt-c-inbox-qualify
You need a fast path from "this is useful" to "it's in the inbox"

Invocation¶

/vt-c-web-capture <URL>

Example:

/vt-c-web-capture https://docs.anthropic.com/en/docs/claude-code/overview

Prerequisites¶

The inbox directory must exist: TOOLKIT_ROOT/intake/inbox/
Optional: Playwright installed for Cloudflare-protected sites: npx playwright install chromium

Execution Instructions¶

Step 1: Validate Input¶

Check if a URL argument was provided.

If no URL argument:

Usage: /vt-c-web-capture <URL>

Provide the full URL of the article to capture.
Example: /vt-c-web-capture https://example.com/interesting-article

Exit without further action.

Check that the URL starts with http:// or https://.

If malformed URL:

Invalid URL format. Provide a full URL starting with http:// or https://
Example: /vt-c-web-capture https://example.com/article

Exit without further action.

Step 2: Fetch Article (Tiered Fallback)¶

This step uses a three-tier fallback chain to maximize capture success:

Tier 1: WebFetch (built-in, fast, works for most sites)
  ↓ fails (403, Cloudflare, empty)
Tier 2: Playwright (real browser, bypasses Cloudflare)
  ↓ not installed or fails
Tier 3: Clipboard paste (user copies from browser)

Tier 1: WebFetch¶

Use WebFetch to retrieve the article content:

WebFetch:
  url: <the provided URL>
  prompt: |
    Extract the following from this page:
    1. TITLE: The article title (from <title>, <h1>, or og:title)
    2. AUTHOR: The author name (from byline, author meta tag, or "Unknown")
    3. BODY: The full article content as clean markdown. Preserve headings,
       code blocks, lists, and links. Remove navigation, ads, footers,
       and sidebar content.

    Return in this exact format:
    TITLE: [title here]
    AUTHOR: [author here]
    BODY:
    [full article markdown here]

Check the response.

If WebFetch succeeds (returns article content with a body longer than a few sentences): Continue to Step 3.

If WebFetch reports a cross-host redirect: Report the redirect to the user and make a second WebFetch request to the redirect URL.

Non-article URLs (GitHub repos, API docs index pages, etc.): Accept whatever WebFetch returns. The content may be less structured but is still useful as reference material.

If WebFetch fails (403 error, Cloudflare challenge page, empty response, or body contains only "Just a moment..." / challenge HTML):

WebFetch blocked (likely Cloudflare protection). Trying Playwright...

Proceed to Tier 2.

Tier 2: Playwright¶

Check if Playwright is available:

npx playwright --version 2>/dev/null

If Playwright is not installed:

Playwright not available. Install with: npx playwright install chromium
Falling back to clipboard capture...

Skip to Tier 3.

If Playwright is available, fetch the page with a real browser:

npx playwright test --browser chromium -c - <<'SCRIPT'
const { chromium } = require('playwright');
(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(process.env.TARGET_URL, { waitUntil: 'networkidle', timeout: 30000 });
  // Wait for content to render (Cloudflare challenge, lazy loading)
  await page.waitForTimeout(3000);
  const title = await page.title();
  const content = await page.evaluate(() => {
    // Try to find the article body
    const article = document.querySelector('article') || document.querySelector('[role="main"]') || document.body;
    return article.innerText;
  });
  console.log('TITLE: ' + title);
  console.log('AUTHOR: Unknown');
  console.log('BODY:');
  console.log(content);
  await browser.close();
})();
SCRIPT

IMPORTANT: The above is a conceptual reference. In practice, run the Playwright fetch as a single Bash command. Use node -e with an inline script:

TARGET_URL="<the URL>" node -e "
const { chromium } = require('playwright');
(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(process.env.TARGET_URL, { waitUntil: 'networkidle', timeout: 30000 });
  await page.waitForTimeout(3000);
  const title = await page.title();
  const content = await page.evaluate(() => {
    const el = document.querySelector('article') || document.querySelector('[role=\"main\"]') || document.body;
    return el.innerText;
  });
  console.log('TITLE: ' + title);
  console.log('AUTHOR: Unknown');
  console.log('BODY:');
  console.log(content);
  await browser.close();
})();
"

Check the Playwright output.

If Playwright succeeds (returns content with a body): Continue to Step 3 using the Playwright output.

If Playwright fails (timeout, crash, still blocked):

Playwright fetch failed. Falling back to clipboard capture...

Proceed to Tier 3.

Tier 3: Clipboard Paste¶

When both WebFetch and Playwright fail, fall back to manual content capture.

Display:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Clipboard Capture Mode
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Automated fetch failed for this URL. You can still capture the
article by pasting its content.

Instructions:
1. Open the article in your browser
2. Select all article text (Cmd+A or select the article body)
3. Copy it (Cmd+C)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Read the clipboard content via Bash:
```
pbpaste
```
Check the clipboard content.

If clipboard is empty or contains non-article content (< 50 words): Use AskUserQuestion: - Try again — User copies the content and we re-read clipboard - Cancel — Abort capture

If clipboard has content: Use it as the article body. The URL is already known from the argument. Title and author will be extracted from the pasted content (or use fallbacks from Step 3) in the next step.

Display:
```
Captured N words from clipboard.
```
Continue to Step 3.

Step 3: Extract Metadata¶

From the WebFetch response:

Title: Parse from the TITLE: line in the response. If no clear title is found, use the domain name + URL path as a fallback (e.g., docs.anthropic.com — claude-code overview).
Author: Parse from the AUTHOR: line. Default to "Unknown" if not found.
Tags: Generate 3-5 topic tags from the article content. Use lowercase, hyphenated terms relevant to the toolkit (e.g., claude-code, mcp-servers, skill-patterns, prompt-engineering).
Word count: Count words in the article body.
Filename: Generate in format YYYY-MM-DD {slug}.md where:
YYYY-MM-DD is today's date
{slug} is a brief, lowercased, hyphenated version of the title
Maximum 50 characters for the slug portion
Truncate at a word boundary (do not cut words in half)

Step 4: Check for Duplicates¶

Use Glob to check for existing files in the inbox with the same date prefix:
```
Glob: pattern="intake/inbox/YYYY-MM-DD*" path="TOOLKIT_ROOT"
```
If a file with a very similar slug already exists:

Use AskUserQuestion: - Overwrite — Replace the existing file - Rename — Append a suffix (e.g., -2) to the filename - Cancel — Abort without writing

Step 5: Show Preview¶

Display the extracted metadata for user review:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Article Preview
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Title:    [extracted title]
Author:   [extracted author]
Tags:     [tag1, tag2, tag3, tag4]
Words:    [word count]
Filename: [proposed filename]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

If word count is less than 200:

Warning: Article body is very short (N words). This may indicate
paywall-restricted content or a page that didn't render fully.

Step 6: User Confirmation¶

Use AskUserQuestion: - Save — Write the file as previewed - Edit — Adjust title, tags, or filename before saving - Cancel — Abort without writing

If user chooses "Edit": Use AskUserQuestion to ask what to change: - Title — Enter a new title (regenerates the filename slug) - Tags — Enter new comma-separated tags - Filename — Enter a custom filename

After edits, re-display the updated preview and ask for confirmation again.

If user chooses "Cancel":

Capture cancelled. No file written.

Exit without further action.

Step 7: Write File¶

First, verify the inbox directory exists:

Glob: pattern="intake/inbox/" path="TOOLKIT_ROOT"

If inbox directory does not exist:

Error: Inbox not found at TOOLKIT_ROOT/intake/inbox/
Is V025-claude-toolkit checked out at the expected path?

Exit without writing.

Compose the file with YAML frontmatter and article body:

---
title: "[article title]"
author: "[author name]"
date: YYYY-MM-DD
url: "[original URL]"
tags: [tag1, tag2, tag3]
source: "web-capture"
---

[article body as markdown]

Write to the absolute path:

Write: TOOLKIT_ROOT/intake/inbox/[filename]

Verify the file was created:

Glob: pattern="intake/inbox/[filename]" path="TOOLKIT_ROOT"

Step 8: Summary¶

Display confirmation and next steps:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Article Captured
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Saved: intake/inbox/[filename]

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
NEXT STEPS:
• Qualify for intake: /vt-c-inbox-qualify
• Capture another:    /vt-c-web-capture <URL>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Edge Cases Reference¶

ID	Scenario	Handling
EC-001	Paywalled article (short content)	Warn about low word count in preview (Step 5), let user decide
EC-002	URL returns 404 or no content	"URL returned no content" error, exit (Step 2)
EC-003	No URL argument provided	Display usage help, exit (Step 1)
EC-004	Malformed URL (missing protocol)	Format hint, exit (Step 1)
EC-005	Cross-host redirect	Report redirect, follow with second WebFetch (Step 2)
EC-006	Very long title (>100 chars)	Truncate slug to 50 chars at word boundary (Step 3)
EC-007	Duplicate filename in inbox	Warn, offer overwrite/rename/cancel (Step 4)
EC-008	Inbox directory missing	Error with path check hint, exit (Step 7)
EC-009	No discernible article title	Domain + path fallback slug (Step 3)
EC-010	Non-article URL (repo, docs index)	Accept content as-is (Step 2)
EC-011	Cloudflare/WAF blocks WebFetch (403)	Tiered fallback: Playwright → Clipboard (Step 2)
EC-012	Playwright not installed	Skip to clipboard fallback with install hint (Step 2 Tier 2)
EC-013	Clipboard empty or non-article	Ask user to try again or cancel (Step 2 Tier 3)

Integration Points¶

Skill	Relationship
`/vt-c-inbox-qualify`	Triages captured articles into knowledge library or intake proposals
`/vt-c-content-evaluate`	Deep evaluation of captured articles against toolkit inventory
`/vt-c-research-ingest`	Also scans `intake/inbox/` as a source

Troubleshooting¶

Cloudflare-blocked sites (Medium, Substack, etc.)¶

The skill automatically falls through the tiered fetch chain: 1. WebFetch tries first (fast, works for most sites) 2. Playwright tries second if WebFetch gets a 403 (real browser, bypasses Cloudflare) 3. Clipboard paste as last resort (user copies from their browser)

To enable Tier 2, install Playwright once:

npx playwright install chromium

WebFetch returns garbled content¶

Some sites render poorly through WebFetch. Options: 1. Try the article's reader-mode or print URL if available 2. Let the fallback chain proceed to Playwright (uses a real browser) 3. Accept partial content — it's still searchable in the knowledge library

"Inbox not found" error¶

The inbox path is absolute: TOOLKIT_ROOT/intake/inbox/

If V025-claude-toolkit is checked out at a different location, the path in this skill needs updating.

Open Brain Capture (Optional)¶

After saving the captured article, if the capture_thought MCP tool is available, ingest it to Open Brain.

When to capture: After every successfully saved article.

How: 1. Check if capture_thought tool is available. If not: skip silently. 2. If the article content is empty: skip capture. 3. Call capture_thought with:

thought: "Research: {article title}. Source: {URL}. Key findings: {2-3 sentence summary}. Relevance: {toolkit/domain/general}."

4. On timeout or error: log debug message and continue. Never fail the skill.

This step is always last — it never interrupts the web-capture workflow.