vt-c-web-capture¶
Capture a web article as markdown and save it to the toolkit intake inbox for later qualification and evaluation.
Plugin: core-standards
Category: Other
Command: /vt-c-web-capture
/vt-c-web-capture — Web Article Capture¶
Fetch a web article by URL, extract metadata, preview for approval, and save as markdown to the toolkit intake inbox.
When to Use¶
- You found an interesting article about Claude Code patterns, MCP servers, AI workflows, or related topics
- You want to capture external knowledge for later qualification via /vt-c-inbox-qualify
- You need a fast path from "this is useful" to "it's in the inbox"
Invocation¶
/vt-c-web-capture <URL>
Example:
/vt-c-web-capture https://example.com/interesting-article
Prerequisites¶
- The inbox directory must exist: TOOLKIT_ROOT/intake/inbox/
- Optional: Playwright installed for Cloudflare-protected sites: npx playwright install chromium
Execution Instructions¶
Step 1: Validate Input¶
- Check if a URL argument was provided.
If no URL argument:
Usage: /vt-c-web-capture <URL>
Provide the full URL of the article to capture.
Example: /vt-c-web-capture https://example.com/interesting-article
- Check that the URL starts with http:// or https://.
If malformed URL:
Invalid URL format. Provide a full URL starting with http:// or https://
Example: /vt-c-web-capture https://example.com/article
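The two checks in Step 1 can be sketched as a small function. This is only an illustration: `validateCaptureInput` is a hypothetical helper name, and the extra `new URL()` parse goes slightly beyond the prefix check the step requires.

```javascript
// Hypothetical helper: returns { ok: true, url } or { ok: false, message }.
function validateCaptureInput(arg) {
  const usage = [
    'Usage: /vt-c-web-capture <URL>',
    'Provide the full URL of the article to capture.',
    'Example: /vt-c-web-capture https://example.com/interesting-article',
  ].join('\n');
  const invalid = [
    'Invalid URL format. Provide a full URL starting with http:// or https://',
    'Example: /vt-c-web-capture https://example.com/article',
  ].join('\n');

  // No argument at all: show the usage help.
  if (typeof arg !== 'string' || arg.trim() === '') {
    return { ok: false, message: usage };
  }

  const candidate = arg.trim();

  // Required check from Step 1: the scheme prefix must be present.
  if (!/^https?:\/\//i.test(candidate)) {
    return { ok: false, message: invalid };
  }

  // Assumed extra check: reject strings the URL parser cannot handle.
  try {
    new URL(candidate);
  } catch {
    return { ok: false, message: invalid };
  }
  return { ok: true, url: candidate };
}
```

Returning the message instead of printing it keeps the check easy to exercise without a terminal.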
Step 2: Fetch Article (Tiered Fallback)¶
This step uses a three-tier fallback chain to maximize capture success:
Tier 1: WebFetch (built-in, fast, works for most sites)
↓ fails (403, Cloudflare, empty)
Tier 2: Playwright (real browser, bypasses Cloudflare)
↓ not installed or fails
Tier 3: Clipboard paste (user copies from browser)
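As a rough sketch, the chain above amounts to trying each tier in order and moving on when a tier throws or returns an unusable body. The names `captureWithFallback` and `isUsable`, the `{ name, run }` tier shape, and both word thresholds are assumptions made for illustration; the skill itself only says "longer than a few sentences".

```javascript
// Assumed thresholds, not taken from the skill text.
const MIN_BODY_WORDS = 20;
const CHALLENGE_MAX_WORDS = 60;

function wordCount(text) {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

// A result is usable when it has a body that is neither tiny nor a
// short Cloudflare-style "Just a moment..." interstitial.
function isUsable(result) {
  if (!result || typeof result.body !== 'string') return false;
  const words = wordCount(result.body);
  if (/just a moment\.\.\./i.test(result.body) && words < CHALLENGE_MAX_WORDS) {
    return false;
  }
  return words >= MIN_BODY_WORDS;
}

// tiers: [{ name, run }] where run(url) resolves to { title, author, body }.
async function captureWithFallback(url, tiers) {
  const failures = [];
  for (const { name, run } of tiers) {
    try {
      const result = await run(url);
      if (isUsable(result)) return { tier: name, ...result, failures };
      failures.push(`${name}: empty or challenge page`);
    } catch (err) {
      failures.push(`${name}: ${err.message}`);
    }
  }
  // Every tier failed; the caller decides how to report it.
  return { tier: null, failures };
}
```

Collecting the failure reasons makes it possible to tell the user why the earlier tiers were skipped.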
Tier 1: WebFetch¶
- Use WebFetch to retrieve the article content:
WebFetch:
url: <the provided URL>
prompt: |
Extract the following from this page:
1. TITLE: The article title (from <title>, <h1>, or og:title)
2. AUTHOR: The author name (from byline, author meta tag, or "Unknown")
3. BODY: The full article content as clean markdown. Preserve headings,
code blocks, lists, and links. Remove navigation, ads, footers,
and sidebar content.
Return in this exact format:
TITLE: [title here]
AUTHOR: [author here]
BODY:
[full article markdown here]
- Check the response.
If WebFetch succeeds (returns article content with a body longer than a few sentences): Continue to Step 3.
If WebFetch reports a cross-host redirect: Report the redirect to the user and make a second WebFetch request to the redirect URL.
Non-article URLs (GitHub repos, API docs index pages, etc.): Accept whatever WebFetch returns. The content may be less structured but is still useful as reference material.
If WebFetch fails (403 error, Cloudflare challenge page, empty response, or body contains only "Just a moment..." / challenge HTML): Proceed to Tier 2.
Tier 2: Playwright¶
- Check if Playwright is available:
If Playwright is not installed:
Playwright not available. Install with: npx playwright install chromium
Falling back to clipboard capture...
- If Playwright is available, fetch the page with a real browser:
npx playwright test --browser chromium -c - <<'SCRIPT'
const { chromium } = require('playwright');
(async () => {
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto(process.env.TARGET_URL, { waitUntil: 'networkidle', timeout: 30000 });
// Wait for content to render (Cloudflare challenge, lazy loading)
await page.waitForTimeout(3000);
const title = await page.title();
const content = await page.evaluate(() => {
// Try to find the article body
const article = document.querySelector('article') || document.querySelector('[role="main"]') || document.body;
return article.innerText;
});
console.log('TITLE: ' + title);
console.log('AUTHOR: Unknown');
console.log('BODY:');
console.log(content);
await browser.close();
})();
SCRIPT
IMPORTANT: The above is a conceptual reference. In practice, run the Playwright fetch as a single Bash command. Use node -e with an inline script:
TARGET_URL="<the URL>" node -e "
const { chromium } = require('playwright');
(async () => {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(process.env.TARGET_URL, { waitUntil: 'networkidle', timeout: 30000 });
    // Give a Cloudflare challenge or lazy-loaded content time to settle
    await page.waitForTimeout(3000);
    const title = await page.title();
    const content = await page.evaluate(() => {
      // Prefer the article element, then the main landmark, then the whole page
      const el = document.querySelector('article') || document.querySelector('[role=\"main\"]') || document.body;
      return el.innerText;
    });
    console.log('TITLE: ' + title);
    console.log('AUTHOR: Unknown');
    console.log('BODY:');
    console.log(content);
  } finally {
    // Close the browser even when navigation times out
    await browser.close();
  }
})().catch((err) => { console.error(err.message); process.exit(1); });
"
- Check the Playwright output.
If Playwright succeeds (returns content with a body): Continue to Step 3 using the Playwright output.
If Playwright fails (timeout, crash, still blocked): Proceed to Tier 3.
Tier 3: Clipboard Paste¶
When both WebFetch and Playwright fail, fall back to manual content capture.
- Display:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Clipboard Capture Mode
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Automated fetch failed for this URL.
You can still capture the article by pasting its content.
Instructions:
1. Open the article in your browser
2. Select all article text (Cmd+A or select the article body)
3. Copy it (Cmd+C)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
- Read the clipboard content via Bash:
- Check the clipboard content.
If clipboard is empty or contains non-article content (< 50 words): Use AskUserQuestion:
- Try again — User copies the content and we re-read clipboard
- Cancel — Abort capture
If clipboard has content: Use it as the article body. The URL is already known from the argument. Title and author will be extracted from the pasted content (or use fallbacks from Step 3) in the next step.
- Display: Continue to Step 3.
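The clipboard check in Tier 3 comes down to a word count against the 50-word threshold. A minimal sketch, assuming the clipboard text has already been read into a string; `evaluateClipboard` is a hypothetical helper, not part of the skill:

```javascript
const MIN_CLIPBOARD_WORDS = 50; // threshold stated in Tier 3

function countWords(text) {
  return (text || '').trim().split(/\s+/).filter(Boolean).length;
}

// Hypothetical helper: decides what Tier 3 does with the pasted text.
function evaluateClipboard(text) {
  const words = countWords(text);
  if (words < MIN_CLIPBOARD_WORDS) {
    // Too short to be an article: offer the two choices from the text.
    return { accept: false, words, options: ['Try again', 'Cancel'] };
  }
  return { accept: true, words, body: text.trim() };
}
```

A word count cannot tell an article from other long text, so this only catches the empty and near-empty cases.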
Step 3: Extract Metadata¶
From the fetch result (WebFetch or Playwright output, or pasted clipboard content):
- Title: Parse from the TITLE: line in the response. If no clear title is found, use the domain name + URL path as a fallback (e.g., docs.anthropic.com — claude-code overview).
- Author: Parse from the AUTHOR: line. Default to "Unknown" if not found.
- Tags: Generate 3-5 topic tags from the article content. Use lowercase, hyphenated terms relevant to the toolkit (e.g., claude-code, mcp-servers, skill-patterns, prompt-engineering).
- Word count: Count words in the article body.
- Filename: Generate in format YYYY-MM-DD {slug}.md where:
  - YYYY-MM-DD is today's date
  - {slug} is a brief, lowercased, hyphenated version of the title
  - Maximum 50 characters for the slug portion
  - Truncate at a word boundary (do not cut words in half)
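The parsing and filename rules in this step could look roughly like the sketch below. `parseCapture`, `makeSlug`, and `makeFilename` are hypothetical names. Tag generation and the domain-based title fallback are left out, and the hard cut for a single over-long word is an assumption the step does not cover.

```javascript
// Parse the "TITLE: / AUTHOR: / BODY:" format requested in Step 2.
function parseCapture(raw) {
  const field = (name) => {
    const m = raw.match(new RegExp(`^${name}:[ \\t]*(.*)$`, 'm'));
    return m ? m[1].trim() : '';
  };
  const marker = raw.match(/^BODY:[ \t]*\r?\n?/m);
  const body = marker ? raw.slice(marker.index + marker[0].length).trim() : '';
  return {
    title: field('TITLE'),
    author: field('AUTHOR') || 'Unknown',
    body,
    words: body.split(/\s+/).filter(Boolean).length,
  };
}

// Lowercase, hyphenate, and truncate at a word boundary.
function makeSlug(title, maxLen = 50) {
  const words = title.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim()
    .split(' ').filter(Boolean);
  let slug = '';
  for (const word of words) {
    const next = slug ? `${slug}-${word}` : word;
    if (next.length > maxLen) break;
    slug = next;
  }
  // Assumption: a single word longer than maxLen is cut as a last resort.
  return slug || (words[0] || 'untitled').slice(0, maxLen);
}

// Follows the format string exactly as written in this step.
function makeFilename(title, isoDate) {
  return `${isoDate} ${makeSlug(title)}.md`;
}
```

The date is passed in as a string so the caller decides whether "today" means local time or UTC.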
Step 4: Check for Duplicates¶
- Use Glob to check for existing files in the inbox with the same date prefix:
- If a file with a very similar slug already exists:
Use AskUserQuestion:
- Overwrite — Replace the existing file
- Rename — Append a suffix (e.g., -2) to the filename
- Cancel — Abort without writing
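One possible reading of the duplicate check, sketched over a plain list of filenames as a Glob call might return them. The step does not define "very similar", so the prefix comparison below is an assumption, and `findSimilar` and `withSuffix` are hypothetical helpers.

```javascript
// Drop the date prefix and the .md extension to get the slug.
function slugOf(filename) {
  return filename.replace(/^\d{4}-\d{2}-\d{2}[ -]?/, '').replace(/\.md$/, '');
}

// Assumed similarity rule: same date, and one slug is a prefix of the other.
function findSimilar(filename, existing) {
  const date = filename.slice(0, 10);
  const slug = slugOf(filename);
  return existing.filter((name) => {
    if (!name.startsWith(date)) return false;
    const other = slugOf(name);
    if (!other) return false;
    return other.startsWith(slug) || slug.startsWith(other);
  });
}

// "Rename" option: append -2, -3, ... until the name is free.
function withSuffix(filename, existing) {
  const base = filename.replace(/\.md$/, '');
  let n = 2;
  while (existing.includes(`${base}-${n}.md`)) n += 1;
  return `${base}-${n}.md`;
}
```

For example, with `2025-01-05 hello-world.md` already in the inbox, the rename option would propose `2025-01-05 hello-world-2.md`.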
Step 5: Show Preview¶
Display the extracted metadata for user review:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Article Preview
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Title: [extracted title]
Author: [extracted author]
Tags: [tag1, tag2, tag3, tag4]
Words: [word count]
Filename: [proposed filename]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
If word count is less than 200:
Warning: Article body is very short (N words). This may indicate
paywall-restricted content or a page that didn't render fully.
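For illustration, the preview and its short-article warning can be built as one string. `renderPreview` is a hypothetical helper and the rule width of 65 characters is an arbitrary choice.

```javascript
const RULE = '━'.repeat(65); // assumed width
const SHORT_ARTICLE_WORDS = 200; // threshold from Step 5

function renderPreview({ title, author, tags, words, filename }) {
  const lines = [
    RULE,
    'Article Preview',
    RULE,
    `Title: ${title}`,
    `Author: ${author}`,
    `Tags: ${tags.join(', ')}`,
    `Words: ${words}`,
    `Filename: ${filename}`,
    RULE,
  ];
  if (words < SHORT_ARTICLE_WORDS) {
    // Likely a paywall or an incompletely rendered page.
    lines.push(
      `Warning: Article body is very short (${words} words). This may indicate`,
      "paywall-restricted content or a page that didn't render fully."
    );
  }
  return lines.join('\n');
}
```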
Step 6: User Confirmation¶
Use AskUserQuestion:
- Save — Write the file as previewed
- Edit — Adjust title, tags, or filename before saving
- Cancel — Abort without writing
If user chooses "Edit": Use AskUserQuestion to ask what to change:
- Title — Enter a new title (regenerates the filename slug)
- Tags — Enter new comma-separated tags
- Filename — Enter a custom filename
After edits, re-display the updated preview and ask for confirmation again.
If user chooses "Cancel":
Exit without further action.
Step 7: Write File¶
- First, verify the inbox directory exists:
If inbox directory does not exist:
Error: Inbox not found at TOOLKIT_ROOT/intake/inbox/
Is V025-claude-toolkit checked out at the expected path?
- Compose the file with YAML frontmatter and article body:
---
title: "[article title]"
author: "[author name]"
date: YYYY-MM-DD
url: "[original URL]"
tags: [tag1, tag2, tag3]
source: "web-capture"
---
[article body as markdown]
- Write to the absolute path:
- Verify the file was created:
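Composing the frontmatter from Step 7 might be sketched as follows. `composeArticleFile` is a hypothetical helper. Quoting string values with `JSON.stringify` is an assumption that relies on JSON strings being valid YAML double-quoted scalars, which keeps titles containing quotes from breaking the frontmatter.

```javascript
// JSON string syntax doubles as a YAML double-quoted scalar.
function yamlString(value) {
  return JSON.stringify(String(value));
}

// meta.date is expected as YYYY-MM-DD; tags as an array of plain tag names.
function composeArticleFile({ title, author, date, url, tags }, body) {
  return [
    '---',
    `title: ${yamlString(title)}`,
    `author: ${yamlString(author)}`,
    `date: ${date}`,
    `url: ${yamlString(url)}`,
    `tags: [${tags.join(', ')}]`,
    'source: "web-capture"',
    '---',
    body.trim(),
    '', // trailing newline
  ].join('\n');
}
```

The returned string is what would be written to the inbox path; the write and the existence check stay with the tools the skill already uses.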
Step 8: Summary¶
Display confirmation and next steps:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Article Captured
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Saved: intake/inbox/[filename]
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
NEXT STEPS:
• Qualify for intake: /vt-c-inbox-qualify
• Capture another: /vt-c-web-capture <URL>
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Edge Cases Reference¶
| ID | Scenario | Handling |
|---|---|---|
| EC-001 | Paywalled article (short content) | Warn about low word count in preview (Step 5), let user decide |
| EC-002 | URL returns 404 or no content | "URL returned no content" error, exit (Step 2) |
| EC-003 | No URL argument provided | Display usage help, exit (Step 1) |
| EC-004 | Malformed URL (missing protocol) | Format hint, exit (Step 1) |
| EC-005 | Cross-host redirect | Report redirect, follow with second WebFetch (Step 2) |
| EC-006 | Very long title (>100 chars) | Truncate slug to 50 chars at word boundary (Step 3) |
| EC-007 | Duplicate filename in inbox | Warn, offer overwrite/rename/cancel (Step 4) |
| EC-008 | Inbox directory missing | Error with path check hint, exit (Step 7) |
| EC-009 | No discernible article title | Domain + path fallback slug (Step 3) |
| EC-010 | Non-article URL (repo, docs index) | Accept content as-is (Step 2) |
| EC-011 | Cloudflare/WAF blocks WebFetch (403) | Tiered fallback: Playwright → Clipboard (Step 2) |
| EC-012 | Playwright not installed | Skip to clipboard fallback with install hint (Step 2 Tier 2) |
| EC-013 | Clipboard empty or non-article | Ask user to try again or cancel (Step 2 Tier 3) |
Integration Points¶
| Skill | Relationship |
|---|---|
| /vt-c-inbox-qualify | Triages captured articles into knowledge library or intake proposals |
| /vt-c-content-evaluate | Deep evaluation of captured articles against toolkit inventory |
| /vt-c-research-ingest | Also scans intake/inbox/ as a source |
Troubleshooting¶
Cloudflare-blocked sites (Medium, Substack, etc.)¶
The skill automatically falls through the tiered fetch chain:
1. WebFetch tries first (fast, works for most sites)
2. Playwright tries second if WebFetch gets a 403 (real browser, bypasses Cloudflare)
3. Clipboard paste as last resort (user copies from their browser)
To enable Tier 2, install Playwright once:
npx playwright install chromium
WebFetch returns garbled content¶
Some sites render poorly through WebFetch. Options:
1. Try the article's reader-mode or print URL if available
2. Let the fallback chain proceed to Playwright (uses a real browser)
3. Accept partial content — it's still searchable in the knowledge library
"Inbox not found" error¶
The inbox path is absolute: TOOLKIT_ROOT/intake/inbox/
If V025-claude-toolkit is checked out at a different location, the path in this skill needs updating.
Open Brain Capture (Optional)¶
After saving the captured article, if the capture_thought MCP tool is available, ingest it to Open Brain.
When to capture: After every successfully saved article.
How:
1. Check if capture_thought tool is available. If not: skip silently.
2. If the article content is empty: skip capture.
3. Call capture_thought with:
thought: "Research: {article title}. Source: {URL}. Key findings: {2-3 sentence summary}. Relevance: {toolkit/domain/general}."
This step is always last — it never interrupts the web-capture workflow.
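The three rules above can be sketched as a guard around the tool call. Here `captureThought` is a stand-in parameter for however the `capture_thought` MCP tool is exposed, and the `{ thought }` argument shape is an assumption; only the thought template comes from this section.

```javascript
// Build the thought string, or return null when there is nothing to capture.
function buildThought({ title, url, summary, relevance }, body) {
  if (!body || body.trim() === '') return null;
  const findings = summary.trim().replace(/\.+$/, ''); // avoid a doubled period
  return `Research: ${title}. Source: ${url}. Key findings: ${findings}. Relevance: ${relevance}.`;
}

// Never throws: a failed or missing tool must not interrupt the workflow.
async function captureToOpenBrain(article, body, captureThought) {
  if (typeof captureThought !== 'function') return false; // tool unavailable
  const thought = buildThought(article, body);
  if (thought === null) return false; // empty article
  try {
    await captureThought({ thought });
    return true;
  } catch {
    return false;
  }
}
```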