Web Scraper Node
Extract content from any webpage.
Overview
The Web Scraper node fetches and extracts content from web pages. It supports CSS selectors, multiple output formats, and metadata extraction.
Configuration
| Field | Description | Required |
|---|---|---|
| URL | The webpage URL to scrape (supports variables) | Yes |
| Output Format | Plain Text, HTML, Markdown, or JSON | Yes |
| Max Words | Limit the output length (for text/markdown only) | No |
| Output Variable | Variable name to store the scraped content | Yes |
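
To make the Max Words field concrete, here is a minimal sketch of a word limit applied to text output. The function name `limit_words` is hypothetical. Splitting on whitespace and treating a missing or non-positive limit as "no limit" are both assumptions, since the node's exact counting rules are not documented here.

```python
from typing import Optional


def limit_words(text: str, max_words: Optional[int]) -> str:
    """Return at most max_words whitespace-separated words from text."""
    # Assumption: an unset or non-positive limit leaves the text untouched.
    if max_words is None or max_words <= 0:
        return text
    words = text.split()
    if len(words) <= max_words:
        return text
    # Rejoin the kept words with single spaces.
    return " ".join(words[:max_words])
```

Truncating on word boundaries keeps the output readable, at the cost of collapsing the original whitespace in the truncated result.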
Output Formats
| Format | Description |
|---|---|
| Plain Text | Clean text content, stripped of HTML |
| HTML | Raw HTML content |
| Markdown | Content converted to Markdown format |
| JSON (Structured) | Structured data with metadata |
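
The difference between the HTML and Plain Text formats can be illustrated with a short sketch built on Python's standard `html.parser` module. The names `TextExtractor` and `html_to_text` are hypothetical. Skipping `<script>` and `<style>` contents is an assumption about what "clean text" means, not a description of the node's internals.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text nodes, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

For example, `html_to_text("<h1>Title</h1><p>Hello <b>world</b></p>")` returns `"Title Hello world"`.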
Advanced Options
| Field | Description | Default |
|---|---|---|
| CSS Selector | Target specific elements (e.g., .content, #main) | None |
| Extract Metadata | Include page title, description, etc. | Off |
| Extract Links | Collect all links from the page | Off |
| Extract Images | Collect all image URLs | Off |
| Timeout | Request timeout in milliseconds | 30000 |
| User Agent | Custom user agent string | Default |
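
As a rough picture of how the Timeout and User Agent options relate to an ordinary HTTP request, the sketch below uses Python's standard `urllib.request`. The helper names `build_request` and `fetch` are hypothetical, and this is not the node's actual implementation. The one detail worth noting is the unit conversion: the node's timeout is given in milliseconds, while `urlopen` expects seconds.

```python
import urllib.request
from typing import Optional


def build_request(url: str, user_agent: Optional[str] = None) -> urllib.request.Request:
    """Create a request, overriding the User-Agent header only when one is given."""
    headers = {}
    if user_agent:
        headers["User-Agent"] = user_agent
    return urllib.request.Request(url, headers=headers)


def fetch(url: str, timeout_ms: int = 30000, user_agent: Optional[str] = None) -> bytes:
    """Fetch a page body, converting the millisecond timeout to seconds."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout_ms / 1000) as response:
        return response.read()
```

A call such as `fetch("https://example.com", timeout_ms=5000, user_agent="MyBot/1.0")` would give up after five seconds instead of the default thirty.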
CSS Selectors
Target specific page elements:
.article-content → Elements with class "article-content"
#main-content → Element with ID "main-content"
article p → All paragraphs inside article tags
[data-type="post"] → Elements with specific data attribute
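
To show what each of the four selector forms above would match, here is a deliberately small sketch using only Python's standard library. It is not a full CSS engine and not the node's implementation. `matches`, `Selector`, `select_text`, and `SAMPLE` are hypothetical names. The sketch assumes well-formed markup with explicit closing tags and handles only single simple selectors joined by spaces (the descendant combinator).

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def matches(step: str, tag: str, attrs: dict) -> bool:
    """Test one simple selector: tag, .class, #id, [attr] or [attr="value"]."""
    if step.startswith("."):
        return step[1:] in (attrs.get("class") or "").split()
    if step.startswith("#"):
        return attrs.get("id") == step[1:]
    if step.startswith("[") and step.endswith("]"):
        name, eq, value = step[1:-1].partition("=")
        if not eq:
            return name in attrs
        return attrs.get(name) == value.strip("\"'")
    return tag == step


class Selector(HTMLParser):
    def __init__(self, selector: str):
        super().__init__()
        self.steps = selector.split()
        if not self.steps:
            raise ValueError("empty selector")
        self.stack = []    # open elements: [tag, attrs, matched, text_parts]
        self.results = []  # text of matched elements, in closing order

    def _is_match(self, tag, attrs):
        *ancestors, last = self.steps
        if not matches(last, tag, attrs):
            return False
        # Ancestor steps must match open elements in order, root to leaf.
        i = 0
        for anc_tag, anc_attrs, _, _ in self.stack:
            if i < len(ancestors) and matches(ancestors[i], anc_tag, anc_attrs):
                i += 1
        return i == len(ancestors)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        matched = self._is_match(tag, attrs)
        if tag in VOID_TAGS:  # no closing tag, so no text to collect
            if matched:
                self.results.append("")
            return
        self.stack.append([tag, attrs, matched, []])

    def handle_endtag(self, tag):
        if not any(entry[0] == tag for entry in self.stack):
            return
        # Pop up to and including the nearest open element with this tag.
        while self.stack:
            open_tag, _, matched, parts = self.stack.pop()
            if matched:
                self.results.append(" ".join(" ".join(parts).split()))
            if open_tag == tag:
                break

    def handle_data(self, data):
        for entry in self.stack:
            if entry[2]:
                entry[3].append(data)


def select_text(html: str, selector: str) -> list:
    parser = Selector(selector)
    parser.feed(html)
    parser.close()
    return parser.results


SAMPLE = ('<div id="main-content"><article class="article-content">'
          '<p>First</p><p>Second <b>bold</b></p></article>'
          '<p>Outside</p><section data-type="post">Post</section></div>')
```

With this sample, `select_text(SAMPLE, "article p")` returns `["First", "Second bold"]`, while the paragraph outside the `article` element is not matched.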
Using Variables in URL
https://example.com/page/{{page_number}}
https://api.site.com/search?q={{search_term}}
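
The `{{name}}` placeholders can be thought of as simple template substitution. The sketch below is one way to reproduce that behavior outside the node. `render_url` is a hypothetical helper. Percent-encoding the substituted values is an assumption made here so that spaces and reserved characters produce a valid URL, and the node may handle encoding differently.

```python
import re
from urllib.parse import quote

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_url(template: str, variables: dict) -> str:
    """Replace each {{name}} in template with the matching variable value."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing variable: {name}")
        # Assumption: values are percent-encoded before insertion.
        return quote(str(variables[name]), safe="")

    return PLACEHOLDER.sub(substitute, template)
```

For instance, `render_url("https://api.site.com/search?q={{search_term}}", {"search_term": "web scraping"})` yields `https://api.site.com/search?q=web%20scraping`.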
Example Output (JSON format)
{
  "content": "Page content here...",
  "title": "Page Title",
  "description": "Meta description",
  "links": ["https://...", "https://..."],
  "images": ["https://...", "https://..."]
}
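
To give a sense of where the `title`, `description`, `links`, and `images` fields could come from, the following sketch derives them from raw HTML with Python's standard library. `PageSummary` and `summarize` are hypothetical names. Resolving relative URLs against a base URL is an assumption, and the `content` field is left out of this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageSummary(HTMLParser):
    """Collect the title, meta description, link targets, and image sources."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.title = ""
        self.description = ""
        self.links = []
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            # Assumption: relative links are resolved to absolute URLs.
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(html: str, base_url: str) -> dict:
    parser = PageSummary(base_url)
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "links": parser.links,
        "images": parser.images,
    }
```

Passing the result to `json.dumps` produces an object with the same keys as the example output, apart from `content`.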