web_fetch
Web page fetching and content extraction module.
Overview
The web_fetch module provides async functions for fetching web pages and extracting their content in various formats optimized for different use cases.
Extraction Modes
| Mode | Description | Use Case |
|---|---|---|
| `markdown` | LLM-friendly markdown via markitdown | AI/LLM consumption, preserves structure |
| `article` | Plain text via trafilatura | News articles, blog posts |
| `raw` | Raw HTML (truncated) | HTML analysis, debugging |
| `metadata` | Title, description, OG tags | Link previews, SEO analysis |
Quick Example

```python
import asyncio

from pvlwebtools import web_fetch, FetchConfig

async def main():
    # Basic usage
    result = await web_fetch("https://example.com")
    print(result.content)

    # With custom config
    config = FetchConfig(max_markdown_length=50_000)
    result = await web_fetch("https://example.com", config=config)

asyncio.run(main())
```
API Reference
web_fetch
Web page fetching and content extraction.
This module provides async functions for fetching web pages and extracting their content in various formats optimized for different use cases.
Extraction Modes
- markdown: LLM-friendly markdown via markitdown (preserves structure)
- article: Plain text article extraction via trafilatura
- raw: Raw HTML content (truncated)
- metadata: Page metadata (title, description, Open Graph tags)
Example

```python
import asyncio

from pvlwebtools import web_fetch

async def main():
    result = await web_fetch("https://example.com")
    print(result.content)

asyncio.run(main())
```
Configuration
Use `FetchConfig` to customize behavior:

```python
from pvlwebtools.web_fetch import web_fetch, FetchConfig

config = FetchConfig(max_markdown_length=50_000)
result = await web_fetch("https://example.com", config=config)
```
`ExtractMode = Literal['markdown', 'article', 'raw', 'metadata']` *(module attribute)*
Type alias for extraction mode options.
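Because `ExtractMode` is a `Literal` alias, a caller can validate a user-supplied string against the allowed modes with `typing.get_args`. The helper below is illustrative only and is not part of the module:

```python
from typing import Literal, get_args

# Mirrors the module's ExtractMode type alias.
ExtractMode = Literal['markdown', 'article', 'raw', 'metadata']

def is_extract_mode(value: str) -> bool:
    """Check whether a string is one of the allowed extraction modes."""
    return value in get_args(ExtractMode)
```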
`DEFAULT_CONFIG = FetchConfig()` *(module attribute)*
FetchConfig *(dataclass)*
Configuration for web fetching behavior.
This class allows customization of various limits and settings used during web page fetching and content extraction.
Attributes:

| Name | Type | Description |
|---|---|---|
| `max_markdown_length` | int | Maximum characters for markdown output. Content exceeding this limit is truncated with a notice. Default: 100,000 characters. |
| `max_article_length` | int | Maximum characters for article text output. Default: 20,000 characters. |
| `max_raw_length` | int | Maximum characters for raw HTML output. Default: 50,000 characters. |
| `max_content_length` | int | Maximum bytes to download from a URL. Requests for larger content raise WebFetchError. Default: 1,000,000 bytes (1 MB). |
| `request_timeout` | float | HTTP request timeout in seconds. Default: 15.0 seconds. |
| `min_request_interval` | float | Minimum seconds between requests (rate limiting). Default: 3.0 seconds. |
| `user_agent` | str | User-Agent header for HTTP requests. |
Example

```python
config = FetchConfig(
    max_markdown_length=50_000,
    request_timeout=30.0,
)
result = await web_fetch(url, config=config)
```
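The docs state that content exceeding `max_markdown_length` is "truncated with a notice". A minimal sketch of that behavior, where the notice text is an assumption rather than the module's actual wording:

```python
def truncate_with_notice(text: str, max_len: int) -> str:
    """Truncate text to max_len characters, appending a notice if cut."""
    if len(text) <= max_len:
        return text
    # Hypothetical notice string; the real module's wording may differ.
    return text[:max_len] + "\n\n[Content truncated]"
```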
Source code in src/pvlwebtools/web_fetch.py
FetchResult *(dataclass)*
Result from fetching and extracting content from a URL.
Attributes:

| Name | Type | Description |
|---|---|---|
| `url` | str | The URL that was fetched. |
| `content` | str | The extracted content (format depends on extract_mode). |
| `content_length` | int | Length of the extracted content in characters. |
| `extract_mode` | ExtractMode | The extraction mode that was actually used. May differ from the requested mode if a fallback occurred. |
Example

```python
result = await web_fetch("https://example.com")
print(f"Fetched {result.content_length} chars as {result.extract_mode}")
```
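A rough stand-in for the dataclass, showing how `content_length` can be derived from `content`. Field order, defaults, and the derivation mechanism are assumptions for illustration, not the real class:

```python
from dataclasses import dataclass, field

@dataclass
class FetchResultSketch:
    """Illustrative stand-in for FetchResult; not the module's class."""
    url: str
    content: str
    extract_mode: str = "markdown"
    content_length: int = field(init=False)

    def __post_init__(self) -> None:
        # content_length tracks the extracted content's character count.
        self.content_length = len(self.content)
```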
WebFetchError
Bases: Exception
Exception raised when web fetching fails.
This exception is raised for various failure conditions, including:
- Invalid URLs (empty or wrong scheme)
- HTTP errors (4xx, 5xx responses)
- Content too large
- Network timeouts
- Connection failures
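The "invalid URL" condition (empty or wrong scheme) can be approximated caller-side with `urllib.parse` before spending a network round trip. This pre-check is a sketch of the documented condition, not the module's own validation code:

```python
from urllib.parse import urlparse

def looks_fetchable(url: str) -> bool:
    """Rough pre-check matching the documented invalid-URL condition."""
    parsed = urlparse(url)
    # Reject empty URLs and non-HTTP(S) schemes.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```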
Attributes:

| Name | Type | Description |
|---|---|---|
| `message` | | Human-readable error description. |
Example

```python
try:
    result = await web_fetch("https://invalid.example")
except WebFetchError as e:
    print(f"Fetch failed: {e}")
```
`web_fetch(url, extract_mode='markdown', rate_limit=True, config=None)` *(async)*
Fetch and extract content from a URL.
This is the main entry point for web content fetching. It handles the full lifecycle of fetching a URL and extracting its content in a format suitable for various use cases.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `url` | str | URL to fetch. Must start with `http://` or `https://`. | *required* |
| `extract_mode` | ExtractMode | How to extract and format content: 'markdown', 'article', 'raw', or 'metadata'. | 'markdown' |
| `rate_limit` | bool | Whether to enforce a minimum interval between requests. | True |
| `config` | FetchConfig \| None | Configuration for limits and timeouts. Uses DEFAULT_CONFIG when None. | None |
Returns:

| Type | Description |
|---|---|
| FetchResult | A `FetchResult` containing the extracted content. |
Raises:

| Type | Description |
|---|---|
| WebFetchError | If the URL is invalid, the request fails, or content exceeds configured limits. |
Example

```python
result = await web_fetch("https://example.com")
print(result.content[:100])
```

With custom config:

```python
config = FetchConfig(max_markdown_length=50_000)
result = await web_fetch("https://example.com", config=config)
```
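When `rate_limit=True`, the function enforces `min_request_interval` seconds between requests. One plausible implementation of such a limiter is sketched below; the module's actual mechanism is internal and may differ:

```python
import asyncio
import time

class IntervalLimiter:
    """Enforce a minimum delay between successive async calls."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last = 0.0

    async def wait(self) -> None:
        # Sleep just long enough to honor the minimum interval
        # since the previous call completed.
        remaining = self._last + self.min_interval - time.monotonic()
        if remaining > 0:
            await asyncio.sleep(remaining)
        self._last = time.monotonic()
```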