
web_fetch

Web page fetching and content extraction module.

Overview

The web_fetch module provides async functions for fetching web pages and extracting their content in various formats optimized for different use cases.

Extraction Modes

| Mode | Description | Use Case |
| --- | --- | --- |
| `markdown` | LLM-friendly markdown via markitdown | AI/LLM consumption, preserves structure |
| `article` | Plain text via trafilatura | News articles, blog posts |
| `raw` | Raw HTML (truncated) | HTML analysis, debugging |
| `metadata` | Title, description, OG tags | Link previews, SEO analysis |
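
To give a feel for what the `metadata` mode returns, here is a rough stdlib-only sketch of title and Open Graph tag extraction. This is an illustration, not the module's actual `_extract_metadata` implementation, whose internals and output format are not shown in these docs:

```python
from html.parser import HTMLParser


class _MetaParser(HTMLParser):
    """Collects <title>, <meta name="description">, and og:* tags."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.meta: dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = a.get("name") or a.get("property")
            if key == "description" or (key or "").startswith("og:"):
                self.meta[key] = a.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html: str) -> dict[str, str]:
    """Return title, description, and Open Graph tags from an HTML page."""
    parser = _MetaParser()
    parser.feed(html)
    return {"title": parser.title.strip(), **parser.meta}


html = '<html><head><title>Hi</title><meta property="og:type" content="article"></head></html>'
print(extract_metadata(html))
```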

Quick Example

import asyncio
from pvlwebtools import web_fetch, FetchConfig

async def main():
    # Basic usage
    result = await web_fetch("https://example.com")
    print(result.content)

    # With custom config
    config = FetchConfig(max_markdown_length=50_000)
    result = await web_fetch("https://example.com", config=config)

asyncio.run(main())

API Reference

web_fetch

Web page fetching and content extraction.

This module provides async functions for fetching web pages and extracting their content in various formats optimized for different use cases.

Extraction Modes
  • markdown: LLM-friendly markdown via markitdown (preserves structure)
  • article: Plain text article extraction via trafilatura
  • raw: Raw HTML content (truncated)
  • metadata: Page metadata (title, description, Open Graph tags)
Example

import asyncio
from pvlwebtools import web_fetch

async def main():
    result = await web_fetch("https://example.com")
    print(result.content)

asyncio.run(main())

Configuration

Use `FetchConfig` to customize behavior:

from pvlwebtools.web_fetch import web_fetch, FetchConfig

config = FetchConfig(max_markdown_length=50_000)
result = await web_fetch("https://example.com", config=config)

ExtractMode = Literal['markdown', 'article', 'raw', 'metadata'] module-attribute

Type alias for extraction mode options.

DEFAULT_CONFIG = FetchConfig() module-attribute

FetchConfig dataclass

Configuration for web fetching behavior.

This class allows customization of various limits and settings used during web page fetching and content extraction.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `max_markdown_length` | `int` | Maximum characters for markdown output. Content exceeding this limit is truncated with a notice. Default: 100,000 characters. |
| `max_article_length` | `int` | Maximum characters for article text output. Default: 20,000 characters. |
| `max_raw_length` | `int` | Maximum characters for raw HTML output. Default: 50,000 characters. |
| `max_content_length` | `int` | Maximum bytes to download from a URL. Requests for larger content raise `WebFetchError`. Default: 1,000,000 bytes (1 MB). |
| `request_timeout` | `float` | HTTP request timeout in seconds. Default: 15.0 seconds. |
| `min_request_interval` | `float` | Minimum seconds between requests (rate limiting). Default: 3.0 seconds. |
| `user_agent` | `str` | User-Agent header for HTTP requests. |
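
To make `min_request_interval` concrete, here is a minimal sketch of interval-based throttling. The names `enforce_rate_limit` and `_last_request` are hypothetical; the module's internal `_enforce_rate_limit` is referenced in the source below but its implementation is not shown in these docs:

```python
import asyncio
import time

_last_request = 0.0  # hypothetical module-level timestamp of the last request


async def enforce_rate_limit(min_interval: float) -> None:
    """Sleep so consecutive calls are at least min_interval seconds apart."""
    global _last_request
    wait = _last_request + min_interval - time.monotonic()
    if wait > 0:
        await asyncio.sleep(wait)
    _last_request = time.monotonic()


async def demo() -> float:
    start = time.monotonic()
    await enforce_rate_limit(0.2)  # first call: no prior request, no wait
    await enforce_rate_limit(0.2)  # second call: sleeps ~0.2 s
    return time.monotonic() - start


elapsed = asyncio.run(demo())
print(f"two calls took {elapsed:.2f}s")
```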

Example

config = FetchConfig(
    max_markdown_length=50_000,
    request_timeout=30.0,
)
result = await web_fetch(url, config=config)
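
The length limits above are documented as truncating "with a notice". A sketch of that behavior, with a hypothetical helper name and notice text (the module's exact notice wording is not shown in these docs):

```python
def truncate_with_notice(text: str, limit: int) -> str:
    """Cut text to limit characters and append a truncation notice."""
    if len(text) <= limit:
        return text
    # The notice text here is illustrative, not the module's actual wording.
    return text[:limit] + "\n\n[Content truncated]"


print(truncate_with_notice("short text", 100))   # unchanged
print(truncate_with_notice("x" * 200_000, 10))   # cut, with notice appended
```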

Source code in src/pvlwebtools/web_fetch.py
@dataclass
class FetchConfig:
    """Configuration for web fetching behavior.

    This class allows customization of various limits and settings
    used during web page fetching and content extraction.

    Attributes:
        max_markdown_length: Maximum characters for markdown output.
            Content exceeding this limit is truncated with a notice.
            Default: 100,000 characters.
        max_article_length: Maximum characters for article text output.
            Default: 20,000 characters.
        max_raw_length: Maximum characters for raw HTML output.
            Default: 50,000 characters.
        max_content_length: Maximum bytes to download from a URL.
            Requests for larger content will raise WebFetchError.
            Default: 1,000,000 bytes (1 MB).
        request_timeout: HTTP request timeout in seconds.
            Default: 15.0 seconds.
        min_request_interval: Minimum seconds between requests (rate limiting).
            Default: 3.0 seconds.
        user_agent: User-Agent header for HTTP requests.

    Example:
        >>> config = FetchConfig(
        ...     max_markdown_length=50_000,
        ...     request_timeout=30.0,
        ... )
        >>> result = await web_fetch(url, config=config)
    """

    max_markdown_length: int = 100_000
    max_article_length: int = 20_000
    max_raw_length: int = 50_000
    max_content_length: int = 1_000_000
    request_timeout: float = 15.0
    min_request_interval: float = 3.0
    user_agent: str = field(default="pvl-webtools/1.0 (https://github.com/pvliesdonk/pvl-webtools)")
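
Since `FetchConfig` is a dataclass with defaults for every field, per-call overrides can also be derived from an existing config with `dataclasses.replace`. A small sketch using a stand-in dataclass with the same fields and defaults as the listing above:

```python
from dataclasses import dataclass, replace


@dataclass
class FetchConfig:
    """Stand-in mirroring the fields and defaults shown in the source above."""
    max_markdown_length: int = 100_000
    max_article_length: int = 20_000
    max_raw_length: int = 50_000
    max_content_length: int = 1_000_000
    request_timeout: float = 15.0
    min_request_interval: float = 3.0
    user_agent: str = "pvl-webtools/1.0 (https://github.com/pvliesdonk/pvl-webtools)"


base = FetchConfig(request_timeout=30.0)
# Derive a variant without mutating the base config.
patient = replace(base, max_markdown_length=50_000)
print(patient.request_timeout, patient.max_markdown_length)
```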

FetchResult dataclass

Result from fetching and extracting content from a URL.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `url` | `str` | The URL that was fetched. |
| `content` | `str` | The extracted content (format depends on `extract_mode`). |
| `content_length` | `int` | Length of the extracted content in characters. |
| `extract_mode` | `ExtractMode` | The extraction mode that was actually used. May differ from the requested mode if fallback occurred. |

Example

result = await web_fetch("https://example.com")
print(f"Fetched {result.content_length} chars as {result.extract_mode}")

Source code in src/pvlwebtools/web_fetch.py
@dataclass
class FetchResult:
    """Result from fetching and extracting content from a URL.

    Attributes:
        url: The URL that was fetched.
        content: The extracted content (format depends on extract_mode).
        content_length: Length of the extracted content in characters.
        extract_mode: The extraction mode that was actually used.
            May differ from requested mode if fallback occurred.

    Example:
        >>> result = await web_fetch("https://example.com")
        >>> print(f"Fetched {result.content_length} chars as {result.extract_mode}")
    """

    url: str
    content: str
    content_length: int
    extract_mode: ExtractMode

WebFetchError

Bases: Exception

Exception raised when web fetching fails.

This exception is raised for various failure conditions, including:
  • Invalid URLs (empty or wrong scheme)
  • HTTP errors (4xx, 5xx responses)
  • Content too large
  • Network timeouts
  • Connection failures

Attributes:

| Name | Description |
| --- | --- |
| `message` | Human-readable error description. |

Example

try:
    result = await web_fetch("https://invalid.example")
except WebFetchError as e:
    print(f"Fetch failed: {e}")

Source code in src/pvlwebtools/web_fetch.py
class WebFetchError(Exception):
    """Exception raised when web fetching fails.

    This exception is raised for various failure conditions including:
    - Invalid URLs (empty or wrong scheme)
    - HTTP errors (4xx, 5xx responses)
    - Content too large
    - Network timeouts
    - Connection failures

    Attributes:
        message: Human-readable error description.

    Example:
        >>> try:
        ...     result = await web_fetch("https://invalid.example")
        ... except WebFetchError as e:
        ...     print(f"Fetch failed: {e}")
    """

    pass
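
The URL checks that trigger `WebFetchError` can be seen directly in the `web_fetch` source below. Here is that validation logic pulled out as a runnable sketch, with a stand-in exception class so it works without the library installed:

```python
class WebFetchError(Exception):
    """Stand-in for pvlwebtools' WebFetchError (sketch only)."""


def validate_url(url: str) -> None:
    # Mirrors the checks at the top of web_fetch in the source listing.
    if not url.strip():
        raise WebFetchError("URL cannot be empty")
    if not url.startswith(("http://", "https://")):
        raise WebFetchError("URL must start with http:// or https://")


validate_url("https://example.com")  # passes silently

try:
    validate_url("ftp://example.com")
except WebFetchError as e:
    print(f"Fetch failed: {e}")
```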

web_fetch(url, extract_mode='markdown', rate_limit=True, config=None) async

Fetch and extract content from a URL.

This is the main entry point for web content fetching. It handles the full lifecycle of fetching a URL and extracting its content in a format suitable for various use cases.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `url` | `str` | URL to fetch. Must start with `http://` or `https://`. | *required* |
| `extract_mode` | `ExtractMode` | How to extract and format content. `'markdown'`: convert to LLM-friendly markdown (preserves document structure; falls back to `'article'` if markitdown is not installed). `'article'`: extract main article text via trafilatura; good for news articles and blog posts. `'raw'`: return raw HTML (truncated per config). `'metadata'`: extract only title, description, and OG tags. | `'markdown'` |
| `rate_limit` | `bool` | Whether to enforce a minimum interval between requests. Disable for batch operations with external rate limiting. | `True` |
| `config` | `FetchConfig \| None` | Configuration for limits and timeouts. Uses `DEFAULT_CONFIG` if not provided. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `FetchResult` | A `FetchResult` with extracted content and metadata. |

Raises:

| Type | Description |
| --- | --- |
| `WebFetchError` | If the URL is invalid, the request fails, or content exceeds configured limits. |

Example

result = await web_fetch("https://example.com")
print(result.content[:100])

With custom config:

config = FetchConfig(max_markdown_length=50_000)
result = await web_fetch("https://example.com", config=config)
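
For batch operations, the docs suggest passing `rate_limit=False` and throttling externally. One common pattern is a semaphore bound on concurrency; in this sketch `fetch_one` is a stand-in for `await web_fetch(url, rate_limit=False)` so the example runs without network access:

```python
import asyncio


async def fetch_one(url: str) -> str:
    # Stand-in for: await web_fetch(url, rate_limit=False)
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def fetch_all(urls: list[str], max_concurrent: int = 3) -> list[str]:
    """Fetch many URLs with at most max_concurrent in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with sem:
            return await fetch_one(url)

    return await asyncio.gather(*(bounded(u) for u in urls))


results = asyncio.run(fetch_all([f"https://example.com/{i}" for i in range(5)]))
print(len(results))
```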

Source code in src/pvlwebtools/web_fetch.py
async def web_fetch(
    url: str,
    extract_mode: ExtractMode = "markdown",
    rate_limit: bool = True,
    config: FetchConfig | None = None,
) -> FetchResult:
    """Fetch and extract content from a URL.

    This is the main entry point for web content fetching. It handles
    the full lifecycle of fetching a URL and extracting its content
    in a format suitable for various use cases.

    Args:
        url: URL to fetch. Must start with ``http://`` or ``https://``.
        extract_mode: How to extract and format content:

            - ``'markdown'``: Convert to LLM-friendly markdown (default).
              Preserves document structure. Falls back to ``'article'``
              if markitdown is not installed.
            - ``'article'``: Extract main article text via trafilatura.
              Good for news articles and blog posts.
            - ``'raw'``: Return raw HTML (truncated per config).
            - ``'metadata'``: Extract only title, description, OG tags.

        rate_limit: Whether to enforce minimum interval between requests.
            Default ``True``. Disable for batch operations with external
            rate limiting.
        config: Configuration for limits and timeouts. Uses
            :data:`DEFAULT_CONFIG` if not provided.

    Returns:
        :class:`FetchResult` with extracted content and metadata.

    Raises:
        WebFetchError: If the URL is invalid, the request fails,
            or content exceeds configured limits.

    Example:
        >>> result = await web_fetch("https://example.com")
        >>> print(result.content[:100])

        With custom config:

        >>> config = FetchConfig(max_markdown_length=50_000)
        >>> result = await web_fetch("https://example.com", config=config)
    """
    if config is None:
        config = DEFAULT_CONFIG

    if not url.strip():
        raise WebFetchError("URL cannot be empty")

    if not url.startswith(("http://", "https://")):
        raise WebFetchError("URL must start with http:// or https://")

    logger.debug(
        "web_fetch start url=%s mode=%s rate_limit=%s",
        _truncate(url),
        extract_mode,
        rate_limit,
    )

    if rate_limit:
        await _enforce_rate_limit(config.min_request_interval)

    try:
        html_content = await _fetch_url(url, config)
        actual_mode = extract_mode

        if extract_mode == "raw":
            content = html_content[: config.max_raw_length]
        elif extract_mode == "metadata":
            content = _extract_metadata(html_content)
        elif extract_mode == "markdown":
            result = _extract_markdown(html_content, config.max_markdown_length)
            if result is not None:
                content = result
            else:
                # Fallback to article extraction
                logger.debug("markitdown unavailable; falling back to article extraction")
                content = _extract_article(html_content, config.max_article_length)
                actual_mode = "article"
        else:  # article
            content = _extract_article(html_content, config.max_article_length)

        fetch_result = FetchResult(
            url=url,
            content=content,
            content_length=len(content),
            extract_mode=actual_mode,
        )

        logger.debug(
            "web_fetch success url=%s mode=%s length=%s",
            _truncate(url),
            fetch_result.extract_mode,
            fetch_result.content_length,
        )

        return fetch_result

    except httpx.HTTPError as e:
        logger.warning("HTTP error fetching %s: %s", _truncate(url), e)
        raise WebFetchError(f"HTTP error: {e}") from e
    except Exception as e:
        logger.warning("Fetch failed for %s: %s", _truncate(url), e)
        raise WebFetchError(f"Fetch failed: {e}") from e