
web_fetch

Web page fetching and content extraction module.

Overview

The web_fetch module provides async functions for fetching web pages and extracting their content in various formats optimized for different use cases.

Extraction Modes

| Mode | Description | Use Case |
| --- | --- | --- |
| `markdown` | LLM-friendly markdown via markitdown | AI/LLM consumption, preserves structure |
| `article` | Plain text via trafilatura | News articles, blog posts |
| `raw` | Raw HTML (truncated) | HTML analysis, debugging |
| `metadata` | Title, description, OG tags | Link previews, SEO analysis |
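
To give a feel for what the `metadata` mode returns, here is a rough stdlib-only sketch of title and Open Graph tag extraction. This is an illustration, not the module's actual `_extract_metadata` implementation, whose internals and output format are not shown in these docs:

```python
from html.parser import HTMLParser


class _MetaParser(HTMLParser):
    """Collects <title>, <meta name="description">, and og:* tags."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.meta: dict[str, str] = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            key = a.get("name") or a.get("property")
            if key == "description" or (key or "").startswith("og:"):
                self.meta[key] = a.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html: str) -> dict[str, str]:
    """Return title, description, and Open Graph tags from an HTML page."""
    parser = _MetaParser()
    parser.feed(html)
    return {"title": parser.title.strip(), **parser.meta}


html = '<html><head><title>Hi</title><meta property="og:type" content="article"></head></html>'
print(extract_metadata(html))
```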

Quick Example

import asyncio
from pvlwebtools import web_fetch, FetchConfig

async def main():
    # Basic usage
    result = await web_fetch("https://example.com")
    print(result.content)

    # With custom config
    config = FetchConfig(max_markdown_length=50_000)
    result = await web_fetch("https://example.com", config=config)

asyncio.run(main())

API Reference

web_fetch

Web page fetching and content extraction.

This module provides async functions for fetching web pages and extracting their content in various formats optimized for different use cases.

Extraction Modes
  • markdown: LLM-friendly markdown via markitdown (preserves structure)
  • article: Plain text article extraction via trafilatura
  • raw: Raw HTML content (truncated)
  • metadata: Page metadata (title, description, Open Graph tags)
Example

import asyncio
from pvlwebtools import web_fetch

async def main():
    result = await web_fetch("https://example.com")
    print(result.content)

asyncio.run(main())

Configuration

Use `FetchConfig` to customize behavior:

from pvlwebtools.web_fetch import web_fetch, FetchConfig

config = FetchConfig(max_markdown_length=50_000)
result = await web_fetch("https://example.com", config=config)

ExtractMode = Literal['markdown', 'article', 'raw', 'metadata'] module-attribute

Type alias for extraction mode options.

DEFAULT_CONFIG = FetchConfig() module-attribute

FetchConfig dataclass

Configuration for web fetching behavior.

This class allows customization of various limits and settings used during web page fetching and content extraction.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `max_markdown_length` | `int` | Maximum characters for markdown output. Content exceeding this limit is truncated with a notice. Default: 100,000 characters. |
| `max_article_length` | `int` | Maximum characters for article text output. Default: 20,000 characters. |
| `max_raw_length` | `int` | Maximum characters for raw HTML output. Default: 50,000 characters. |
| `max_content_length` | `int` | Maximum bytes to download from a URL. Requests for larger content raise `WebFetchError`. Default: 1,000,000 bytes (1 MB). |
| `request_timeout` | `float` | HTTP request timeout in seconds. Default: 15.0 seconds. |
| `min_request_interval` | `float` | Minimum seconds between requests (rate limiting). Default: 3.0 seconds. |
| `user_agent` | `str` | User-Agent header for HTTP requests. |
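
To make `min_request_interval` concrete, here is a minimal sketch of interval-based throttling. The names `enforce_rate_limit` and `_last_request` are hypothetical; the module's internal `_enforce_rate_limit` is referenced in the source below but its implementation is not shown in these docs:

```python
import asyncio
import time

_last_request = 0.0  # hypothetical module-level timestamp of the last request


async def enforce_rate_limit(min_interval: float) -> None:
    """Sleep so consecutive calls are at least min_interval seconds apart."""
    global _last_request
    wait = _last_request + min_interval - time.monotonic()
    if wait > 0:
        await asyncio.sleep(wait)
    _last_request = time.monotonic()


async def demo() -> float:
    start = time.monotonic()
    await enforce_rate_limit(0.2)  # first call: no prior request, no wait
    await enforce_rate_limit(0.2)  # second call: sleeps ~0.2 s
    return time.monotonic() - start


elapsed = asyncio.run(demo())
print(f"two calls took {elapsed:.2f}s")
```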

Example

config = FetchConfig(
    max_markdown_length=50_000,
    request_timeout=30.0,
)
result = await web_fetch(url, config=config)
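
The length limits above are documented as truncating "with a notice". A sketch of that behavior, with a hypothetical helper name and notice text (the module's exact notice wording is not shown in these docs):

```python
def truncate_with_notice(text: str, limit: int) -> str:
    """Cut text to limit characters and append a truncation notice."""
    if len(text) <= limit:
        return text
    # The notice text here is illustrative, not the module's actual wording.
    return text[:limit] + "\n\n[Content truncated]"


print(truncate_with_notice("short text", 100))   # unchanged
print(truncate_with_notice("x" * 200_000, 10))   # cut, with notice appended
```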

Source code in src/pvlwebtools/web_fetch.py
@dataclass
class FetchConfig:
    """Configuration for web fetching behavior.

    This class allows customization of various limits and settings
    used during web page fetching and content extraction.

    Attributes:
        max_markdown_length: Maximum characters for markdown output.
            Content exceeding this limit is truncated with a notice.
            Default: 100,000 characters.
        max_article_length: Maximum characters for article text output.
            Default: 20,000 characters.
        max_raw_length: Maximum characters for raw HTML output.
            Default: 50,000 characters.
        max_content_length: Maximum bytes to download from a URL.
            Requests for larger content will raise WebFetchError.
            Default: 1,000,000 bytes (1 MB).
        request_timeout: HTTP request timeout in seconds.
            Default: 15.0 seconds.
        min_request_interval: Minimum seconds between requests (rate limiting).
            Default: 3.0 seconds.
        user_agent: User-Agent header for HTTP requests.

    Example:
        >>> config = FetchConfig(
        ...     max_markdown_length=50_000,
        ...     request_timeout=30.0,
        ... )
        >>> result = await web_fetch(url, config=config)
    """

    max_markdown_length: int = 100_000
    max_article_length: int = 20_000
    max_raw_length: int = 50_000
    max_content_length: int = 1_000_000
    request_timeout: float = 15.0
    min_request_interval: float = 3.0
    user_agent: str = field(default="pvl-webtools/1.0 (https://github.com/pvliesdonk/pvl-webtools)")
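
Since `FetchConfig` is a dataclass with defaults for every field, per-call overrides can also be derived from an existing config with `dataclasses.replace`. A small sketch using a stand-in dataclass with the same fields and defaults as the listing above:

```python
from dataclasses import dataclass, replace


@dataclass
class FetchConfig:
    """Stand-in mirroring the fields and defaults shown in the source above."""
    max_markdown_length: int = 100_000
    max_article_length: int = 20_000
    max_raw_length: int = 50_000
    max_content_length: int = 1_000_000
    request_timeout: float = 15.0
    min_request_interval: float = 3.0
    user_agent: str = "pvl-webtools/1.0 (https://github.com/pvliesdonk/pvl-webtools)"


base = FetchConfig(request_timeout=30.0)
# Derive a variant without mutating the base config.
patient = replace(base, max_markdown_length=50_000)
print(patient.request_timeout, patient.max_markdown_length)
```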

FetchResult dataclass

Result from fetching and extracting content from a URL.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `url` | `str` | The URL that was fetched. |
| `content` | `str` | The extracted content (format depends on `extract_mode`). |
| `content_length` | `int` | Length of the extracted content in characters. |
| `extract_mode` | `ExtractMode` | The extraction mode that was actually used. May differ from the requested mode if fallback occurred. |

Example

result = await web_fetch("https://example.com")
print(f"Fetched {result.content_length} chars as {result.extract_mode}")

Source code in src/pvlwebtools/web_fetch.py
@dataclass
class FetchResult:
    """Result from fetching and extracting content from a URL.

    Attributes:
        url: The URL that was fetched.
        content: The extracted content (format depends on extract_mode).
        content_length: Length of the extracted content in characters.
        extract_mode: The extraction mode that was actually used.
            May differ from requested mode if fallback occurred.

    Example:
        >>> result = await web_fetch("https://example.com")
        >>> print(f"Fetched {result.content_length} chars as {result.extract_mode}")
    """

    url: str
    content: str
    content_length: int
    extract_mode: ExtractMode

WebFetchError

Bases: Exception

Exception raised when web fetching fails.

This exception is raised for various failure conditions, including:
  • Invalid URLs (empty or wrong scheme)
  • HTTP errors (4xx, 5xx responses)
  • Content too large
  • Network timeouts
  • Connection failures

Attributes:

| Name | Description |
| --- | --- |
| `message` | Human-readable error description. |

Example

try:
    result = await web_fetch("https://invalid.example")
except WebFetchError as e:
    print(f"Fetch failed: {e}")

Source code in src/pvlwebtools/web_fetch.py
class WebFetchError(Exception):
    """Exception raised when web fetching fails.

    This exception is raised for various failure conditions including:
    - Invalid URLs (empty or wrong scheme)
    - HTTP errors (4xx, 5xx responses)
    - Content too large
    - Network timeouts
    - Connection failures

    Attributes:
        message: Human-readable error description.

    Example:
        >>> try:
        ...     result = await web_fetch("https://invalid.example")
        ... except WebFetchError as e:
        ...     print(f"Fetch failed: {e}")
    """

    pass
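
The URL checks that trigger `WebFetchError` can be seen directly in the `web_fetch` source below. Here is that validation logic pulled out as a runnable sketch, with a stand-in exception class so it works without the library installed:

```python
class WebFetchError(Exception):
    """Stand-in for pvlwebtools' WebFetchError (sketch only)."""


def validate_url(url: str) -> None:
    # Mirrors the checks at the top of web_fetch in the source listing.
    if not url.strip():
        raise WebFetchError("URL cannot be empty")
    if not url.startswith(("http://", "https://")):
        raise WebFetchError("URL must start with http:// or https://")


validate_url("https://example.com")  # passes silently

try:
    validate_url("ftp://example.com")
except WebFetchError as e:
    print(f"Fetch failed: {e}")
```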

web_fetch(url, extract_mode='markdown', rate_limit=True, config=None) async

Fetch and extract content from a URL.

This is the main entry point for web content fetching. It handles the full lifecycle of fetching a URL and extracting its content in a format suitable for various use cases.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `url` | `str` | URL to fetch. Must start with `http://` or `https://`. | *required* |
| `extract_mode` | `ExtractMode` | How to extract and format content. `'markdown'`: convert to LLM-friendly markdown (preserves document structure; falls back to `'article'` if markitdown is not installed). `'article'`: extract main article text via trafilatura; good for news articles and blog posts. `'raw'`: return raw HTML (truncated per config). `'metadata'`: extract only title, description, and OG tags. | `'markdown'` |
| `rate_limit` | `bool` | Whether to enforce a minimum interval between requests. Disable for batch operations with external rate limiting. | `True` |
| `config` | `FetchConfig \| None` | Configuration for limits and timeouts. Uses `DEFAULT_CONFIG` if not provided. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `FetchResult` | A `FetchResult` with extracted content and metadata. |

Raises:

| Type | Description |
| --- | --- |
| `WebFetchError` | If the URL is invalid, the request fails, or content exceeds configured limits. |

Example

result = await web_fetch("https://example.com")
print(result.content[:100])

With custom config:

config = FetchConfig(max_markdown_length=50_000)
result = await web_fetch("https://example.com", config=config)
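
For batch operations, the docs suggest passing `rate_limit=False` and throttling externally. One common pattern is a semaphore bound on concurrency; in this sketch `fetch_one` is a stand-in for `await web_fetch(url, rate_limit=False)` so the example runs without network access:

```python
import asyncio


async def fetch_one(url: str) -> str:
    # Stand-in for: await web_fetch(url, rate_limit=False)
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def fetch_all(urls: list[str], max_concurrent: int = 3) -> list[str]:
    """Fetch many URLs with at most max_concurrent in flight."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(url: str) -> str:
        async with sem:
            return await fetch_one(url)

    return await asyncio.gather(*(bounded(u) for u in urls))


results = asyncio.run(fetch_all([f"https://example.com/{i}" for i in range(5)]))
print(len(results))
```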

Source code in src/pvlwebtools/web_fetch.py
async def web_fetch(
    url: str,
    extract_mode: ExtractMode = "markdown",
    rate_limit: bool = True,
    config: FetchConfig | None = None,
) -> FetchResult:
    """Fetch and extract content from a URL.

    This is the main entry point for web content fetching. It handles
    the full lifecycle of fetching a URL and extracting its content
    in a format suitable for various use cases.

    Args:
        url: URL to fetch. Must start with ``http://`` or ``https://``.
        extract_mode: How to extract and format content:

            - ``'markdown'``: Convert to LLM-friendly markdown (default).
              Preserves document structure. Falls back to ``'article'``
              if markitdown is not installed.
            - ``'article'``: Extract main article text via trafilatura.
              Good for news articles and blog posts.
            - ``'raw'``: Return raw HTML (truncated per config).
            - ``'metadata'``: Extract only title, description, OG tags.

        rate_limit: Whether to enforce minimum interval between requests.
            Default ``True``. Disable for batch operations with external
            rate limiting.
        config: Configuration for limits and timeouts. Uses
            :data:`DEFAULT_CONFIG` if not provided.

    Returns:
        :class:`FetchResult` with extracted content and metadata.

    Raises:
        WebFetchError: If the URL is invalid, the request fails,
            or content exceeds configured limits.

    Example:
        >>> result = await web_fetch("https://example.com")
        >>> print(result.content[:100])

        With custom config:

        >>> config = FetchConfig(max_markdown_length=50_000)
        >>> result = await web_fetch("https://example.com", config=config)
    """
    if config is None:
        config = DEFAULT_CONFIG

    if not url.strip():
        raise WebFetchError("URL cannot be empty")

    if not url.startswith(("http://", "https://")):
        raise WebFetchError("URL must start with http:// or https://")

    logger.debug(
        "web_fetch start url=%s mode=%s rate_limit=%s",
        _truncate(url),
        extract_mode,
        rate_limit,
    )

    if rate_limit:
        await _enforce_rate_limit(config.min_request_interval)

    try:
        html_content = await _fetch_url(url, config)
        actual_mode = extract_mode

        if extract_mode == "raw":
            content = html_content[: config.max_raw_length]
        elif extract_mode == "metadata":
            content = _extract_metadata(html_content)
        elif extract_mode == "markdown":
            result = _extract_markdown(html_content, config.max_markdown_length)
            if result is not None:
                content = result
            else:
                # Fallback to article extraction
                logger.debug("markitdown unavailable; falling back to article extraction")
                content = _extract_article(html_content, config.max_article_length)
                actual_mode = "article"
        else:  # article
            content = _extract_article(html_content, config.max_article_length)

        fetch_result = FetchResult(
            url=url,
            content=content,
            content_length=len(content),
            extract_mode=actual_mode,
        )

        logger.debug(
            "web_fetch success url=%s mode=%s length=%s",
            _truncate(url),
            fetch_result.extract_mode,
            fetch_result.content_length,
        )

        return fetch_result

    except httpx.HTTPError as e:
        logger.warning("HTTP error fetching %s: %s", _truncate(url), e)
        raise WebFetchError(f"HTTP error: {e}") from e
    except Exception as e:
        logger.warning("Fetch failed for %s: %s", _truncate(url), e)
        raise WebFetchError(f"Fetch failed: {e}") from e