
Parser API

The parser module extracts YAML frontmatter and markdown sections from corpus files.

Functions

parse_file

def parse_file(path: Path) -> Document

Parse a single corpus markdown file.

Parameters:

Parameter  Type  Description
path       Path  Path to the markdown file

Returns: A Document instance.

Raises: ValueError if the file has no valid frontmatter or if the frontmatter contains invalid YAML.

Example:

from pathlib import Path
from ifcraftcorpus.parser import parse_file

doc = parse_file(Path("corpus/prose-and-language/dialogue_craft.md"))
print(f"Title: {doc.title}")
print(f"Cluster: {doc.cluster}")
print(f"Topics: {', '.join(doc.topics)}")

parse_directory

def parse_directory(corpus_dir: Path) -> list[Document]

Recursively parse all markdown files under a directory.

Parameters:

Parameter   Type  Description
corpus_dir  Path  Root directory to search

Returns: List of Document instances. Files that fail to parse are silently skipped.
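
The skip-on-failure behavior described above can be sketched without the library itself. This is a minimal illustration, not the module's actual implementation: parse_tree and toy_parse are hypothetical names, the "*.md" glob is an assumption about what counts as a markdown file, and catching only ValueError is an assumption based on the exception documented for parse_file.

```python
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def parse_tree(root: Path, parse_one: Callable[[Path], T]) -> list[T]:
    """Apply parse_one to every *.md file under root, skipping failures."""
    parsed: list[T] = []
    # rglob walks subdirectories; sorting keeps the result order stable.
    for path in sorted(root.rglob("*.md")):
        try:
            parsed.append(parse_one(path))
        except ValueError:
            # Assumption: a file that fails to parse is skipped silently.
            continue
    return parsed


def toy_parse(path: Path) -> str:
    """Hypothetical stand-in parser: rejects files without frontmatter."""
    text = path.read_text(encoding="utf-8")
    if not text.startswith("---"):
        raise ValueError(f"{path} has no frontmatter")
    return path.stem


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "nested" / "good.md").write_text("---\ntitle: Good\n---\n", encoding="utf-8")
    (root / "bad.md").write_text("no frontmatter here\n", encoding="utf-8")
    names = parse_tree(root, toy_parse)
```

Because failures are skipped silently, a caller that needs to know which files were rejected would have to call parse_file on each path directly and handle ValueError itself.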

Data Classes

Document

Represents a parsed corpus document.

@dataclass
class Document:
    path: Path              # Original file path
    title: str              # From frontmatter
    summary: str            # From frontmatter
    topics: list[str]       # From frontmatter
    cluster: str            # From frontmatter
    sections: list[Section] # Extracted sections
    content_hash: str       # SHA256 hash (first 16 chars)
    raw_content: str        # Original file content
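
A value shaped like content_hash could be computed as sketched below. This assumes the hash is taken over the UTF-8 encoded raw content and truncated from the hexadecimal digest; the source only states "SHA256 hash (first 16 chars)", and short_content_hash is a hypothetical helper name.

```python
import hashlib


def short_content_hash(raw_content: str) -> str:
    """Return the first 16 hex characters of the SHA-256 digest."""
    # Assumption: the text is hashed as UTF-8 bytes.
    digest = hashlib.sha256(raw_content.encode("utf-8")).hexdigest()
    return digest[:16]


example_hash = short_content_hash("# First Heading\n\nContent here...\n")
```

A truncated hash like this is convenient for detecting whether a file has changed since it was last parsed.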

Properties

name

@property
def name(self) -> str

Document name (filename stem without extension).
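
For illustration, the standard library exposes the filename stem as Path.stem. Whether the property is implemented exactly this way is an assumption; the path is the one used in the parse_file example.

```python
from pathlib import Path

path = Path("corpus/prose-and-language/dialogue_craft.md")

# Path.stem drops the directory part and the final suffix.
name = path.stem
```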

Methods

validate

def validate(self) -> list[str]

Validate the document and return a list of error messages. Returns an empty list if the document is valid.

Validation rules:

  • title: Required, minimum 5 characters
  • summary: Required, 20-300 characters
  • topics: Required, minimum 3 topics
  • cluster: Required, must be a valid cluster name
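
The rules above can be sketched as a standalone function. This is an illustration under stated assumptions, not the library's code: validate_fields is a hypothetical name, the error message wording is invented, the 20-300 bounds are treated as inclusive, and an empty value is treated as failing the "required" check.

```python
VALID_CLUSTERS = {
    "narrative-structure",
    "prose-and-language",
    "genre-conventions",
    "audience-and-access",
    "world-and-setting",
    "emotional-design",
    "scope-and-planning",
    "craft-foundations",
    "agent-design",
    "game-design",
}


def validate_fields(title: str, summary: str, topics: list[str], cluster: str) -> list[str]:
    """Return one message per violated rule; an empty list means valid."""
    errors: list[str] = []
    if len(title) < 5:
        errors.append("title must be at least 5 characters")
    if not 20 <= len(summary) <= 300:  # assumption: bounds are inclusive
        errors.append("summary must be 20-300 characters")
    if len(topics) < 3:
        errors.append("at least 3 topics are required")
    if cluster not in VALID_CLUSTERS:
        errors.append(f"unknown cluster: {cluster!r}")
    return errors
```

Returning every message at once, rather than raising on the first failure, lets a caller report all problems with a file in a single pass.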

Section

Represents an extracted markdown section.

@dataclass
class Section:
    heading: str    # Section heading text
    level: int      # Heading level (1=H1, 2=H2, 3=H3)
    content: str    # Section content (excluding heading)
    line_start: int # 1-indexed line number
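
One way sections like these could be extracted is a line-by-line scan for ATX headings. The sketch below is an assumption-laden illustration: extract_sections and HEADING_RE are hypothetical names, line_start is taken to be the heading's line within the text passed in, only levels 1 to 3 are recognized, and lines starting with "#" inside fenced code blocks are not treated specially.

```python
import re
from dataclasses import dataclass


@dataclass
class Section:
    heading: str
    level: int
    content: str
    line_start: int


# One to three '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,3})\s+(.+?)\s*$")


def extract_sections(markdown: str) -> list[Section]:
    """Split markdown text into sections, one per H1-H3 heading."""
    sections: list[Section] = []
    current = None
    body: list[str] = []
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = HEADING_RE.match(line)
        if match:
            if current is not None:
                # Close the previous section before starting a new one.
                current.content = "\n".join(body).strip()
                sections.append(current)
            current = Section(match.group(2), len(match.group(1)), "", lineno)
            body = []
        elif current is not None:
            body.append(line)
    if current is not None:
        current.content = "\n".join(body).strip()
        sections.append(current)
    return sections
```

Text that appears before the first heading is discarded in this sketch; the source does not say how the parser treats it.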

Valid Clusters

The parser validates cluster names against this set:

VALID_CLUSTERS = {
    "narrative-structure",
    "prose-and-language",
    "genre-conventions",
    "audience-and-access",
    "world-and-setting",
    "emotional-design",
    "scope-and-planning",
    "craft-foundations",
    "agent-design",
    "game-design",
}

File Format

Corpus files must begin with YAML frontmatter, followed by the markdown body:

---
title: Document Title
summary: A brief summary of the document content (20-300 chars)
topics:
  - topic1
  - topic2
  - topic3
cluster: prose-and-language
---

# First Heading

Content here...

## Second Heading

More content...
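
To make the layout concrete, the sketch below separates the frontmatter block from the markdown body using only the "---" delimiter lines. split_frontmatter is a hypothetical helper, not part of the documented API, and the error messages are invented. It returns the YAML as raw text; decoding it into fields would need a YAML library, which is left out here.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Return (yaml_text, body) for a file that starts with frontmatter."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("file does not start with a '---' delimiter")
    # Look for the closing delimiter after the opening one.
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            yaml_text = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:])
            return yaml_text, body
    raise ValueError("frontmatter is not closed by a second '---' line")


sample = (
    "---\n"
    "title: Document Title\n"
    "cluster: prose-and-language\n"
    "---\n"
    "\n"
    "# First Heading\n"
    "\n"
    "Content here...\n"
)
yaml_text, body = split_frontmatter(sample)
```

Raising ValueError for a missing or unclosed block mirrors the error documented for parse_file, though the real parser's checks may differ.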