Parser API
The parser module extracts YAML frontmatter and markdown sections from corpus files.
Functions
parse_file
Parse a single corpus markdown file.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| path | Path | Path to the markdown file |
Returns: A Document instance.
Raises: ValueError if the file has no valid frontmatter or its frontmatter contains invalid YAML.
Example:
```python
from pathlib import Path

from ifcraftcorpus.parser import parse_file

doc = parse_file(Path("corpus/prose-and-language/dialogue_craft.md"))
print(f"Title: {doc.title}")
print(f"Cluster: {doc.cluster}")
print(f"Topics: {', '.join(doc.topics)}")
```
parse_directory
Parse all markdown files in a directory, including its subdirectories.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| corpus_dir | Path | Root directory to search |
Returns: List of Document instances. Files that fail to parse are silently skipped.
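Because skipped files leave no trace in the returned list, it can help to check separately which files would fail. The following is a minimal sketch, not part of the parser module: `find_unparseable` is a hypothetical helper, and it assumes that corpus files use the `.md` extension and that the skipped files are the ones for which the parse function raises ValueError.

```python
from pathlib import Path
from typing import Callable


def find_unparseable(corpus_dir: Path, parse: Callable[[Path], object]) -> dict[Path, str]:
    """Map each markdown file that fails to parse to its error message.

    Hypothetical helper: `parse` is any callable with parse_file's contract,
    i.e. it raises ValueError on missing frontmatter or invalid YAML.
    """
    failures: dict[Path, str] = {}
    # rglob walks the tree recursively; sorting keeps the report order stable.
    for path in sorted(corpus_dir.rglob("*.md")):
        try:
            parse(path)
        except ValueError as exc:
            failures[path] = str(exc)
    return failures
```

With the real parser this would be called as `find_unparseable(Path("corpus"), parse_file)`. Passing the parse function in as an argument keeps the helper independent of the package and easy to exercise with a stub.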
Data Classes
Document
Represents a parsed corpus document.
```python
@dataclass
class Document:
    path: Path               # Original file path
    title: str               # From frontmatter
    summary: str             # From frontmatter
    topics: list[str]        # From frontmatter
    cluster: str             # From frontmatter
    sections: list[Section]  # Extracted sections
    content_hash: str        # SHA256 hash (first 16 chars)
    raw_content: str         # Original file content
```
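To illustrate what a truncated SHA256 hash like `content_hash` looks like, here is a small sketch. It rests on assumptions the reference above does not state: that the hash is taken over the raw file content encoded as UTF-8, and that it is rendered as lowercase hexadecimal. The function name `short_content_hash` is hypothetical.

```python
import hashlib


def short_content_hash(raw_content: str, length: int = 16) -> str:
    """Return the first `length` hex characters of the SHA-256 digest.

    Assumption: the text is hashed as UTF-8 bytes and rendered as lowercase hex;
    the parser's actual input and encoding may differ.
    """
    digest = hashlib.sha256(raw_content.encode("utf-8")).hexdigest()
    return digest[:length]
```

A short hash of this kind is typically compared between runs to tell whether a file's content has changed, without storing the full 64-character digest.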
Properties
name
Document name (filename stem without extension).
Methods
validate
Validate the document and return a list of error messages. The list is empty if the document is valid.
Validation rules:
- title: Required, minimum 5 characters
- summary: Required, 20-300 characters
- topics: Required, minimum 3 topics
- cluster: Required, must be a valid cluster name
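The rules above can be sketched as a standalone function. This is an illustration only, not the library's `validate` method: the name `validate_fields`, the error message wording, the whitespace stripping, and the treatment of the 20-300 range as inclusive are all assumptions.

```python
def validate_fields(
    title: str,
    summary: str,
    topics: list[str],
    cluster: str,
    valid_clusters: set[str],
) -> list[str]:
    """Return a list of error messages; an empty list means all rules pass."""
    errors: list[str] = []
    # Assumption: surrounding whitespace does not count toward length limits.
    if len(title.strip()) < 5:
        errors.append("title must be at least 5 characters")
    # Assumption: both bounds of the 20-300 range are inclusive.
    if not 20 <= len(summary.strip()) <= 300:
        errors.append("summary must be 20-300 characters")
    if len(topics) < 3:
        errors.append("topics must list at least 3 topics")
    if cluster not in valid_clusters:
        errors.append(f"cluster {cluster!r} is not a valid cluster name")
    return errors
```

Collecting every failure into a list, rather than raising on the first one, lets a caller report all problems with a document in a single pass.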
Section
Represents an extracted markdown section.
```python
@dataclass
class Section:
    heading: str     # Section heading text
    level: int       # Heading level (1=H1, 2=H2, 3=H3)
    content: str     # Section content (excluding heading)
    line_start: int  # 1-indexed line number
```
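To make the fields concrete, the sketch below splits markdown text into records of the same shape. It is not the parser's actual algorithm. The names `SectionSketch` and `split_sections` are hypothetical, and it assumes ATX headings (`#`, `##`, `###`), that `line_start` is the line of the heading itself counted from the start of the text passed in, and that text before the first heading is dropped. It does not handle `#` lines inside fenced code blocks.

```python
import re
from dataclasses import dataclass


@dataclass
class SectionSketch:
    """Stand-in with the same fields as Section (hypothetical name)."""
    heading: str
    level: int
    content: str
    line_start: int


# One to three '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,3})\s+(.+?)\s*$")


def split_sections(markdown: str) -> list[SectionSketch]:
    """Split markdown into H1-H3 sections; deeper headings stay in the content."""
    sections: list[SectionSketch] = []
    body: list[str] = []
    for number, line in enumerate(markdown.splitlines(), start=1):
        match = HEADING_RE.match(line)
        if match:
            # Close the previous section before starting a new one.
            if sections:
                sections[-1].content = "\n".join(body).strip()
            body = []
            sections.append(
                SectionSketch(match.group(2), len(match.group(1)), "", number)
            )
        elif sections:
            body.append(line)
    if sections:
        sections[-1].content = "\n".join(body).strip()
    return sections
```

Recording `line_start` alongside each section makes it possible to point a reader or a tool back to the exact location in the source file.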
Valid Clusters
The parser validates cluster names against this set:
```python
VALID_CLUSTERS = {
    "narrative-structure",
    "prose-and-language",
    "genre-conventions",
    "audience-and-access",
    "world-and-setting",
    "emotional-design",
    "scope-and-planning",
    "craft-foundations",
    "agent-design",
    "game-design",
}
```
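When a cluster name fails validation, the cause is often a typo. As a hedged sketch of how an authoring tool might help, the hypothetical helper below suggests the closest valid name using the standard library's `difflib`. The set is copied from the listing above so the example runs on its own; the lowercasing and the 0.6 similarity cutoff are assumptions, not parser behavior.

```python
import difflib
from typing import Optional

# Copied from the documented set so this example is self-contained.
VALID_CLUSTERS = {
    "narrative-structure",
    "prose-and-language",
    "genre-conventions",
    "audience-and-access",
    "world-and-setting",
    "emotional-design",
    "scope-and-planning",
    "craft-foundations",
    "agent-design",
    "game-design",
}


def suggest_cluster(name: str) -> Optional[str]:
    """Return the matching or closest valid cluster name, or None if nothing is close."""
    candidate = name.strip().lower()
    if candidate in VALID_CLUSTERS:
        return candidate
    # sorted() gives difflib a stable list; cutoff=0.6 is difflib's default.
    matches = difflib.get_close_matches(candidate, sorted(VALID_CLUSTERS), n=1, cutoff=0.6)
    return matches[0] if matches else None
```

A suggestion like this would belong in tooling around the corpus rather than in validation itself, since the parser only accepts exact matches against the set.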
File Format
Corpus files must have YAML frontmatter: