Corpus API¶

The Corpus class is the main interface for searching the IF Craft Corpus.

Basic Usage¶

from ifcraftcorpus import Corpus

corpus = Corpus()
results = corpus.search("dialogue subtext")

Class Reference¶

Corpus¶

class Corpus:
    def __init__(
        self,
        *,
        corpus_dir: Path | None = None,
        index_path: Path | None = None,
        embeddings_path: Path | None = None,
        use_bundled: bool = True,
    ) -> None

Parameters¶

Parameter	Type	Default	Description
`corpus_dir`	`Path | None`	`None`	Path to corpus markdown files. Uses bundled corpus if None.
`index_path`	`Path | None`	`None`	Path to pre-built SQLite index. Builds in-memory if None.
`embeddings_path`	`Path | None`	`None`	Path to pre-built embeddings. Disables semantic search if None.
`use_bundled`	`bool`	`True`	Use bundled corpus files when `corpus_dir` is None.
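
Since all three path parameters accept None, one way to build the constructor arguments is to pass a path only when it exists on disk. This is a minimal sketch: the function name `corpus_kwargs` and the directory layout (`corpus/`, `index.db`, `embeddings/`) are hypothetical and not part of the library.

```python
from pathlib import Path
from typing import Optional


def corpus_kwargs(base_dir: Path) -> dict[str, Optional[Path]]:
    """Map the documented path parameters to existing paths, or None."""
    # Hypothetical layout: these file and directory names are assumptions
    # made for illustration, not names the library prescribes.
    candidates = {
        "corpus_dir": base_dir / "corpus",
        "index_path": base_dir / "index.db",
        "embeddings_path": base_dir / "embeddings",
    }
    # A missing path becomes None, which selects the documented default
    # behaviour for that parameter.
    return {
        name: (path if path.exists() else None)
        for name, path in candidates.items()
    }
```

The result can then be unpacked into the constructor, for example `Corpus(**corpus_kwargs(Path("data")))`.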

Methods¶

search¶

def search(
    self,
    query: str,
    *,
    cluster: str | None = None,
    limit: int = 10,
    mode: Literal["keyword", "semantic", "hybrid"] = "keyword",
) -> list[CorpusResult]

Search the corpus for matching content.

Parameters:

Parameter	Type	Default	Description
`query`	`str`	required	Search query text
`cluster`	`str | None`	`None`	Filter results to specific cluster
`limit`	`int`	`10`	Maximum results to return
`mode`	`Literal["keyword", "semantic", "hybrid"]`	`"keyword"`	Search mode: `keyword`, `semantic`, or `hybrid`

Returns: List of CorpusResult objects.

Example:

# Basic keyword search
results = corpus.search("character voice")

# Filter by cluster
results = corpus.search("tension", cluster="emotional-design", limit=5)

# Semantic search (requires embeddings)
results = corpus.search("scary atmosphere", mode="semantic")
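
This page does not state what `search` does when `mode="semantic"` is requested without embeddings. A cautious caller can check `has_semantic_search` first and choose the mode accordingly. The helper below is a sketch; `search_with_fallback` is a hypothetical name, and it accepts any object exposing the documented `search` method and `has_semantic_search` property.

```python
from typing import Any


def search_with_fallback(corpus: Any, query: str, *, limit: int = 10) -> list:
    """Use semantic search when embeddings are loaded, else keyword search."""
    # has_semantic_search is the documented property; mode and limit are
    # documented keyword-only arguments of search().
    mode = "semantic" if corpus.has_semantic_search else "keyword"
    return corpus.search(query, limit=limit, mode=mode)
```

Usage would look like `results = search_with_fallback(corpus, "scary atmosphere", limit=5)`.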

get_document¶

def get_document(self, name: str) -> dict | None

Retrieve a specific document by name.

Parameters:

Parameter	Type	Description
`name`	`str`	Document name (filename stem without `.md`)

Returns: Document dict with keys: name, path, title, summary, cluster, topics, sections. Returns None if not found.

Example:

doc = corpus.get_document("dialogue_craft")
if doc:
    print(f"Title: {doc['title']}")
    for section in doc['sections']:
        print(f"  - {section['heading']}")
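
The same traversal can be wrapped in a small function that also handles the None case. This is a sketch that relies only on the `title` and `sections` keys and the `heading` field used in the example above; `outline` is a hypothetical helper name.

```python
from typing import Optional


def outline(doc: Optional[dict]) -> list[str]:
    """Return the document title followed by one indented line per section."""
    # get_document() returns None for an unknown name, so handle that first.
    if doc is None:
        return []
    lines = [doc["title"]]
    for section in doc["sections"]:
        lines.append(f"  - {section['heading']}")
    return lines
```

For example, `print("\n".join(outline(corpus.get_document("dialogue_craft"))))` prints nothing extra when the document is missing.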

list_documents¶

def list_documents(self) -> list[dict[str, str]]

List all documents in the corpus.

Returns: List of dicts with keys: name, title, cluster, topics.
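
The returned dicts are convenient for building an overview of the corpus. The sketch below groups document names by cluster using only the `name` and `cluster` keys listed above; `group_by_cluster` is a hypothetical helper, not a library function.

```python
from collections import defaultdict


def group_by_cluster(documents: list[dict]) -> dict[str, list[str]]:
    """Map each cluster name to the sorted names of its documents."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for doc in documents:
        grouped[doc["cluster"]].append(doc["name"])
    # Sort clusters and names so the overview is stable between runs.
    return {cluster: sorted(names) for cluster, names in sorted(grouped.items())}
```

It would typically be called as `group_by_cluster(corpus.list_documents())`.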

list_clusters¶

def list_clusters(self) -> list[str]

Get all available cluster names.

Returns: Sorted list of cluster name strings.

document_count¶

def document_count(self) -> int

Get total number of documents in the corpus.

has_semantic_search¶

@property
def has_semantic_search(self) -> bool

True if semantic search is available (embeddings loaded), otherwise False.

close¶

def close(self) -> None

Close database connections and release resources.

CorpusResult¶

Search results are returned as CorpusResult dataclass instances.

@dataclass
class CorpusResult:
    document_name: str      # Name of source document
    title: str              # Document title
    cluster: str            # Topic cluster
    section_heading: str | None  # Section heading (if from section)
    content: str            # Matched content
    score: float            # Relevance score
    topics: list[str]       # Document topics
    search_type: Literal["keyword", "semantic"]

Properties¶

source¶

@property
def source(self) -> str

Human-readable source reference, e.g., "dialogue_craft > Subtext Techniques".
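
The exact implementation is not shown here. A stand-in that reproduces the documented format might look like the sketch below. `ResultSketch` is a hypothetical class, and returning only the document name when `section_heading` is None is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResultSketch:
    """Stand-in with only the two fields the source string needs."""

    document_name: str
    section_heading: Optional[str] = None

    @property
    def source(self) -> str:
        # Documented format: "document_name > section_heading".
        if self.section_heading:
            return f"{self.document_name} > {self.section_heading}"
        # Assumption: without a section heading, fall back to the name alone.
        return self.document_name
```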

Context Manager¶

The Corpus class supports the context manager protocol for automatic cleanup:

with Corpus() as corpus:
    results = corpus.search("pacing")
    # ... process results
# Resources automatically released
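
How Corpus implements the protocol is not shown on this page. The usual Python pattern is for `__exit__` to delegate to `close()`, so cleanup runs even when the body raises. The stand-in class below sketches that pattern; `ClosableSketch` is hypothetical and is not the library's code.

```python
class ClosableSketch:
    """Stand-in showing the usual close()-on-exit pattern."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        # A real implementation would release database connections here.
        self.closed = True

    def __enter__(self) -> "ClosableSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        # Returning False lets any exception from the with-body propagate.
        return False
```

Under this pattern, a `with` block is equivalent to calling `close()` in a `try`/`finally` around the same body.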