Corpus API

The Corpus class is the main interface for searching the IF Craft Corpus.

Basic Usage

from ifcraftcorpus import Corpus

corpus = Corpus()
results = corpus.search("dialogue subtext")

Class Reference

Corpus

class Corpus:
    def __init__(
        self,
        *,
        corpus_dir: Path | None = None,
        index_path: Path | None = None,
        embeddings_path: Path | None = None,
        use_bundled: bool = True,
    ) -> None

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| corpus_dir | Path \| None | None | Path to the corpus Markdown files. If None, the bundled corpus is used (see use_bundled). |
| index_path | Path \| None | None | Path to a pre-built SQLite index. If None, the index is built in memory. |
| embeddings_path | Path \| None | None | Path to pre-built embeddings. If None, semantic search is disabled. |
| use_bundled | bool | True | Use the bundled corpus files when corpus_dir is None. |

Methods

search

def search(
    self,
    query: str,
    *,
    cluster: str | None = None,
    limit: int = 10,
    mode: Literal["keyword", "semantic", "hybrid"] = "keyword",
) -> list[CorpusResult]

Search the corpus for matching content.

Parameters:

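| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| query | str | required | Search query text |
| cluster | str \| None | None | Restrict results to a specific cluster |
| limit | int | 10 | Maximum number of results to return |
| mode | Literal["keyword", "semantic", "hybrid"] | "keyword" | Search mode: keyword, semantic, or hybrid |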

Returns: List of CorpusResult objects.

Example:

# Basic keyword search
results = corpus.search("character voice")

# Filter by cluster
results = corpus.search("tension", cluster="emotional-design", limit=5)

# Semantic search (requires embeddings)
results = corpus.search("scary atmosphere", mode="semantic")

get_document

def get_document(self, name: str) -> dict | None

Retrieve a specific document by name.

Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| name | str | Document name (filename stem without .md) |

Returns: Document dict with keys: name, path, title, summary, cluster, topics, sections. Returns None if not found.

Example:

doc = corpus.get_document("dialogue_craft")
if doc:
    print(f"Title: {doc['title']}")
    for section in doc['sections']:
        print(f"  - {section['heading']}")

list_documents

def list_documents(self) -> list[dict[str, str]]

List all documents in the corpus.

Returns: List of dicts with keys: name, title, cluster, topics.
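Because each entry carries its cluster, the listing can be turned into a simple table of contents. The helper below is a minimal sketch: `group_by_cluster` is a hypothetical name, not part of the library, and it relies only on the `name` and `cluster` keys described above.

```python
from collections import defaultdict


def group_by_cluster(documents: list[dict]) -> dict[str, list[str]]:
    """Map each cluster name to the sorted names of its documents.

    `documents` is expected to look like the output of list_documents():
    a list of dicts that each contain at least "name" and "cluster".
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for doc in documents:
        grouped[doc["cluster"]].append(doc["name"])
    # Return a plain dict with clusters and document names in sorted order.
    return {cluster: sorted(names) for cluster, names in sorted(grouped.items())}
```

With a real corpus this would be called as `group_by_cluster(corpus.list_documents())`.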

list_clusters

def list_clusters(self) -> list[str]

Get all available cluster names.

Returns: Sorted list of cluster name strings.
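The cluster names can be fed back into `search` through its `cluster` filter. The sketch below collects one hit per cluster for a query. `top_hit_per_cluster` is a hypothetical helper, and it assumes `search` returns results ordered best first, which this page does not state explicitly.

```python
from typing import Any


def top_hit_per_cluster(corpus: Any, query: str) -> dict[str, Any]:
    """Return the first search result for each cluster that has a match.

    `corpus` is any object exposing list_clusters() and
    search(query, cluster=..., limit=...) as documented above.
    Assumption: search() returns the most relevant result first.
    """
    best: dict[str, Any] = {}
    for cluster in corpus.list_clusters():
        hits = corpus.search(query, cluster=cluster, limit=1)
        if hits:  # clusters without a match are left out
            best[cluster] = hits[0]
    return best
```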

document_count

def document_count(self) -> int

Get total number of documents in the corpus.

has_semantic_search

@property
def has_semantic_search(self) -> bool

Check if semantic search is available (embeddings loaded).
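This property is useful for choosing a search mode at runtime. The following sketch falls back to keyword search when no embeddings are loaded. `search_best_available` is a hypothetical helper name; only `has_semantic_search` and the `mode` and `limit` arguments of `search` come from this reference.

```python
from typing import Any


def search_best_available(corpus: Any, query: str, limit: int = 10) -> list[Any]:
    """Use semantic search when embeddings are loaded, keyword search otherwise."""
    mode = "semantic" if corpus.has_semantic_search else "keyword"
    return corpus.search(query, limit=limit, mode=mode)
```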

close

def close(self) -> None

Close database connections and release resources.
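Since `close` is what releases resources, a helper can make sure it runs even when a search raises. The sketch below uses the standard library's `contextlib.closing`, which calls `close()` on exit. `search_once` and its `make_corpus` parameter are hypothetical names chosen for illustration.

```python
from contextlib import closing
from typing import Any, Callable


def search_once(make_corpus: Callable[[], Any], query: str) -> list[Any]:
    """Open a corpus, run a single search, and always close the corpus.

    `make_corpus` is any zero-argument callable returning an object with
    search() and close() methods, for example the Corpus class itself.
    """
    with closing(make_corpus()) as corpus:
        return corpus.search(query)
```

With the real class this would be called as `search_once(Corpus, "pacing")`.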

CorpusResult

Search results are returned as CorpusResult dataclass instances.

@dataclass
class CorpusResult:
    document_name: str      # Name of source document
    title: str              # Document title
    cluster: str            # Topic cluster
    section_heading: str | None  # Section heading (if from section)
    content: str            # Matched content
    score: float            # Relevance score
    topics: list[str]       # Document topics
    search_type: Literal["keyword", "semantic"]  # Search type that produced the result

Properties

source

@property
def source(self) -> str

Human-readable source reference, e.g., "dialogue_craft > Subtext Techniques".
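For display, `source` combines naturally with `score` and `content`. The formatter below is a minimal sketch. `format_result` and its `max_chars` parameter are hypothetical, and the function only reads the three attributes documented above.

```python
from typing import Any


def format_result(result: Any, max_chars: int = 80) -> str:
    """Render one search result as a single line: score, source, snippet."""
    # Collapse newlines and repeated spaces so the snippet stays on one line.
    snippet = " ".join(result.content.split())
    if len(snippet) > max_chars:
        snippet = snippet[: max_chars - 3] + "..."
    return f"{result.score:.3f}  {result.source}  {snippet}"
```

A typical loop would be `for r in corpus.search("pacing"): print(format_result(r))`.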

Context Manager

The Corpus class supports the context manager protocol for automatic cleanup:

with Corpus() as corpus:
    results = corpus.search("pacing")
    # ... process results
# Resources automatically released