Corpus API
The Corpus class is the main interface for searching the IF Craft Corpus.
Basic Usage
Class Reference
Corpus
class Corpus:
    def __init__(
        self,
        *,
        corpus_dir: Path | None = None,
        index_path: Path | None = None,
        embeddings_path: Path | None = None,
        use_bundled: bool = True,
    ) -> None
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `corpus_dir` | `Path \| None` | `None` | Path to corpus markdown files. Uses bundled corpus if None. |
| `index_path` | `Path \| None` | `None` | Path to pre-built SQLite index. Builds in-memory if None. |
| `embeddings_path` | `Path \| None` | `None` | Path to pre-built embeddings. Disables semantic search if None. |
| `use_bundled` | `bool` | `True` | Use bundled corpus files when corpus_dir is None. |
Methods
search
def search(
    self,
    query: str,
    *,
    cluster: str | None = None,
    limit: int = 10,
    mode: Literal["keyword", "semantic", "hybrid"] = "keyword",
) -> list[CorpusResult]
Search the corpus for matching content.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | `str` | required | Search query text |
| `cluster` | `str \| None` | `None` | Restrict results to a specific cluster |
| `limit` | `int` | `10` | Maximum number of results to return |
| `mode` | `Literal["keyword", "semantic", "hybrid"]` | `"keyword"` | Search mode: keyword, semantic, or hybrid. Semantic search requires embeddings. |
Returns: List of CorpusResult objects.
Example:
# Basic keyword search
results = corpus.search("character voice")
# Filter by cluster
results = corpus.search("tension", cluster="emotional-design", limit=5)
# Semantic search (requires embeddings)
results = corpus.search("scary atmosphere", mode="semantic")
get_document
Retrieve a specific document by name.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `name` | `str` | Document name (filename stem without `.md`) |
Returns: Document dict with keys: name, path, title, summary, cluster, topics, sections. Returns None if not found.
Example:
doc = corpus.get_document("dialogue_craft")
if doc:
    print(f"Title: {doc['title']}")
    for section in doc['sections']:
        print(f" - {section['heading']}")
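Since `get_document` returns None for an unknown name, it can help to wrap the rendering in a small function that handles both cases. This is a sketch under assumptions: `format_outline` is a hypothetical helper, and the sample dict only mirrors the documented keys with made-up values (the cluster name and second heading are placeholders).

```python
def format_outline(doc):
    """Render a document dict as a title line plus one line per section."""
    if doc is None:
        return "(document not found)"
    lines = [f"{doc['title']} [{doc['cluster']}]"]
    for section in doc["sections"]:
        lines.append(f"  - {section['heading']}")
    return "\n".join(lines)


# Hypothetical sample shaped like the documented return value.
sample_doc = {
    "name": "dialogue_craft",
    "title": "Dialogue Craft",
    "cluster": "example-cluster",
    "sections": [
        {"heading": "Subtext Techniques"},
        {"heading": "Second Section"},
    ],
}

print(format_outline(sample_doc))
print(format_outline(None))
```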
list_documents
List all documents in the corpus.
Returns: List of dicts with keys: name, title, cluster, topics.
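The list returned by `list_documents` can be regrouped for display, for example by cluster. The sketch below assumes only the documented keys; `group_by_cluster` is a hypothetical helper and the sample entries are invented for illustration.

```python
from collections import defaultdict


def group_by_cluster(documents):
    """Map each cluster name to the sorted names of its documents."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc["cluster"]].append(doc["name"])
    # Sort clusters and names so the output order is deterministic.
    return {cluster: sorted(names) for cluster, names in sorted(grouped.items())}


# Hypothetical entries shaped like the documented list_documents() output.
sample_documents = [
    {"name": "dialogue_craft", "title": "Dialogue Craft", "cluster": "example-cluster", "topics": []},
    {"name": "doc_b", "title": "Doc B", "cluster": "emotional-design", "topics": []},
    {"name": "doc_a", "title": "Doc A", "cluster": "emotional-design", "topics": []},
]

grouped = group_by_cluster(sample_documents)
for cluster, names in grouped.items():
    print(f"{cluster}: {', '.join(names)}")
```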
list_clusters
Get all available cluster names.
Returns: Sorted list of cluster name strings.
document_count
Get total number of documents in the corpus.
has_semantic_search
Check if semantic search is available (embeddings loaded).
close
Close database connections and release resources.
CorpusResult
Search results are returned as CorpusResult dataclass instances.
@dataclass
class CorpusResult:
    document_name: str            # Name of source document
    title: str                    # Document title
    cluster: str                  # Topic cluster
    section_heading: str | None   # Section heading (if from section)
    content: str                  # Matched content
    score: float                  # Relevance score
    topics: list[str]             # Document topics
    search_type: Literal["keyword", "semantic"]
Properties
source
Human-readable source reference, e.g., "dialogue_craft > Subtext Techniques".
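One way such a property could be derived from the dataclass fields is sketched below. This is an illustrative re-implementation, not the library's code: `ResultSketch` is a hypothetical stand-in, and the fallback to the bare document name when `section_heading` is None is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ResultSketch:
    """Hypothetical stand-in with only the fields that `source` needs."""

    document_name: str
    section_heading: str | None = None

    @property
    def source(self) -> str:
        # Assumption: results without a section show only the document name.
        if self.section_heading:
            return f"{self.document_name} > {self.section_heading}"
        return self.document_name


print(ResultSketch("dialogue_craft", "Subtext Techniques").source)
print(ResultSketch("dialogue_craft").source)
```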
Context Manager
The Corpus class supports the context manager protocol, so database connections and other resources are released automatically when the with block exits.
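The pattern can be illustrated with a stand-in class. This is a sketch of how a context manager is typically wired to a `close` method, not the library's implementation: `ClosableSketch` is hypothetical, and returning `self` from `__enter__` is an assumption about the real class.

```python
class ClosableSketch:
    """Hypothetical stand-in for an object that owns a resource."""

    def __init__(self):
        self.closed = False

    def close(self):
        # Release resources; safe to call more than once.
        self.closed = True

    def __enter__(self):
        # Assumption: the object itself is bound by the `as` clause.
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not suppress exceptions raised in the block


with ClosableSketch() as resource:
    print("inside block, closed =", resource.closed)

print("after block, closed =", resource.closed)
```

With the real class, the call site would presumably read `with Corpus() as corpus:` followed by calls such as `corpus.search(...)`, with `close` invoked on exit.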