Index API

The index module provides SQLite FTS5 full-text search capabilities.

Classes

CorpusIndex

SQLite FTS5 index for fast keyword-based search.

class CorpusIndex:
    def __init__(self, db_path: Path | str = ":memory:") -> None

Parameters:

Parameter	Type	Default	Description
`db_path`	`Path | str`	`":memory:"`	Path to the SQLite database file, or `":memory:"` for an in-memory database
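As a rough illustration of how a `db_path` of either type might reach SQLite, the standard `sqlite3` module accepts the string form of a path as well as the special name `":memory:"`. The helper `open_connection` below is hypothetical and is not part of the library; it only sketches the idea under that assumption.

```python
from __future__ import annotations

import sqlite3
from pathlib import Path


def open_connection(db_path: Path | str = ":memory:") -> sqlite3.Connection:
    """Hypothetical helper: open SQLite from a Path, a str, or ':memory:'."""
    # str() leaves ":memory:" unchanged and converts a Path to its string form.
    return sqlite3.connect(str(db_path))


# An in-memory database exists only for the lifetime of the connection.
conn = open_connection()
one = conn.execute("SELECT 1").fetchone()
conn.close()
```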

Methods

build_from_directory

def build_from_directory(self, corpus_dir: Path) -> int

Build the index from a corpus directory.

Returns: Number of documents indexed.

Example:

from pathlib import Path
from ifcraftcorpus.index import CorpusIndex

index = CorpusIndex("corpus.db")
count = index.build_from_directory(Path("corpus"))
print(f"Indexed {count} documents")

add_document

def add_document(self, doc: Document) -> int

Add a single parsed document to the index.

Returns: The document ID in the database.

search

def search(
    self,
    query: str,
    *,
    cluster: str | None = None,
    limit: int = 10,
) -> list[SearchResult]

Search the index using FTS5 query syntax.

Parameters:

Parameter	Type	Default	Description
`query`	`str`	required	FTS5 search query
`cluster`	`str | None`	`None`	Restrict results to a specific cluster
`limit`	`int`	`10`	Maximum results

Returns: List of SearchResult objects ranked by BM25 score.

FTS5 Query Examples:

# Simple keyword search
results = index.search("dialogue")

# Phrase search
results = index.search('"character voice"')

# Boolean operators
results = index.search("tension OR suspense")
results = index.search("horror NOT comedy")

# Prefix search
results = index.search("narrat*")

# Column-specific search
results = index.search("title:craft")
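To see what these query forms do at the SQLite level, the following sketch runs them against a small stand-in FTS5 table using only the standard `sqlite3` module. The table name `demo_fts`, its columns, the sample rows, and the standalone `search` function are all assumptions made for illustration; they are not the library's actual schema or implementation. It also assumes the Python build ships SQLite with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in table (assumption): cluster is stored but excluded from the text index.
conn.execute(
    "CREATE VIRTUAL TABLE demo_fts USING fts5(title, content, cluster UNINDEXED)"
)
conn.executemany(
    "INSERT INTO demo_fts VALUES (?, ?, ?)",
    [
        ("Dialogue craft", "Writing dialogue that reveals character voice", "prose"),
        ("Narrative pacing", "Building tension and suspense across scenes", "structure"),
        ("Horror tone", "Horror relies on dread, not comedy", "genre"),
    ],
)


def search(query, *, cluster=None, limit=10):
    """Hypothetical search: FTS5 MATCH, optional cluster filter, BM25 order."""
    sql = "SELECT title, rank FROM demo_fts WHERE demo_fts MATCH ?"
    params = [query]
    if cluster is not None:
        sql += " AND cluster = ?"
        params.append(cluster)
    # FTS5's rank column defaults to bm25(); lower values are better matches,
    # so ascending order returns the best match first.
    sql += " ORDER BY rank LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


titles = [title for title, _ in search("tension OR dialogue", cluster="prose")]
```

Here the `cluster` keyword narrows an `OR` query that would otherwise match two rows down to the single row in the `prose` cluster.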

get_document

def get_document(self, name: str) -> dict | None

Retrieve a document by name, together with all of its sections.

list_documents

def list_documents(self) -> list[dict[str, str]]

List all indexed documents.

list_clusters

def list_clusters(self) -> list[str]

List all clusters in the index.

document_count

def document_count(self) -> int

Get total document count.

close

def close(self) -> None

Close the database connection.

SearchResult

@dataclass
class SearchResult:
    document_name: str
    title: str
    cluster: str
    section_heading: str | None
    content: str
    score: float
    topics: list[str]

Properties

source

@property
def source(self) -> str

Human-readable source reference.

Functions

build_index

def build_index(corpus_dir: Path, output_path: Path) -> CorpusIndex

Build and save a corpus index to a file.

Parameters:

Parameter	Type	Description
`corpus_dir`	`Path`	Path to corpus markdown files
`output_path`	`Path`	Path for output SQLite database

Returns: The built CorpusIndex instance.

Example:

from pathlib import Path
from ifcraftcorpus.index import build_index

index = build_index(
    corpus_dir=Path("corpus"),
    output_path=Path("dist/corpus.db")
)
print(f"Built index with {index.document_count()} documents")

Database Schema

The index uses three tables:

documents

Stores document metadata:

Column	Type	Description
`id`	INTEGER	Primary key
`name`	TEXT	Document name (unique)
`path`	TEXT	Original file path
`title`	TEXT	Document title
`summary`	TEXT	Document summary
`cluster`	TEXT	Topic cluster
`topics`	TEXT	Comma-separated topics
`content_hash`	TEXT	Content hash for change detection
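The `content_hash` column suggests that a document only needs re-indexing when its text changes. A minimal sketch of that idea follows; the choice of SHA-256 and the helper names `content_hash` and `needs_reindex` are assumptions, since the actual hash algorithm is not stated here.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hash document text (SHA-256 is an assumed choice of algorithm)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reindex(stored_hash, text: str) -> bool:
    """Hypothetical check: True when the stored hash is missing or stale."""
    return stored_hash != content_hash(text)


original = "# Dialogue\n\nGive each character a distinct voice."
stored = content_hash(original)

unchanged = needs_reindex(stored, original)        # same text, same hash
changed = needs_reindex(stored, original + "\n")   # any edit alters the hash
```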

sections

Stores document sections:

Column	Type	Description
`id`	INTEGER	Primary key
`document_id`	INTEGER	Foreign key to documents
`heading`	TEXT	Section heading
`level`	INTEGER	Heading level (1-3)
`content`	TEXT	Section content
`line_start`	INTEGER	Line number in source
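The two tables above could be declared roughly as follows. This DDL is a sketch reconstructed from the column lists; the exact constraints (`NOT NULL`, the form of the foreign key) and the sample values are assumptions rather than the library's actual statements.

```python
import sqlite3

# Assumed DDL, derived from the documented columns and types.
SCHEMA = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE,
    path TEXT,
    title TEXT,
    summary TEXT,
    cluster TEXT,
    topics TEXT,
    content_hash TEXT
);
CREATE TABLE sections (
    id INTEGER PRIMARY KEY,
    document_id INTEGER NOT NULL REFERENCES documents(id),
    heading TEXT,
    level INTEGER,
    content TEXT,
    line_start INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Insert one illustrative document and one of its sections.
cur = conn.execute(
    "INSERT INTO documents (name, path, title, summary, cluster, topics, content_hash)"
    " VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("dialogue_craft", "corpus/dialogue_craft.md", "Dialogue Craft",
     "Notes on dialogue", "prose", "dialogue, voice", "abc123"),
)
doc_id = cur.lastrowid
conn.execute(
    "INSERT INTO sections (document_id, heading, level, content, line_start)"
    " VALUES (?, ?, ?, ?, ?)",
    (doc_id, "Voice", 2, "Give each character a distinct voice.", 12),
)

# Join sections back to their parent document.
rows = conn.execute(
    "SELECT d.name, s.heading, s.level FROM sections AS s"
    " JOIN documents AS d ON d.id = s.document_id ORDER BY s.line_start"
).fetchall()

# The comma-separated topics column can be split back into a list.
topics_text = conn.execute("SELECT topics FROM documents").fetchone()[0]
topics = [t.strip() for t in topics_text.split(",")]
```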

corpus_fts

FTS5 virtual table for full-text search with Porter stemming and Unicode support.
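To illustrate what Porter stemming adds, the sketch below creates a stand-in FTS5 table with the `porter unicode61` tokenizer chain, which is one common way to combine stemming with Unicode-aware tokenization in SQLite. The table name `stem_demo`, its single column, and the tokenizer string are assumptions; the real `corpus_fts` definition may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# porter wraps unicode61: tokens are split and case-folded, then stemmed.
conn.execute(
    "CREATE VIRTUAL TABLE stem_demo USING fts5("
    "content, tokenize = 'porter unicode61')"
)
conn.execute(
    "INSERT INTO stem_demo VALUES ('Distinct characters need distinct voices')"
)


def hits(query: str) -> int:
    """Count rows in the stand-in table that match an FTS5 query."""
    return conn.execute(
        "SELECT count(*) FROM stem_demo WHERE stem_demo MATCH ?", (query,)
    ).fetchone()[0]


# The singular query matches the plural form in the text thanks to stemming.
singular_hits = hits("character")
# Matching is also case-insensitive.
upper_hits = hits("CHARACTERS")
```

With stemming in place, a query does not have to match the exact inflection used in the indexed text.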