Index API

The index module provides SQLite FTS5 full-text search capabilities.

Classes

CorpusIndex

SQLite FTS5 index for fast keyword-based search.

class CorpusIndex:
    def __init__(self, db_path: Path | str = ":memory:") -> None

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| db_path | Path \| str | ":memory:" | Path to the SQLite database file, or ":memory:" for an in-memory database |
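
The two `db_path` modes mirror how SQLite itself treats its database argument. The sketch below uses only the standard-library `sqlite3` module and assumes (the documentation above does not confirm this) that `CorpusIndex` passes `db_path` through to `sqlite3.connect`. The `notes` table is purely illustrative.

```python
import sqlite3
import tempfile
from pathlib import Path

# Assumption: CorpusIndex hands db_path to sqlite3.connect more or less unchanged.

# An in-memory database lives only as long as its connection.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE notes (body TEXT)")
mem.close()

# A fresh ":memory:" connection starts empty again.
mem2 = sqlite3.connect(":memory:")
tables_in_new_memory_db = mem2.execute(
    "SELECT count(*) FROM sqlite_master WHERE type = 'table'"
).fetchone()[0]
mem2.close()

# A file-backed database persists across connections.
with tempfile.TemporaryDirectory() as tmp:
    db_file = Path(tmp) / "corpus.db"
    con = sqlite3.connect(str(db_file))  # str() accepts either a Path or a str
    con.execute("CREATE TABLE notes (body TEXT)")
    con.commit()
    con.close()

    con = sqlite3.connect(str(db_file))
    tables_in_file_db = con.execute(
        "SELECT count(*) FROM sqlite_master WHERE type = 'table'"
    ).fetchone()[0]
    con.close()

print(tables_in_new_memory_db, tables_in_file_db)
```

If that assumption holds, an in-memory index suits tests and one-off scripts, while a file path is the better choice when the index should be built once and reused.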

Methods

build_from_directory
def build_from_directory(self, corpus_dir: Path) -> int

Build the index from a corpus directory.

Returns: Number of documents indexed.

Example:

from pathlib import Path
from ifcraftcorpus.index import CorpusIndex

index = CorpusIndex("corpus.db")
count = index.build_from_directory(Path("corpus"))
print(f"Indexed {count} documents")

add_document
def add_document(self, doc: Document) -> int

Add a single parsed document to the index.

Returns: The document ID in the database.

search
def search(
    self,
    query: str,
    *,
    cluster: str | None = None,
    limit: int = 10,
) -> list[SearchResult]

Search the index using FTS5 query syntax.

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| query | str | required | FTS5 search query |
| cluster | str \| None | None | Filter to a specific cluster |
| limit | int | 10 | Maximum number of results |

Returns: List of SearchResult objects ranked by BM25 score.
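
To make the ranking and filtering behaviour concrete, here is a minimal sketch against plain SQLite FTS5 rather than `CorpusIndex` itself. The `demo_fts` table, its columns, and the local `search` helper are hypothetical stand-ins; the real `corpus_fts` layout and query may differ. It also assumes the Python build ships SQLite with FTS5 enabled, which is common but not guaranteed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table layout; "cluster" is stored but not tokenized.
con.execute(
    "CREATE VIRTUAL TABLE demo_fts USING fts5(title, cluster UNINDEXED, content)"
)
con.executemany(
    "INSERT INTO demo_fts VALUES (?, ?, ?)",
    [
        ("Dialogue Craft", "prose", "dialogue dialogue subtext and dialogue tags"),
        ("Pacing", "structure",
         "pacing scenes with a little dialogue among long description of setting and mood"),
        ("Horror Tone", "genre", "dread and atmosphere"),
        ("Worldbuilding", "setting", "maps factions and history"),
        ("Endings", "structure", "resolution and payoff"),
    ],
)

def search(query, cluster=None, limit=10):
    """Illustrative helper: FTS5 MATCH, optional cluster filter, BM25 ordering."""
    sql = "SELECT title, bm25(demo_fts) FROM demo_fts WHERE demo_fts MATCH ?"
    params = [query]
    if cluster is not None:
        sql += " AND cluster = ?"
        params.append(cluster)
    # bm25() returns lower (more negative) values for better matches,
    # so ascending order puts the best hit first.
    sql += " ORDER BY bm25(demo_fts) LIMIT ?"
    params.append(limit)
    return con.execute(sql, params).fetchall()

best_title = search("dialogue")[0][0]
structure_hits = [title for title, _ in search("dialogue", cluster="structure")]
print(best_title, structure_hits)
```

The document that mentions the term most densely ranks first, and the cluster filter narrows the hits without changing how the remaining rows are scored.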

FTS5 Query Examples:

# Simple keyword search
results = index.search("dialogue")

# Phrase search
results = index.search('"character voice"')

# Boolean operators
results = index.search("tension OR suspense")
results = index.search("horror NOT comedy")

# Prefix search
results = index.search("narrat*")

# Column-specific search
results = index.search("title:craft")

get_document
def get_document(self, name: str) -> dict | None

Retrieve a document by name with all sections.

list_documents
def list_documents(self) -> list[dict[str, str]]

List all indexed documents.

list_clusters
def list_clusters(self) -> list[str]

List all clusters in the index.

document_count
def document_count(self) -> int

Get total document count.

close
def close(self) -> None

Close the database connection.

SearchResult

@dataclass
class SearchResult:
    document_name: str
    title: str
    cluster: str
    section_heading: str | None
    content: str
    score: float
    topics: list[str]

Properties

source
@property
def source(self) -> str

Human-readable source reference.

Functions

build_index

def build_index(corpus_dir: Path, output_path: Path) -> CorpusIndex

Build and save a corpus index to a file.

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| corpus_dir | Path | Path to the directory of corpus markdown files |
| output_path | Path | Path for the output SQLite database |

Returns: The built CorpusIndex instance.

Example:

from pathlib import Path
from ifcraftcorpus.index import build_index

index = build_index(
    corpus_dir=Path("corpus"),
    output_path=Path("dist/corpus.db")
)
print(f"Built index with {index.document_count()} documents")

Database Schema

The index uses three tables:

documents

Stores document metadata:

| Column | Type | Description |
| --- | --- | --- |
| id | INTEGER | Primary key |
| name | TEXT | Document name (unique) |
| path | TEXT | Original file path |
| title | TEXT | Document title |
| summary | TEXT | Document summary |
| cluster | TEXT | Topic cluster |
| topics | TEXT | Comma-separated topics |
| content_hash | TEXT | Content hash for change detection |
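
The `content_hash` column suggests a skip-if-unchanged check when rebuilding. The sketch below shows one way such a check could work. The choice of SHA-256 and the helper names `content_hash` and `needs_reindex` are assumptions for illustration, since the documentation does not say which hash the library uses.

```python
import hashlib
from typing import Optional

def content_hash(text: str) -> str:
    """Return a hex SHA-256 digest of the document text (assumed algorithm)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reindex(stored_hash: Optional[str], current_text: str) -> bool:
    """True when the document is new (no stored hash) or its text has changed."""
    return stored_hash != content_hash(current_text)

stored = content_hash("# Dialogue\nKeep lines short.")
print(needs_reindex(stored, "# Dialogue\nKeep lines short."))   # unchanged text
print(needs_reindex(stored, "# Dialogue\nKeep lines shorter."))  # edited text
```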

sections

Stores document sections:

| Column | Type | Description |
| --- | --- | --- |
| id | INTEGER | Primary key |
| document_id | INTEGER | Foreign key to documents |
| heading | TEXT | Section heading |
| level | INTEGER | Heading level (1-3) |
| content | TEXT | Section content |
| line_start | INTEGER | Line number in source |
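
As an illustration of how the two tables relate, the following sketch recreates them from the column lists above and joins a document to its sections. The DDL is inferred from those lists, not copied from the library, so constraints and defaults are assumptions, and the sample rows are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.row_factory = sqlite3.Row  # allow access to columns by name

# Assumed DDL, reconstructed from the documented columns.
con.executescript("""
CREATE TABLE documents (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE,
    path TEXT, title TEXT, summary TEXT,
    cluster TEXT, topics TEXT, content_hash TEXT
);
CREATE TABLE sections (
    id INTEGER PRIMARY KEY,
    document_id INTEGER REFERENCES documents(id),
    heading TEXT, level INTEGER, content TEXT, line_start INTEGER
);
""")

doc_id = con.execute(
    "INSERT INTO documents (name, title, cluster, topics) VALUES (?, ?, ?, ?)",
    ("dialogue_craft", "Dialogue Craft", "prose", "dialogue,subtext"),
).lastrowid

# Inserted out of order on purpose; line_start restores source order.
con.executemany(
    "INSERT INTO sections (document_id, heading, level, content, line_start) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (doc_id, "Subtext", 2, "What goes unsaid.", 40),
        (doc_id, "Overview", 1, "Why dialogue matters.", 1),
    ],
)

rows = con.execute(
    """
    SELECT d.title, d.topics, s.heading, s.level
    FROM documents AS d
    JOIN sections AS s ON s.document_id = d.id
    WHERE d.name = ?
    ORDER BY s.line_start
    """,
    ("dialogue_craft",),
).fetchall()

headings = [row["heading"] for row in rows]
topics = rows[0]["topics"].split(",")  # topics are stored comma-separated
print(headings, topics)
```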

corpus_fts

FTS5 virtual table for full-text search with Porter stemming and Unicode support.
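
The effect of Porter stemming and Unicode-aware tokenization can be seen in a small standalone sketch. It assumes the tokenizer specification is `porter unicode61`, a common way to combine the two in FTS5. The table name `demo_fts` and its columns are hypothetical, and the actual `corpus_fts` definition may differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Assumed tokenizer chain: unicode61 splits and case-folds, porter stems the tokens.
con.execute(
    "CREATE VIRTUAL TABLE demo_fts "
    "USING fts5(title, content, tokenize = 'porter unicode61')"
)
con.executemany(
    "INSERT INTO demo_fts VALUES (?, ?)",
    [
        ("Branching", "Meaningful choices shape the story"),
        ("Voice", "Keep a consistent narrator"),
        ("Caf\u00e9 scenes", "Dialogue over coffee"),  # "Café" with an accented e
    ],
)

def titles(query):
    """Return matching titles, best match first."""
    cur = con.execute(
        "SELECT title FROM demo_fts WHERE demo_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in cur]

stemmed = titles("choice")        # stemming lets "choice" match "choices"
folded = titles("cafe")           # unicode61 strips the diacritic by default
by_column = titles("title:voice")  # column filter, case-insensitive
print(stemmed, folded, by_column)
```

Because the query passes through the same tokenizer as the indexed text, users do not need to guess the exact inflection or accenting used in the corpus.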