Index API¶
The index module provides SQLite FTS5 full-text search capabilities.
Classes¶
CorpusIndex¶
SQLite FTS5 index for fast keyword-based search.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `db_path` | `Path \| str` | `":memory:"` | Path to SQLite database file, or ":memory:" for in-memory |
Methods¶
build_from_directory¶
Build the index from a corpus directory.
Returns: Number of documents indexed.
Example:
from pathlib import Path
from ifcraftcorpus.index import CorpusIndex
index = CorpusIndex("corpus.db")
count = index.build_from_directory(Path("corpus"))
print(f"Indexed {count} documents")
add_document¶
Add a single parsed document to the index.
Returns: The document ID in the database.
search¶
def search(
    self,
    query: str,
    *,
    cluster: str | None = None,
    limit: int = 10,
) -> list[SearchResult]
Search the index using FTS5 query syntax.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | `str` | required | FTS5 search query |
| `cluster` | `str \| None` | `None` | Filter to specific cluster |
| `limit` | `int` | `10` | Maximum results |
Returns: List of SearchResult objects ranked by BM25 score.
FTS5 Query Examples:
# Simple keyword search
results = index.search("dialogue")
# Phrase search
results = index.search('"character voice"')
# Boolean operators
results = index.search("tension OR suspense")
results = index.search("horror NOT comedy")
# Prefix search
results = index.search("narrat*")
# Column-specific search
results = index.search("title:craft")
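To make the ranking and filtering behaviour concrete, here is a minimal, self-contained sketch that runs similar queries directly against SQLite with the standard `sqlite3` module. It is not the library's implementation: the `demo_fts` table, its columns, the sample rows, and the local `search` helper are all assumptions made for illustration, and it requires an SQLite build with FTS5 enabled.

```python
import sqlite3

# Stand-in table for illustration only; the real corpus_fts columns may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE demo_fts USING fts5(title, cluster UNINDEXED, content)"
)
conn.executemany(
    "INSERT INTO demo_fts (title, cluster, content) VALUES (?, ?, ?)",
    [
        ("Dialogue Craft", "prose", "Writing dialogue that reveals character voice."),
        ("Pacing", "structure", "Tension and suspense depend on pacing."),
        ("Horror Tone", "genre", "Building dread and tension in horror scenes."),
    ],
)


def search(query, *, cluster=None, limit=10):
    """Hypothetical helper: run an FTS5 MATCH query, best match first."""
    sql = (
        "SELECT title, cluster, bm25(demo_fts) AS score "
        "FROM demo_fts WHERE demo_fts MATCH ?"
    )
    params = [query]
    if cluster is not None:
        # Optional exact-match filter on the cluster column.
        sql += " AND cluster = ?"
        params.append(cluster)
    # bm25() returns lower values for better matches, so sort ascending.
    sql += " ORDER BY score LIMIT ?"
    params.append(limit)
    return conn.execute(sql, params).fetchall()


both = search("tension OR suspense")             # boolean OR
genre_only = search("tension", cluster="genre")  # cluster filter
by_title = search("title:craft")                 # column-specific match

for title, cluster, score in both:
    print(f"{score:.6f}  [{cluster}] {title}")
```

Note that SQLite's `bm25()` reports better matches as smaller (more negative) numbers; whether `SearchResult.score` keeps that sign convention is not stated here.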
get_document¶
Retrieve a document by name, including all of its sections.
list_documents¶
List all indexed documents.
list_clusters¶
List all clusters in the index.
document_count¶
Get total document count.
close¶
Close the database connection.
SearchResult¶
@dataclass
class SearchResult:
    document_name: str
    title: str
    cluster: str
    section_heading: str | None
    content: str
    score: float
    topics: list[str]
Properties¶
source¶
Human-readable source reference.
Functions¶
build_index¶
Build and save a corpus index to a file.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `corpus_dir` | `Path` | Path to corpus markdown files |
| `output_path` | `Path` | Path for output SQLite database |
Returns: The built CorpusIndex instance.
Example:
from pathlib import Path
from ifcraftcorpus.index import build_index
index = build_index(
    corpus_dir=Path("corpus"),
    output_path=Path("dist/corpus.db"),
)
print(f"Built index with {index.document_count()} documents")
Database Schema¶
The index uses three tables:
documents¶
Stores document metadata:
| Column | Type | Description |
|---|---|---|
| `id` | INTEGER | Primary key |
| `name` | TEXT | Document name (unique) |
| `path` | TEXT | Original file path |
| `title` | TEXT | Document title |
| `summary` | TEXT | Document summary |
| `cluster` | TEXT | Topic cluster |
| `topics` | TEXT | Comma-separated topics |
| `content_hash` | TEXT | Content hash for change detection |
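The `topics` and `content_hash` columns both store derived values, so a short sketch may help show how such values could be produced and compared. The helper names below are hypothetical, and the choice of SHA-256 is an assumption: the hash algorithm actually used by the index is not stated here.

```python
import hashlib


def content_hash(text: str) -> str:
    """Hex digest of the document text (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def topics_to_column(topics: list[str]) -> str:
    """Join a topic list into the comma-separated form stored as TEXT."""
    return ",".join(topic.strip() for topic in topics)


def column_to_topics(value: str) -> list[str]:
    """Split the stored TEXT value back into a list, skipping empty entries."""
    return [topic.strip() for topic in value.split(",") if topic.strip()]


# Change detection: re-hash the current text and compare with the stored hash.
stored_hash = content_hash("# Dialogue\nOriginal text.")
unchanged = content_hash("# Dialogue\nOriginal text.") == stored_hash
changed = content_hash("# Dialogue\nEdited text.") != stored_hash

print(unchanged, changed)
print(column_to_topics(topics_to_column(["dialogue", "voice"])))
```

A comparison like this lets a rebuild skip documents whose text has not changed since the last indexing run.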
sections¶
Stores document sections:
| Column | Type | Description |
|---|---|---|
| `id` | INTEGER | Primary key |
| `document_id` | INTEGER | Foreign key to documents |
| `heading` | TEXT | Section heading |
| `level` | INTEGER | Heading level (1-3) |
| `content` | TEXT | Section content |
| `line_start` | INTEGER | Line number in source |
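The two tables above can be approximated with the following sketch, which creates them in an in-memory database and joins sections back to their document. Column names and types follow the tables; the `NOT NULL` and `UNIQUE` constraints, the `REFERENCES` clause, and the sample rows are assumptions rather than the library's actual DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript(
    """
    CREATE TABLE documents (
        id INTEGER PRIMARY KEY,
        name TEXT UNIQUE NOT NULL,
        path TEXT,
        title TEXT,
        summary TEXT,
        cluster TEXT,
        topics TEXT,
        content_hash TEXT
    );
    CREATE TABLE sections (
        id INTEGER PRIMARY KEY,
        document_id INTEGER NOT NULL REFERENCES documents(id),
        heading TEXT,
        level INTEGER,
        content TEXT,
        line_start INTEGER
    );
    """
)

# Hypothetical sample data.
cur = conn.execute(
    "INSERT INTO documents (name, path, title, cluster, topics) VALUES (?, ?, ?, ?, ?)",
    ("dialogue_craft", "corpus/dialogue_craft.md", "Dialogue Craft", "prose", "dialogue,voice"),
)
doc_id = cur.lastrowid
conn.executemany(
    "INSERT INTO sections (document_id, heading, level, content, line_start) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (doc_id, "Overview", 1, "Why dialogue matters.", 1),
        (doc_id, "Subtext", 2, "Saying less.", 12),
    ],
)

# Fetch a document's sections in source order.
rows = conn.execute(
    "SELECT d.name, s.heading, s.level FROM sections AS s "
    "JOIN documents AS d ON d.id = s.document_id "
    "WHERE d.name = ? ORDER BY s.line_start",
    ("dialogue_craft",),
).fetchall()
print(rows)
```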
corpus_fts¶
FTS5 virtual table for full-text search with Porter stemming and Unicode support.
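As a rough illustration of what Porter stemming and Unicode-aware tokenization do to queries, the sketch below builds a throwaway FTS5 table with the `porter unicode61` tokenizer. The table name, sample text, and exact tokenizer string are assumptions; the options used by `corpus_fts` itself are not listed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Porter stemming layered on top of the Unicode-aware unicode61 tokenizer.
conn.execute(
    "CREATE VIRTUAL TABLE stem_demo USING fts5(content, tokenize = 'porter unicode61')"
)
conn.executemany(
    "INSERT INTO stem_demo (content) VALUES (?)",
    [
        ("Branching narratives reward replay.",),
        ("A café scene with quiet dialogue.",),
    ],
)


def match(query):
    """Return the content of every row matching an FTS5 query."""
    sql = "SELECT content FROM stem_demo WHERE stem_demo MATCH ? ORDER BY rowid"
    return [row[0] for row in conn.execute(sql, (query,))]


stemmed = match("branches")  # stems to the same root as "Branching"
folded = match("cafe")       # unicode61 folds case and strips the accent from "café"

print(stemmed)
print(folded)
```

Because stemming is applied to both indexed text and queries, different inflections of a word match each other without needing a prefix query such as `branch*`.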