Embeddings API¶
The embeddings module provides semantic search capabilities using multiple embedding providers.
Optional Module

This module is optional and requires either the embeddings-api or embeddings extra.
Overview¶
Semantic search finds conceptually similar content even when exact keywords don't match. It uses dense vector embeddings to represent text meaning.
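For intuition, here is a toy sketch of how dense-vector similarity behaves. It is illustrative only: the embedding values are made up and the `cosine` helper is not part of this library. Related texts map to nearby vectors, so their cosine similarity is high even with zero keyword overlap.

```python
from math import sqrt

# Toy illustration -- not the library's API.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

query = [0.9, 0.1, 0.0]      # pretend embedding of "creating tension"
related = [0.8, 0.2, 0.1]    # pretend embedding of "building suspense"
unrelated = [0.0, 0.1, 0.9]  # pretend embedding of "breakfast recipes"

# The related text scores far higher despite sharing no words with the query.
assert cosine(query, related) > cosine(query, unrelated)
```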
Embedding Providers¶
The library supports three embedding providers. Use get_embedding_provider() for auto-detection or instantiate directly:
```python
from ifcraftcorpus.providers import (
    get_embedding_provider,
    OllamaEmbeddings,
    OpenAIEmbeddings,
    SentenceTransformersEmbeddings,
)

# Auto-detect the best available provider
provider = get_embedding_provider()

# Or instantiate a specific provider
provider = OllamaEmbeddings()                # requires a running Ollama server
provider = OpenAIEmbeddings()                # requires OPENAI_API_KEY
provider = SentenceTransformersEmbeddings()  # local, no API needed
```
Provider Configuration¶
| Provider | Environment Variable | Default Model |
|---|---|---|
| Ollama | `OLLAMA_HOST` (default: `http://localhost:11434`) | `nomic-embed-text` |
| OpenAI | `OPENAI_API_KEY` | `text-embedding-3-small` |
| SentenceTransformers | (none) | `all-MiniLM-L6-v2` |
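The selection order used by `get_embedding_provider()` is not documented here; a plausible sketch (the order and the `pick_provider` helper are assumptions, not the library's actual logic) would check each provider's prerequisite from the table and fall back to the local provider:

```python
import os

# Hypothetical fallback order -- the real get_embedding_provider() may differ.
def pick_provider() -> str:
    if os.environ.get("OLLAMA_HOST"):     # Ollama server configured
        return "ollama"
    if os.environ.get("OPENAI_API_KEY"):  # OpenAI credentials present
        return "openai"
    return "sentence-transformers"        # local fallback, no setup needed
```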
Using with Corpus¶
```python
from pathlib import Path

from ifcraftcorpus import Corpus
from ifcraftcorpus.providers import OllamaEmbeddings

corpus = Corpus(
    embeddings_path=Path("embeddings/"),
    embedding_provider=OllamaEmbeddings(),
)

# Build embeddings (if not already built)
corpus.build_embeddings()

# Use semantic search
results = corpus.search("creating tension", mode="semantic")
```
You can also build and query an `EmbeddingIndex` directly:

```python
from pathlib import Path

from ifcraftcorpus.embeddings import EmbeddingIndex, build_embeddings_from_index

# Build from an existing index (corpus_index is a populated CorpusIndex)
embeddings = build_embeddings_from_index(corpus_index)
embeddings.save(Path("embeddings/"))

# Load and search
embeddings = EmbeddingIndex.load(Path("embeddings/"))
results = embeddings.search("creating tension in scenes")
```
Classes¶
EmbeddingIndex¶
Vector embedding index for semantic search.
```python
class EmbeddingIndex:
    def __init__(
        self,
        model_name: str = "all-MiniLM-L6-v2",
        *,
        lazy_load: bool = True,
    ) -> None
```
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | `"all-MiniLM-L6-v2"` | Sentence-transformers model name |
| `lazy_load` | `bool` | `True` | If `True`, load the model on first use |
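The `lazy_load` flag defers the (slow) model load until it is actually needed. A minimal sketch of that pattern, with a string standing in for the real model object (illustrative only; this is not the library's implementation):

```python
class LazyModel:
    """Sketch of lazy loading -- not the library's actual class."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2") -> None:
        self.model_name = model_name
        self._model = None  # nothing loaded at construction time

    @property
    def model(self) -> str:
        if self._model is None:
            # Stand-in for the expensive SentenceTransformer(model_name) call
            self._model = f"loaded:{self.model_name}"
        return self._model

lm = LazyModel()
assert lm._model is None  # constructing is cheap
_ = lm.model              # first access triggers the load
assert lm._model is not None
```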
Methods¶
add_texts¶
Add texts with metadata to the index.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `texts` | `list[str]` | Text strings to embed |
| `metadata` | `list[dict]` | Metadata dicts (same length as `texts`) |
Example:
```python
index = EmbeddingIndex()
index.add_texts(
    ["Text about dialogue", "Text about pacing"],
    [
        {"document_name": "dialogue", "title": "Dialogue Craft"},
        {"document_name": "pacing", "title": "Scene Pacing"},
    ],
)
```
search¶
```python
def search(
    self,
    query: str,
    *,
    top_k: int = 10,
    cluster: str | None = None,
) -> list[tuple[dict, float]]
```
Search for semantically similar texts.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | `str` | required | Search query text |
| `top_k` | `int` | `10` | Maximum results to return |
| `cluster` | `str \| None` | `None` | Optional cluster name to filter results. Only returns matches where `metadata["cluster"]` equals this value. |
Returns: List of (metadata, similarity_score) tuples. Scores range from 0 to 1.
Example:
```python
results = index.search("building suspense in scenes")
for metadata, score in results:
    print(f"{metadata['title']}: {score:.3f}")
```
save¶
Save the index to disk.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `path` | `Path` | Directory to save index files |

Files Created:

- `embeddings.npy`: NumPy array of vectors
- `metadata.json`: Model name and metadata
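A sketch of what that on-disk layout might look like. The exact file schema is an assumption (the doc only names the two files); this version stores the vectors with `numpy.save` and the metadata as JSON:

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical layout -- the library's real schema may differ.
out = Path(tempfile.mkdtemp())

vectors = np.zeros((3, 384), dtype=np.float32)  # 3 texts, 384-dim model
np.save(out / "embeddings.npy", vectors)

(out / "metadata.json").write_text(json.dumps({
    "model_name": "all-MiniLM-L6-v2",
    "items": [{"document_name": f"doc{i}"} for i in range(3)],
}))

# Round-trip the vectors to confirm the layout loads back
assert np.load(out / "embeddings.npy").shape == (3, 384)
```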
load (classmethod)¶
Load an index from disk.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `path` | `Path` | Directory containing saved index |
Returns: Loaded EmbeddingIndex instance.
__len__¶
Get the number of indexed items.
Functions¶
build_embeddings_from_index¶
```python
def build_embeddings_from_index(
    corpus_index: CorpusIndex,
    model_name: str = "all-MiniLM-L6-v2",
) -> EmbeddingIndex
```
Build an embedding index from an existing CorpusIndex.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `corpus_index` | `CorpusIndex` | required | Populated corpus index |
| `model_name` | `str` | `"all-MiniLM-L6-v2"` | Model name |
Returns: New EmbeddingIndex with all corpus content.
Example:
```python
from pathlib import Path

from ifcraftcorpus.index import CorpusIndex
from ifcraftcorpus.embeddings import build_embeddings_from_index

# Build the FTS index
index = CorpusIndex()
index.build_from_directory(Path("corpus"))

# Generate embeddings
embeddings = build_embeddings_from_index(index)
embeddings.save(Path("embeddings/"))
print(f"Created {len(embeddings)} embeddings")
```
Constants¶
DEFAULT_MODEL¶
The default sentence-transformers model, `"all-MiniLM-L6-v2"`. This is a small, fast model that works well for semantic similarity tasks.
Integration with Corpus¶
To use semantic search with the main Corpus API:
```python
from pathlib import Path

from ifcraftcorpus import Corpus

# Provide the path to saved embeddings
corpus = Corpus(embeddings_path=Path("embeddings/"))

# Semantic search mode
results = corpus.search("scary atmosphere", mode="semantic")

# Or hybrid mode (combines keyword + semantic)
results = corpus.search("horror techniques", mode="hybrid")

# Check whether semantic search is available
if corpus.has_semantic_search:
    print("Semantic search enabled")
```
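The doc doesn't specify how `hybrid` mode combines the two signals. One common approach (an assumption for intuition, not necessarily what `Corpus.search` does) is a weighted blend of normalized keyword and semantic scores:

```python
# Hypothetical blending -- shown for intuition only.
def hybrid_score(keyword: float, semantic: float, alpha: float = 0.5) -> float:
    """Blend a normalized keyword score and a semantic score, both in [0, 1]."""
    return alpha * keyword + (1 - alpha) * semantic

# A strong semantic match can outrank a middling exact-keyword match.
assert hybrid_score(0.2, 0.95) > hybrid_score(0.6, 0.3)
```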