
Embeddings API

The embeddings module provides semantic search capabilities using multiple embedding providers.

Optional Module

This module requires either the embeddings-api or embeddings extra:

# Lightweight (Ollama/OpenAI via API)
pip install ifcraftcorpus[embeddings-api]

# Heavyweight (local sentence-transformers)
pip install ifcraftcorpus[embeddings]

Overview

Semantic search finds conceptually similar content even when exact keywords don't match. It uses dense vector embeddings to represent text meaning.
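To illustrate the underlying idea (not this library's internals), dense embeddings are typically compared with cosine similarity: texts with similar meaning map to vectors pointing in similar directions. A minimal sketch with made-up 3-dimensional vectors:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": conceptually close texts get nearby vectors
tension = [0.9, 0.1, 0.2]
suspense = [0.85, 0.15, 0.25]  # close in meaning to "tension"
recipe = [0.05, 0.9, 0.1]      # unrelated topic

print(cosine_similarity(tension, suspense))  # high, near 1.0
print(cosine_similarity(tension, recipe))    # much lower
```

Real embedding models produce vectors with hundreds of dimensions, but the comparison works the same way.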

Embedding Providers

The library supports three embedding providers. Use get_embedding_provider() for auto-detection, or instantiate one directly:

from ifcraftcorpus.providers import (
    get_embedding_provider,
    OllamaEmbeddings,
    OpenAIEmbeddings,
    SentenceTransformersEmbeddings,
)

# Auto-detect best available provider
provider = get_embedding_provider()

# Or use specific providers
provider = OllamaEmbeddings()  # Requires running Ollama server
provider = OpenAIEmbeddings()  # Requires OPENAI_API_KEY
provider = SentenceTransformersEmbeddings()  # Local, no API needed

Provider Configuration

| Provider | Environment Variable | Default Model |
|---|---|---|
| Ollama | OLLAMA_HOST (default: http://localhost:11434) | nomic-embed-text |
| OpenAI | OPENAI_API_KEY | text-embedding-3-small |
| SentenceTransformers | (none) | all-MiniLM-L6-v2 |
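The exact detection order used by get_embedding_provider() is not documented here, but environment-based selection along the lines of the table above might look like the following sketch (the function and its priority order are assumptions for illustration, not the library's actual logic):

```python
def pick_provider(env: dict[str, str]) -> str:
    """Hypothetical selection: explicit API key first, then a configured
    Ollama host, then the local fallback that needs no configuration."""
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("OLLAMA_HOST"):
        return "ollama"
    return "sentence-transformers"

print(pick_provider({"OPENAI_API_KEY": "sk-..."}))             # openai
print(pick_provider({"OLLAMA_HOST": "http://localhost:11434"}))  # ollama
print(pick_provider({}))                                       # sentence-transformers
```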

Using with Corpus

from pathlib import Path
from ifcraftcorpus import Corpus
from ifcraftcorpus.providers import OllamaEmbeddings

corpus = Corpus(
    embeddings_path=Path("embeddings/"),
    embedding_provider=OllamaEmbeddings()
)

# Build embeddings (if not already built)
corpus.build_embeddings()

# Use semantic search
results = corpus.search("creating tension", mode="semantic")

Direct Usage

You can also build and query an EmbeddingIndex without the Corpus wrapper:

from pathlib import Path
from ifcraftcorpus.embeddings import EmbeddingIndex, build_embeddings_from_index

# Build from an existing index (corpus_index is a populated CorpusIndex)
embeddings = build_embeddings_from_index(corpus_index)
embeddings.save(Path("embeddings/"))

# Load and search
embeddings = EmbeddingIndex.load(Path("embeddings/"))
results = embeddings.search("creating tension in scenes")

Classes

EmbeddingIndex

Vector embedding index for semantic search.

class EmbeddingIndex:
    def __init__(
        self,
        model_name: str = "all-MiniLM-L6-v2",
        *,
        lazy_load: bool = True,
    ) -> None

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_name | str | "all-MiniLM-L6-v2" | Sentence-transformers model name |
| lazy_load | bool | True | If True, load the model on first use |
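The lazy_load flag follows a common pattern: defer the expensive model load until the index is first used. A generic sketch of that pattern with a dummy loader (this is the general technique, not the library's implementation):

```python
class LazyModel:
    """Load an expensive resource only on first access."""

    def __init__(self, loader, *, lazy: bool = True):
        self._loader = loader
        self._model = None if lazy else loader()

    @property
    def model(self):
        if self._model is None:
            self._model = self._loader()  # first use triggers the load
        return self._model

loads = []
lazy = LazyModel(lambda: loads.append("loaded") or "model", lazy=True)
print(len(loads))  # 0 - nothing loaded yet
_ = lazy.model
print(len(loads))  # 1 - loaded on first access
```

With lazy_load=True, constructing an EmbeddingIndex stays cheap; the cost is paid on the first call that needs the model.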

Methods

add_texts
def add_texts(
    self,
    texts: list[str],
    metadata: list[dict],
) -> None

Add texts with metadata to the index.

Parameters:

| Parameter | Type | Description |
|---|---|---|
| texts | list[str] | Text strings to embed |
| metadata | list[dict] | Metadata dicts (same length as texts) |

Example:

index = EmbeddingIndex()
index.add_texts(
    ["Text about dialogue", "Text about pacing"],
    [
        {"document_name": "dialogue", "title": "Dialogue Craft"},
        {"document_name": "pacing", "title": "Scene Pacing"}
    ]
)
search
def search(
    self,
    query: str,
    *,
    top_k: int = 10,
    cluster: str | None = None,
) -> list[tuple[dict, float]]

Search for semantically similar texts.

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | required | Search query text |
| top_k | int | 10 | Maximum results to return |
| cluster | str \| None | None | Optional cluster name to filter results; only returns matches where metadata["cluster"] equals this value |

Returns: List of (metadata, similarity_score) tuples. Scores range from 0 to 1.

Example:

results = index.search("building suspense in scenes")
for metadata, score in results:
    print(f"{metadata['title']}: {score:.3f}")
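The cluster filter and top_k cutoff amount to a filter-then-rank step over (metadata, score) pairs. A pure-Python sketch of that logic on made-up data (not the library's implementation):

```python
def filter_and_rank(results, *, top_k=10, cluster=None):
    """Keep only matches in `cluster` (if given), sort by score, truncate."""
    if cluster is not None:
        results = [(m, s) for m, s in results if m.get("cluster") == cluster]
    return sorted(results, key=lambda pair: pair[1], reverse=True)[:top_k]

raw = [
    ({"title": "Scene Pacing", "cluster": "craft"}, 0.81),
    ({"title": "Dialogue Craft", "cluster": "craft"}, 0.74),
    ({"title": "Parser Design", "cluster": "tech"}, 0.90),
]

for metadata, score in filter_and_rank(raw, top_k=2, cluster="craft"):
    print(f"{metadata['title']}: {score:.3f}")
# Scene Pacing: 0.810
# Dialogue Craft: 0.740
```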
save
def save(self, path: Path) -> None

Save the index to disk.

Parameters:

| Parameter | Type | Description |
|---|---|---|
| path | Path | Directory to save index files |

Files Created:

  • embeddings.npy - NumPy array of vectors
  • metadata.json - Model name and metadata
load (classmethod)
@classmethod
def load(cls, path: Path) -> EmbeddingIndex

Load an index from disk.

Parameters:

| Parameter | Type | Description |
|---|---|---|
| path | Path | Directory containing saved index |

Returns: Loaded EmbeddingIndex instance.

Example:

index = EmbeddingIndex.load(Path("embeddings/"))
results = index.search("dialogue techniques")
__len__
def __len__(self) -> int

Get the number of indexed items.

Functions

build_embeddings_from_index

def build_embeddings_from_index(
    corpus_index: CorpusIndex,
    model_name: str = "all-MiniLM-L6-v2",
) -> EmbeddingIndex

Build an embedding index from an existing CorpusIndex.

Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| corpus_index | CorpusIndex | required | Populated corpus index |
| model_name | str | "all-MiniLM-L6-v2" | Sentence-transformers model name |

Returns: New EmbeddingIndex with all corpus content.

Example:

from pathlib import Path
from ifcraftcorpus.index import CorpusIndex
from ifcraftcorpus.embeddings import build_embeddings_from_index

# Build FTS index
index = CorpusIndex()
index.build_from_directory(Path("corpus"))

# Generate embeddings
embeddings = build_embeddings_from_index(index)
embeddings.save(Path("embeddings/"))
print(f"Created {len(embeddings)} embeddings")

Constants

DEFAULT_MODEL

DEFAULT_MODEL = "all-MiniLM-L6-v2"

The default sentence-transformers model. This is a small, fast model that works well for semantic similarity tasks.

Integration with Corpus

To use semantic search with the main Corpus API:

from pathlib import Path
from ifcraftcorpus import Corpus

# Provide path to saved embeddings
corpus = Corpus(embeddings_path=Path("embeddings/"))

# Use semantic search mode
results = corpus.search("scary atmosphere", mode="semantic")

# Or hybrid mode (combines keyword + semantic)
results = corpus.search("horror techniques", mode="hybrid")

# Check if semantic search is available
if corpus.has_semantic_search:
    print("Semantic search enabled")
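How hybrid mode weighs the two signals is not documented here; one common approach is a linear blend of normalized keyword and semantic scores. A hedged sketch of that idea (the function name, the 0.5 weight, and the assumption that both scores are already in [0, 1] are illustrative, not the library's actual scoring):

```python
def hybrid_score(keyword_score: float, semantic_score: float, *, alpha: float = 0.5) -> float:
    """Linear blend of two relevance scores, each normalized to [0, 1]."""
    return alpha * keyword_score + (1 - alpha) * semantic_score

# A document matching both signals outranks one matching only keywords
print(hybrid_score(0.9, 0.8))  # 0.85
print(hybrid_score(0.9, 0.1))  # 0.5
```

Whatever the actual weighting, the practical point stands: hybrid mode lets strong keyword matches and strong semantic matches both surface in one ranked result list.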