Embeddings API¶
The embeddings module provides semantic search capabilities using multiple embedding providers.
Optional Module

This module is optional and requires either the embeddings-api or embeddings extra.
Overview¶
Semantic search finds conceptually similar content even when exact keywords don't match. It uses dense vector embeddings to represent text meaning.
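For intuition, here is a toy sketch of how dense-vector similarity behaves. It is illustrative only: the embedding values are made up and the `cosine` helper is not part of this library. Related texts map to nearby vectors, so their cosine similarity is high even with zero keyword overlap.

```python
from math import sqrt

# Toy illustration -- not the library's API.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

query = [0.9, 0.1, 0.0]      # pretend embedding of "creating tension"
related = [0.8, 0.2, 0.1]    # pretend embedding of "building suspense"
unrelated = [0.0, 0.1, 0.9]  # pretend embedding of "breakfast recipes"

# The related text scores far higher despite sharing no words with the query.
assert cosine(query, related) > cosine(query, unrelated)
```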
Embedding Providers¶
The library supports three embedding providers. Use get_embedding_provider() for auto-detection or instantiate directly:
```python
from ifcraftcorpus.providers import (
    get_embedding_provider,
    OllamaEmbeddings,
    OpenAIEmbeddings,
    SentenceTransformersEmbeddings,
)

# Auto-detect the best available provider
provider = get_embedding_provider()

# Or instantiate a specific provider
provider = OllamaEmbeddings()                # requires a running Ollama server
provider = OpenAIEmbeddings()                # requires OPENAI_API_KEY
provider = SentenceTransformersEmbeddings()  # local, no API needed
```
Provider Configuration¶
| Provider | Environment Variable | Default Model |
|---|---|---|
| Ollama | `OLLAMA_HOST` (default: `http://localhost:11434`) | `nomic-embed-text` |
| OpenAI | `OPENAI_API_KEY` | `text-embedding-3-small` |
| SentenceTransformers | (none) | `all-MiniLM-L6-v2` |
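The selection order used by `get_embedding_provider()` is not documented here; a plausible sketch (the order and the `pick_provider` helper are assumptions, not the library's actual logic) would check each provider's prerequisite from the table and fall back to the local provider:

```python
import os

# Hypothetical fallback order -- the real get_embedding_provider() may differ.
def pick_provider() -> str:
    if os.environ.get("OLLAMA_HOST"):     # Ollama server configured
        return "ollama"
    if os.environ.get("OPENAI_API_KEY"):  # OpenAI credentials present
        return "openai"
    return "sentence-transformers"        # local fallback, no setup needed
```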
Using with Corpus¶
```python
from pathlib import Path

from ifcraftcorpus import Corpus
from ifcraftcorpus.providers import OllamaEmbeddings

corpus = Corpus(
    embeddings_path=Path("embeddings/"),
    embedding_provider=OllamaEmbeddings(),
)

# Build embeddings (if not already built)
corpus.build_embeddings()

# Use semantic search
results = corpus.search("creating tension", mode="semantic")
```
You can also build and query an `EmbeddingIndex` directly:

```python
from pathlib import Path

from ifcraftcorpus.embeddings import EmbeddingIndex, build_embeddings_from_index

# Build from an existing index (corpus_index is a populated CorpusIndex)
embeddings = build_embeddings_from_index(corpus_index)
embeddings.save(Path("embeddings/"))

# Load and search
embeddings = EmbeddingIndex.load(Path("embeddings/"))
results = embeddings.search("creating tension in scenes")
```
Classes¶
EmbeddingIndex¶
Vector embedding index for semantic search.
```python
class EmbeddingIndex:
    def __init__(
        self,
        model_name: str = "all-MiniLM-L6-v2",
        *,
        lazy_load: bool = True,
    ) -> None
```
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_name` | `str` | `"all-MiniLM-L6-v2"` | Sentence-transformers model name |
| `lazy_load` | `bool` | `True` | If `True`, load the model on first use |
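The `lazy_load` flag defers the (slow) model load until it is actually needed. A minimal sketch of that pattern, with a string standing in for the real model object (illustrative only; this is not the library's implementation):

```python
class LazyModel:
    """Sketch of lazy loading -- not the library's actual class."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2") -> None:
        self.model_name = model_name
        self._model = None  # nothing loaded at construction time

    @property
    def model(self) -> str:
        if self._model is None:
            # Stand-in for the expensive SentenceTransformer(model_name) call
            self._model = f"loaded:{self.model_name}"
        return self._model

lm = LazyModel()
assert lm._model is None  # constructing is cheap
_ = lm.model              # first access triggers the load
assert lm._model is not None
```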
Methods¶
add_texts¶
Add texts with metadata to the index.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `texts` | `list[str]` | Text strings to embed |
| `metadata` | `list[dict]` | Metadata dicts (same length as `texts`) |
Example:
```python
index = EmbeddingIndex()
index.add_texts(
    ["Text about dialogue", "Text about pacing"],
    [
        {"document_name": "dialogue", "title": "Dialogue Craft"},
        {"document_name": "pacing", "title": "Scene Pacing"},
    ],
)
```
search¶
```python
def search(
    self,
    query: str,
    *,
    top_k: int = 10,
    cluster: str | None = None,
) -> list[tuple[dict, float]]
```
Search for semantically similar texts.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `query` | `str` | required | Search query text |
| `top_k` | `int` | `10` | Maximum results to return |
| `cluster` | `str \| None` | `None` | Optional cluster name to filter results. Only returns matches where `metadata["cluster"]` equals this value. |
Returns: List of (metadata, similarity_score) tuples. Scores range from 0 to 1.
Example:
```python
results = index.search("building suspense in scenes")
for metadata, score in results:
    print(f"{metadata['title']}: {score:.3f}")
```
save¶
Save the index to disk.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `path` | `Path` | Directory to save index files |

Files Created:

- `embeddings.npy`: NumPy array of vectors
- `metadata.json`: Model name and metadata
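A sketch of what that on-disk layout might look like. The exact file schema is an assumption (the doc only names the two files); this version stores the vectors with `numpy.save` and the metadata as JSON:

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical layout -- the library's real schema may differ.
out = Path(tempfile.mkdtemp())

vectors = np.zeros((3, 384), dtype=np.float32)  # 3 texts, 384-dim model
np.save(out / "embeddings.npy", vectors)

(out / "metadata.json").write_text(json.dumps({
    "model_name": "all-MiniLM-L6-v2",
    "items": [{"document_name": f"doc{i}"} for i in range(3)],
}))

# Round-trip the vectors to confirm the layout loads back
assert np.load(out / "embeddings.npy").shape == (3, 384)
```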
load (classmethod)¶
Load an index from disk.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `path` | `Path` | Directory containing saved index |
Returns: Loaded EmbeddingIndex instance.
__len__¶
Get the number of indexed items.
Functions¶
build_embeddings_from_index¶
```python
def build_embeddings_from_index(
    corpus_index: CorpusIndex,
    model_name: str = "all-MiniLM-L6-v2",
) -> EmbeddingIndex
```
Build an embedding index from an existing CorpusIndex.
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `corpus_index` | `CorpusIndex` | required | Populated corpus index |
| `model_name` | `str` | `"all-MiniLM-L6-v2"` | Model name |
Returns: New EmbeddingIndex with all corpus content.
Example:
```python
from pathlib import Path

from ifcraftcorpus.index import CorpusIndex
from ifcraftcorpus.embeddings import build_embeddings_from_index

# Build the FTS index
index = CorpusIndex()
index.build_from_directory(Path("corpus"))

# Generate embeddings
embeddings = build_embeddings_from_index(index)
embeddings.save(Path("embeddings/"))
print(f"Created {len(embeddings)} embeddings")
```
Constants¶
DEFAULT_MODEL¶
The default sentence-transformers model, `"all-MiniLM-L6-v2"`. This is a small, fast model that works well for semantic similarity tasks.
Integration with Corpus¶
To use semantic search with the main Corpus API:
```python
from pathlib import Path

from ifcraftcorpus import Corpus

# Provide the path to saved embeddings
corpus = Corpus(embeddings_path=Path("embeddings/"))

# Semantic search mode
results = corpus.search("scary atmosphere", mode="semantic")

# Or hybrid mode (combines keyword + semantic)
results = corpus.search("horror techniques", mode="hybrid")

# Check whether semantic search is available
if corpus.has_semantic_search:
    print("Semantic search enabled")
```
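The doc doesn't specify how `hybrid` mode combines the two signals. One common approach (an assumption for intuition, not necessarily what `Corpus.search` does) is a weighted blend of normalized keyword and semantic scores:

```python
# Hypothetical blending -- shown for intuition only.
def hybrid_score(keyword: float, semantic: float, alpha: float = 0.5) -> float:
    """Blend a normalized keyword score and a semantic score, both in [0, 1]."""
    return alpha * keyword + (1 - alpha) * semantic

# A strong semantic match can outrank a middling exact-keyword match.
assert hybrid_score(0.2, 0.95) > hybrid_score(0.6, 0.3)
```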