Memory (RAG)

One-line summary

The memory system provides retrieval-augmented generation via a hybrid search engine (vector + FTS5 keyword + MMR re-ranking) over a local SQLite database, injecting relevant context chunks into each LLM conversation.

Responsibilities

  • Index Markdown files from agent workspace into searchable chunks (vector embeddings + FTS5)
  • Execute hybrid search (vector similarity + keyword matching) with configurable weights
  • Apply MMR (Maximal Marginal Relevance) re-ranking for result diversity (see the sketch after this list)
  • Apply temporal decay to prioritize recent content
  • Support multiple embedding providers (OpenAI, Gemini, Voyage, Mistral, local node-llama-cpp)
  • Gracefully fall back to FTS-only mode when no embedding provider is available
  • Inject search results as context into agent conversations
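
The MMR step is worth a quick illustration. Below is a minimal sketch of the greedy algorithm, assuming a lambda relevance/diversity trade-off and cosine similarity between chunk embeddings; the Candidate shape and parameter names are my inventions, and mmr.ts may differ in the details:

```ts
// Illustrative MMR re-ranking: greedily pick the candidate that maximizes
// lambda * relevance - (1 - lambda) * (max similarity to already-picked results).
interface Candidate {
  score: number;       // relevance score from hybrid search
  embedding: number[]; // chunk embedding
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function mmrRerank(candidates: Candidate[], k: number, lambda = 0.7): Candidate[] {
  const selected: Candidate[] = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      // Redundancy = similarity to the closest already-selected chunk.
      const redundancy = selected.length
        ? Math.max(...selected.map((s) => cosineSimilarity(pool[i].embedding, s.embedding)))
        : 0;
      const mmr = lambda * pool[i].score - (1 - lambda) * redundancy;
      if (mmr > bestScore) {
        bestScore = mmr;
        bestIdx = i;
      }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```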

Key source files

| File | Lines | Role |
|------|-------|------|
| src/memory/manager.ts | 640 | Main API: MemoryIndexManager class (search, sync, status, read) |
| src/memory/manager-search.ts | ~200 | Search implementation: searchVector(), searchKeyword() |
| src/memory/manager-sync-ops.ts | ~1000+ | File indexing: chunking, embedding, database writes |
| src/memory/hybrid.ts | 149 | Hybrid merge: mergeHybridResults() with configurable weights |
| src/memory/embeddings.ts | 296 | Provider factory: createEmbeddingProvider() with multi-provider fallback |
| src/memory/mmr.ts | ~200 | MMR diversity re-ranking algorithm |
| src/memory/temporal-decay.ts | ~150 | Temporal decay scoring |
| src/memory/query-expansion.ts | ~500 | Query keyword extraction for FTS improvement |
| src/memory/qmd-manager.ts | ~2000+ | Advanced query/metadata management |
| src/memory/search-manager.ts | ~250 | Search orchestration |

Data flow

Indexing

  1. Markdown file is created or modified in the agent workspace
  2. File watcher (chokidar) detects the change
  3. manager-sync-ops.ts chunks the file
  4. Embeddings are generated (if a provider is available)
  5. Chunks + embeddings are stored in SQLite (memory.db)
  6. The FTS5 index is updated for keyword search
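
A condensed TypeScript sketch of this path. The better-sqlite3 usage, the table names, and the chunkMarkdown()/embed() helpers are my assumptions, not the real manager-sync-ops.ts API (the actual vector store goes through sqlite-vec, and real chunk size/overlap settings are still a blind spot); only the overall flow comes from the steps above.

```ts
// Hypothetical indexing flow; schema and helper names are illustrative.
import Database from "better-sqlite3";

async function indexFile(
  db: Database.Database,
  path: string,
  content: string,
  embed?: (texts: string[]) => Promise<number[][]>, // optional embedding provider
): Promise<void> {
  const chunks = chunkMarkdown(content);
  // With no provider configured, skip embeddings: search degrades to FTS-only.
  const vectors = embed ? await embed(chunks) : null;

  const insertChunk = db.prepare(
    "INSERT INTO chunks (path, idx, text, embedding) VALUES (?, ?, ?, ?)",
  );
  const insertFts = db.prepare("INSERT INTO chunks_fts (path, text) VALUES (?, ?)");

  for (let i = 0; i < chunks.length; i++) {
    const blob = vectors ? Buffer.from(new Float32Array(vectors[i]).buffer) : null;
    insertChunk.run(path, i, chunks[i], blob);
    insertFts.run(path, chunks[i]); // keep the FTS5 keyword index in sync
  }
}

// Naive paragraph chunker stand-in; the real splitter's size/overlap is unknown.
function chunkMarkdown(text: string): string[] {
  return text.split(/\n{2,}/).filter((p) => p.trim().length > 0);
}
```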

Search & injection

Before each agent turn (or on demand via the memory_search tool):

manager.ts search()
  ├── Query expansion (extractKeywords)
  ├── Vector search (sqlite-vec cosine similarity)
  ├── Keyword search (FTS5 BM25 ranking)
  ├── Hybrid merge (weighted combination)
  ├── MMR re-ranking (diversity)
  └── Temporal decay (recency)

The top results (up to maxResults, filtered by minScore) are then injected as context into the conversation.
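
The weighted-combination step is the heart of hybrid.ts. mergeHybridResults() is real per the file table above, but the body below is my sketch: the ScoredChunk shape is invented, and the real implementation likely also normalizes the two score scales before mixing.

```ts
// Sketch of a weighted hybrid merge: sum per-chunk contributions from the
// vector and keyword result lists, then sort by combined score.
interface ScoredChunk {
  id: string;
  score: number; // assumed already normalized to [0, 1]
}

function mergeHybrid(
  vector: ScoredChunk[],
  keyword: ScoredChunk[],
  vectorWeight: number,
  textWeight: number,
): ScoredChunk[] {
  const merged = new Map<string, number>();
  for (const r of vector) merged.set(r.id, (merged.get(r.id) ?? 0) + vectorWeight * r.score);
  for (const r of keyword) merged.set(r.id, (merged.get(r.id) ?? 0) + textWeight * r.score);
  return [...merged.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```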

Token optimization impact

| Mechanism | Token cost | Details |
|-----------|------------|---------|
| Search results | 500-2,000 tokens/turn | maxResults chunks × SNIPPET_MAX_CHARS (700 chars) per chunk |
| Per-chunk cost | ~175 tokens max | 700 chars / 4 ≈ 175 tokens per chunk |
| Typical injection | ~875-1,750 tokens | 5-10 chunks × ~175 tokens; depends on maxResults config |
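
The same arithmetic as a small helper, using the ~4 chars/token heuristic and the 700-char snippet cap from the table:

```ts
// Back-of-envelope injection budget: maxResults snippets at up to 700 chars,
// ~4 chars per token.
const CHARS_PER_TOKEN = 4;
const SNIPPET_MAX_CHARS = 700;

function estimateInjectionTokens(maxResults: number): number {
  return maxResults * Math.ceil(SNIPPET_MAX_CHARS / CHARS_PER_TOKEN);
}

estimateInjectionTokens(5);  // 875 tokens
estimateInjectionTokens(10); // 1750 tokens
```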

Configuration knobs

agents.defaults.memorySearch.query:
  maxResults: 5-20 (how many chunks to inject)
  minScore: 0-1 (quality threshold)
  hybrid:
    vectorWeight: 0-1
    textWeight: 0-1
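
The same knobs as a hypothetical TypeScript shape; the field names mirror the YAML above, but the interface itself and the defaults are unconfirmed (see blind spots):

```ts
// Hypothetical config shape for agents.defaults.memorySearch.query.
interface MemorySearchQueryConfig {
  maxResults: number; // chunks to inject per turn (e.g. 5-20)
  minScore: number;   // quality threshold in [0, 1]
  hybrid: {
    vectorWeight: number; // weight on vector-similarity scores, [0, 1]
    textWeight: number;   // weight on FTS5/BM25 keyword scores, [0, 1]
  };
}
```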

Optimization opportunities

  • Lower maxResults: Each result adds ~175 tokens; reducing from 10 to 5 saves ~875 tokens/turn
  • Higher minScore: Filter out low-quality results that add tokens without value
  • MMR diversity: Already helps by removing redundant chunks
  • FTS-only mode: Zero embedding API cost, still functional for keyword-heavy queries
  • Temporal decay: Naturally deprioritizes stale content (see the sketch below)
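
Temporal decay is usually an exponential down-weighting by age; here is a sketch under that assumption (the actual curve and half-life in temporal-decay.ts are unconfirmed):

```ts
// Assumed exponential decay with a configurable half-life.
const MS_PER_DAY = 86_400_000;

function applyTemporalDecay(
  score: number,
  updatedAtMs: number,
  nowMs = Date.now(),
  halfLifeDays = 30, // hypothetical half-life
): number {
  const ageDays = (nowMs - updatedAtMs) / MS_PER_DAY;
  return score * Math.pow(0.5, ageDays / halfLifeDays);
}
```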

Embedding provider cost

| Provider | Cost | Quality |
|----------|------|---------|
| Local (node-llama-cpp) | Free (compute only) | Lower |
| OpenAI text-embedding-3-small | ~$0.02/M tokens | Good |
| Voyage voyage-3 | ~$0.06/M tokens | High |
| Gemini | Variable | Good |
| FTS-only (no embeddings) | Free | Keyword-only |
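
The graceful FTS-only fallback suggests a provider factory shaped roughly like this; the EmbeddingProvider interface and the probing logic are my assumptions, not the real createEmbeddingProvider() signature:

```ts
// Hypothetical provider selection with graceful degradation to FTS-only.
interface EmbeddingProvider {
  name: string;
  embed(texts: string[]): Promise<number[][]>;
}

function pickEmbeddingProvider(
  candidates: Array<() => EmbeddingProvider | null>,
): EmbeddingProvider | null {
  for (const make of candidates) {
    try {
      const provider = make(); // e.g. returns null if API key or model is missing
      if (provider) return provider;
    } catch {
      // provider unavailable; try the next one in the fallback chain
    }
  }
  return null; // caller falls back to FTS5 keyword-only search
}
```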

How it connects to other modules

  • Depends on:

    • config/ — memory settings, embedding provider configuration
    • sessions/ — session-aware indexing
    • SQLite + sqlite-vec — storage engine
    • chokidar — file system watching
  • Depended on by:

    • agents/pi-embedded-runner/ — memory context injected via Pi SDK hooks
    • system-prompt.ts — memory recall instructions
    • auto-reply/reply/memory-flush.ts — memory flush after replies

My blind spots

  • Exact chunk size and overlap settings — how files are split into chunks
  • maxResults default value — need to find in config defaults
  • How memory context is actually injected — as system message or user message?
  • qmd-manager.ts (2000+ lines) — advanced query metadata system, largely unexplored
  • Batch embedding operations — how they handle rate limits and failures
  • Whether search results are cached across turns
  • manager-embedding-ops.ts — embedding lifecycle and re-indexing strategy

Change frequency

  • manager.ts: Medium — search API evolves with new features
  • hybrid.ts: Low — merge algorithm is stable
  • embeddings.ts: Medium — new providers added as they become available
  • mmr.ts: Low — diversity algorithm is mature
  • qmd-manager.ts: Medium — advanced query features actively developed