Experience Memory Engine - Idea

Why It Can't Be Just One Thing

An MCP server alone won't work. MCP tool calls are request-response: the agent calls a tool and gets a result. But Experience Memory needs to do background work when nobody is asking it anything: overnight revision, inference chains, confidence decay, event monitoring. A server that only wakes up when called can't do that.

A Python package alone won't work either. Embedded in the agent's process, it dies whenever the agent isn't running, competes with the agent for compute during conversations, and can't be scaled or updated independently.

A standalone background service is the right foundation — but it should expose an MCP interface as one of its surfaces, so the main agent can interact with it using the standard tool protocol your platform already speaks.

Think of it as: Experience Memory is a service that happens to have an MCP interface, not an MCP server that happens to have background processing.
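
To make that distinction concrete, here is a minimal sketch of the shape I have in mind, assuming plain asyncio. The class name, the revision interval, and the `handle_query` method are all hypothetical stand-ins, not a prescribed design. The point is only that the background worker ticks on its own schedule while the query surface answers when asked.

```python
import asyncio


class ExperienceMemoryService:
    """Hypothetical skeleton: one process, two kinds of work."""

    def __init__(self, revision_interval_s: float) -> None:
        self.revision_interval_s = revision_interval_s
        self.revision_runs = 0  # incremented by the background worker
        self.facts = {"lena.birthday": "unknown"}  # stand-in for the graph

    async def _revision_worker(self) -> None:
        # Runs on a timer whether or not any agent is connected.
        while True:
            await asyncio.sleep(self.revision_interval_s)
            self.revision_runs += 1  # a real worker would revise the graph here

    async def handle_query(self, key: str) -> str:
        # Stand-in for the synchronous query surface (the MCP side).
        return self.facts.get(key, "no data")

    async def run(self, lifetime_s: float) -> None:
        worker = asyncio.create_task(self._revision_worker())
        try:
            # A real service would wait for a shutdown signal instead.
            await asyncio.sleep(lifetime_s)
        finally:
            worker.cancel()
            # Let the cancellation finish without raising CancelledError here.
            await asyncio.gather(worker, return_exceptions=True)
```

Running `asyncio.run(ExperienceMemoryService(0.01).run(0.1))` lets the worker fire several times with no query ever arriving, which is exactly the behavior a call-driven server can't give you.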

The Architecture I'd Recommend

Three layers, each with a distinct role:

graph TB
    subgraph Experience Memory Service
        direction TB

        subgraph API_Layer ["How the world talks to EM"]
            MCP["MCP Interface<br/>Agent queries graph,<br/>reports interactions,<br/>fetches suggestions"]
            GRPC["gRPC / Internal API<br/>High-throughput event<br/>ingestion from agent"]
        end

        subgraph Processing_Core ["The brain"]
            EP["Extraction Pipeline<br/>NLU + entity/relation<br/>extraction + hedging"]
            GDE["Graph Diff Engine<br/>New vs reinforcement<br/>vs contradiction"]
            IE["Inference Engine<br/>Cross-context pattern<br/>detection"]
            PE["Proactive Engine<br/>Trigger evaluation,<br/>risk model, probe queue"]
        end

        subgraph Background_Workers ["The night shift"]
            REV["Revision Worker<br/>Fact verification,<br/>confidence decay"]
            INF["Inference Worker<br/>Overnight inference<br/>chains"]
            MON["Event Monitor<br/>News, weather, market,<br/>calendar triggers"]
            SCHED["Scheduler<br/>Cron-like task<br/>management"]
        end

        subgraph Storage Layer
            NEO["Neo4j<br/>Knowledge Graph"]
            VEC["Vector Store<br/>Episode embeddings<br/>for similarity search"]
            QUEUE_IN["Inbound Queue<br/>User interactions,<br/>agent events"]
            QUEUE_OUT["Outbound Queue<br/>Probes, suggestions,<br/>conversation starters"]
        end

        subgraph LLM Access
            LLM_SMALL["Small/Fast LLM<br/>Extraction, classification,<br/>hedging detection"]
            LLM_LARGE["Large LLM<br/>Complex inference,<br/>experience synthesis"]
        end
    end

    MCP --> EP
    GRPC --> QUEUE_IN
    QUEUE_IN --> EP
    EP --> GDE
    GDE --> NEO
    GDE --> VEC
    GDE --> PE
    PE --> QUEUE_OUT
    IE --> NEO
    REV --> NEO
    REV --> LLM_SMALL
    INF --> LLM_LARGE
    INF --> NEO
    EP --> LLM_SMALL
    IE --> LLM_LARGE
    MON --> PE
    SCHED --> REV
    SCHED --> INF

    style MCP fill:#d4edda,stroke:#155724
    style NEO fill:#cce5ff,stroke:#004085
    style VEC fill:#cce5ff,stroke:#004085
    style QUEUE_IN fill:#fff3cd,stroke:#856404
    style QUEUE_OUT fill:#fff3cd,stroke:#856404
    style LLM_SMALL fill:#e8daef,stroke:#6c3483
    style LLM_LARGE fill:#e8daef,stroke:#6c3483

Key Architectural Decisions

Two LLMs, Not One

This is important. The extraction pipeline runs on every single user message. If you route that through a large model (Claude Opus, GPT-4), you'll burn through tokens and add latency to every conversation. Extraction, classification, and hedging detection are well suited to a smaller, faster model: something like Haiku, or a fine-tuned local model running inside the Confidential VM (CVM).

Reserve the large LLM for the expensive cognitive work: complex inference chains, experience synthesis, and the overnight reasoning that connects dots across the entire graph.

| Task | LLM | Why |
| --- | --- | --- |
| Entity extraction | Small/fast | Runs on every message, needs low latency |
| Relation extraction | Small/fast | Structured output, well-defined task |
| Hedging and confidence detection | Small/fast | Classification task |
| Contradiction resolution | Large | Requires nuanced reasoning |
| Overnight inference chains | Large | Creative cross-domain reasoning |
| Experience synthesis (promoting patterns to experiences) | Large | Needs to generalize across episodes |
| Probe question generation | Large | Needs conversational intelligence |
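
One way to keep this split honest in code is a single routing table that every component consults. This is only a sketch: the task keys and the `pick_tier` helper are names I'm making up for illustration, and the mapping simply mirrors the table above.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "small"
    LARGE = "large"


# Mirrors the task table; the keys are hypothetical identifiers.
TASK_TIER = {
    "entity_extraction": Tier.SMALL,
    "relation_extraction": Tier.SMALL,
    "hedging_detection": Tier.SMALL,
    "contradiction_resolution": Tier.LARGE,
    "overnight_inference": Tier.LARGE,
    "experience_synthesis": Tier.LARGE,
    "probe_generation": Tier.LARGE,
}


def pick_tier(task: str) -> Tier:
    """Return the model tier for a task.

    Unknown tasks raise instead of silently defaulting to the expensive model.
    """
    try:
        return TASK_TIER[task]
    except KeyError:
        raise ValueError(f"no tier configured for task {task!r}") from None
```

Failing loudly on an unmapped task is a deliberate choice here: a silent default to the large model is how per-message costs creep up unnoticed.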

Dual Queue Pattern

The inbound/outbound queue pattern you described is exactly right, and it solves one of the trickiest coordination problems: the agent and Experience Memory operate on different timescales.

The agent operates in real-time — it needs to respond to the user in seconds. Experience Memory operates in both real-time (extraction) and background (revision, inference). The queues decouple these timescales.

sequenceDiagram
    participant U as User
    participant AG as Main Agent
    participant IQ as Inbound Queue
    participant EM as Experience Memory
    participant OQ as Outbound Queue

    U->>AG: "Help me pick wines<br/>for the weekend"

    par Agent responds immediately
        AG->>U: "Here are 3 great options..."
    and Agent reports interaction
        AG->>IQ: Event: wine conversation,<br/>entities mentioned, user context
    end

    Note over IQ,EM: Async processing

    IQ->>EM: Process interaction
    EM->>EM: Extract entities, relations
    EM->>EM: Graph diff, update Neo4j
    EM->>EM: Check probe queue —<br/>wine context matches<br/>"wife's drink preferences" gap

    EM->>OQ: Probe ready:<br/>"Lena's birthday in a month.<br/>Does she enjoy wine?"<br/>Context: wine conversation<br/>Priority: HIGH

    Note over OQ,AG: Agent pulls when appropriate

    AG->>OQ: Any pending probes<br/>for current context?
    OQ-->>AG: Yes: wine/birthday probe

    AG->>U: "By the way — Lena's birthday<br/>is a month out. Does she<br/>enjoy wine?"

The critical detail: the agent pulls from the outbound queue; Experience Memory doesn't push to the user directly. The agent decides when and whether to deliver a probe based on the conversational flow. Experience Memory proposes; the agent disposes.
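
The pull model can be sketched in a few lines. Everything here is an in-memory assumption for illustration: the `Probe` fields, the topic-overlap matching rule, and the "lower number is more urgent" priority convention are mine, and a real deployment would sit on a durable queue rather than Python lists.

```python
import itertools
from collections import deque
from dataclasses import dataclass
from typing import Deque, FrozenSet, Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Probe:
    text: str
    topics: FrozenSet[str]
    priority: int  # assumed convention: lower value = more urgent


class OutboundQueue:
    """Holds probes until the agent asks for one that fits the conversation."""

    def __init__(self) -> None:
        self._items: List[Tuple[int, int, Probe]] = []
        self._seq = itertools.count()  # tie-breaker: older probes win on equal priority

    def put(self, probe: Probe) -> None:
        self._items.append((probe.priority, next(self._seq), probe))

    def pull(self, context_topics: Set[str]) -> Optional[Probe]:
        """Remove and return the most urgent probe overlapping the context."""
        matches = [item for item in self._items if item[2].topics & context_topics]
        if not matches:
            return None
        best = min(matches, key=lambda item: item[:2])
        self._items.remove(best)
        return best[2]


def process_inbound(
    inbound: Deque[dict], outbound: OutboundQueue, gaps: Iterable[Probe]
) -> None:
    """Experience Memory side: drain agent events, queue probes for matching gaps."""
    gaps = list(gaps)
    while inbound:
        event = inbound.popleft()
        for gap in gaps:
            if gap.topics & event["topics"]:
                outbound.put(gap)
```

Note that nothing in `process_inbound` talks to the user. It only fills the outbound queue, and a probe leaves that queue only when the agent calls `pull` with the topics of the current conversation.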

Neo4j + Vector Store — Why Both

Neo4j is the right choice for the knowledge graph — relationship traversal, pattern matching, and Cypher queries are exactly what you need for "find everything connected to Lena within 2 hops" or "what do I know about the user's travel preferences?"
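
As a rough illustration of the "within 2 hops" query, here is a small helper that builds the Cypher text and its parameters. The `Person` label and `name` property are assumptions about a schema we haven't defined yet, and the helper name is hypothetical. As far as I know, Cypher does not accept a parameter for the hop bound of a variable-length pattern, so the sketch validates the number and places it in the query text, while the name stays a proper parameter.

```python
from typing import Dict, Tuple


def neighbors_query(name: str, max_hops: int = 2) -> Tuple[str, Dict[str, str]]:
    """Build a Cypher query for everything within `max_hops` of a person.

    Assumes a `Person` label with a `name` property (not a confirmed schema).
    """
    # Strict int check (rejects bools) because this value goes into the query text.
    if type(max_hops) is not int or not 1 <= max_hops <= 5:
        raise ValueError("max_hops must be an int between 1 and 5")
    query = (
        "MATCH (p:Person {name: $name})-[*1.." + str(max_hops) + "]-(n) "
        "RETURN DISTINCT n"
    )
    return query, {"name": name}
```

The returned pair is meant to be handed to whatever driver session you use; keeping query construction separate from execution also makes it easy to unit test without a running database.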

But you also need a vector store for a different purpose: episode similarity. When the extraction pipeline processes a new interaction, it needs to find similar past episodes to determine if this is reinforcement of existing knowledge or something new. That's an embedding similarity search, not a graph traversal.

| Storage | Purpose | Query Pattern |
| --- | --- | --- |
| Neo4j | Knowledge graph: entities, relationships, experiences, procedures | "What does the user's wife like?" → Graph traversal |
| Vector Store | Episode embeddings: past interaction similarity | "Have we had a conversation like this before?" → Embedding search |
| Outbound Queue | Pending probes, suggestions, conversation starters | "Anything relevant to the current context?" → Priority queue with context matching |
| Inbound Queue | Interaction events from the agent | "New interaction to process" → FIFO with priority |
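
The episode-similarity lookup boils down to a nearest-neighbor search over embeddings. Here is a dependency-free sketch using cosine similarity. The 0.85 threshold, the function names, and the toy two-dimensional vectors are placeholders I'm assuming for illustration; a real vector store would do this search for you, and the right threshold has to come from evaluation.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_episode(
    new_vec: Sequence[float],
    episodes: Dict[str, List[float]],
    threshold: float = 0.85,  # assumed cutoff, to be tuned
) -> Optional[Tuple[str, float]]:
    """Return (episode_id, score) for the closest past episode, or None if
    nothing clears the threshold (i.e. this looks like new knowledge)."""
    best_id, best_score = None, -1.0
    for episode_id, vec in episodes.items():
        score = cosine(new_vec, vec)
        if score > best_score:
            best_id, best_score = episode_id, score
    if best_id is not None and best_score >= threshold:
        return best_id, best_score
    return None
```

A hit suggests the new interaction reinforces something already seen; a `None` suggests it is new and should go through the graph diff as a fresh episode.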

The MCP Interface Surface

The MCP interface is how the main agent talks to Experience Memory during conversations. It should expose a focused set of tools:

| MCP Tool | Purpose | Called When |
| --- | --- | --- |
| em_query | Query the knowledge graph | Agent needs context about a person, topic, preference |
| em_report_interaction | Report a completed interaction for processing | After every user conversation turn |
| em_get_probes | Pull pending probes matching current context | Agent checks for contextual probing opportunities |
| em_get_starters | Pull pending conversation starters | Agent has an opening to initiate contact |
| em_user_correction | User explicitly corrects a fact | User says "actually, that's wrong" |
| em_get_provenance | Explain why the agent knows something | User asks "why did you suggest that?" |
| em_graph_snapshot | Export current graph state for user inspection | User wants to see what the agent knows |

But here's what makes it more than an MCP server: most of Experience Memory's work happens without any MCP call. The background workers, revision scheduler, inference engine, and event monitor are all running independently. The MCP interface is just the synchronous query surface for the agent.
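
To show how thin that synchronous surface can be, here is a sketch of a tool registry and dispatcher for two of the tools. It deliberately does not use any MCP SDK: the `tool` decorator, the `dispatch` function, the argument names, and the in-memory `GRAPH` dictionary are all hypothetical stand-ins, since the tool list above names the tools but not their schemas.

```python
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}

# Stand-in for the knowledge graph; a real handler would query Neo4j.
GRAPH: Dict[str, Dict[str, str]] = {"Lena": {"relation": "wife"}}


def tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register a handler under its tool name."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return register


@tool("em_query")
def em_query(entity: str) -> Dict[str, str]:
    # Return a copy so callers can't mutate the graph by accident.
    return dict(GRAPH.get(entity, {}))


@tool("em_user_correction")
def em_user_correction(entity: str, field: str, value: str) -> Dict[str, str]:
    # An explicit user correction overwrites whatever was stored before.
    GRAPH.setdefault(entity, {})[field] = value
    return {"status": "corrected", "entity": entity, "field": field}


def dispatch(tool_name: str, **arguments: Any) -> Any:
    """Route one tool call to its handler; unknown tools are an error."""
    if tool_name not in TOOLS:
        raise KeyError(f"unknown tool: {tool_name}")
    return TOOLS[tool_name](**arguments)
```

Notice what is absent: no scheduler, no workers, no queues. Those live elsewhere in the service, and this layer only reads from and writes to the state they maintain.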

One More Critical Consideration: The Extraction Pipeline is Your Bottleneck

Everything in the system — the graph quality, the probe relevance, the inference accuracy, the suggestion quality — depends on the extraction pipeline getting it right. If extraction is noisy, the graph fills with garbage and every downstream component degrades.

I'd recommend structuring extraction as a pipeline of distinct stages:

graph LR
    MSG["Raw Message"] --> TOK["Tokenization &<br/>Preprocessing"]
    TOK --> ENT["Entity<br/>Extraction"]
    ENT --> REL["Relation<br/>Extraction"]
    REL --> SENT["Sentiment &<br/>Hedging<br/>Detection"]
    SENT --> TEMP["Temporal<br/>Scope<br/>Detection"]
    TEMP --> CONF["Confidence<br/>Scoring"]
    CONF --> DIFF["Graph Diff<br/>Engine"]

    style MSG fill:#f0f0f0,stroke:#999
    style DIFF fill:#fff3cd,stroke:#856404

Each stage can be independently tested, evaluated, and improved. If entity extraction is weak, you fix that stage without touching the rest. If hedging detection is generating false positives, you tune that classifier independently.
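
As an example of what "independently testable" buys you, here is the final stage, the graph diff, written as a pure function. It is a simplification under a loud assumption: each (subject, predicate) pair holds a single value, so a different object counts as a contradiction. Multi-valued predicates, confidence, and temporal scope are ignored, and the names are hypothetical.

```python
from enum import Enum
from typing import Dict, Tuple


class Diff(Enum):
    NEW = "new"
    REINFORCEMENT = "reinforcement"
    CONTRADICTION = "contradiction"


Fact = Tuple[str, str, str]  # (subject, predicate, object)


def classify(fact: Fact, graph: Dict[Tuple[str, str], str]) -> Diff:
    """Compare one extracted fact against what the graph already holds.

    Assumes single-valued predicates: one object per (subject, predicate).
    """
    subject, predicate, obj = fact
    known = graph.get((subject, predicate))
    if known is None:
        return Diff.NEW
    return Diff.REINFORCEMENT if known == obj else Diff.CONTRADICTION
```

Because the function takes plain data in and returns an enum, you can build a table of labeled cases and evaluate this stage without running any other part of the pipeline.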

And each stage should produce structured output that feeds the next stage — not free-form LLM text. Use the small LLM with constrained JSON output at each stage so you get predictable, parseable results.
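
Here is a sketch of what enforcing that looks like at a stage boundary, using entity extraction as the example. The JSON shape (`{"entities": [{"text", "type"}]}`) and the allowed type list are assumptions for illustration, not a settled schema. The idea is that the stage either hands on validated, typed objects or fails loudly, so malformed model output never reaches the graph.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical closed vocabulary for entity types.
ALLOWED_TYPES = {"person", "event", "preference", "place"}


@dataclass(frozen=True)
class Entity:
    text: str
    type: str


def parse_entities(raw: str) -> List[Entity]:
    """Validate the entity stage's output; raise ValueError on anything off-schema."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict) or not isinstance(data.get("entities"), list):
        raise ValueError("expected an object with an 'entities' list")
    entities = []
    for item in data["entities"]:
        if not isinstance(item, dict) or set(item) != {"text", "type"}:
            raise ValueError(f"unexpected entity shape: {item!r}")
        if not isinstance(item["text"], str):
            raise ValueError(f"entity text must be a string: {item!r}")
        if not (isinstance(item["type"], str) and item["type"] in ALLOWED_TYPES):
            raise ValueError(f"unknown entity type: {item!r}")
        entities.append(Entity(item["text"], item["type"]))
    return entities
```

On a validation failure the caller can retry the small model, or drop the message from extraction entirely, which is usually better than writing a guess into the graph.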

Summary: What Experience Memory Actually Is

| Question | Answer |
| --- | --- |
| Is it an MCP server? | It exposes an MCP interface, but it's much more than that |
| Is it a Python package? | The core is Python, but it runs as an independent service, not an importable library |
| Is it a background process? | Yes, that's its primary mode. Background revision, inference, and monitoring run continuously |
| What's its LLM strategy? | Dual LLM: small/fast for extraction, large for inference and synthesis |
| What's its storage? | Neo4j (knowledge graph) + Vector store (episode similarity) + queues (coordination) |
| How does the agent talk to it? | MCP tools for synchronous queries + inbound queue for async event reporting |
| How does it talk to the agent? | Outbound queue: the agent pulls probes and starters when contextually appropriate |
| Where does it run? | Inside the user's Confidential VM, alongside the agent but as a separate process |

The mental model: Experience Memory is the agent's subconscious. It's always running, always processing, always connecting dots — but it surfaces insights to the conscious agent (via the outbound queue) only when they're relevant and timely.

Ready to move on to the detailed architectural diagram document? I'd suggest we cover the full system topology — the CVM internals showing how Experience Memory sits alongside the agent runtime, the inter-process communication paths, the data flows for all four acquisition patterns, and the external interfaces.