Experience Memory: Integration, Testing, and Configuration

SecretAI Rails — Engineering Specification v1.0 — February 2026


Part I: System Integration


1. Integration Overview

Experience Memory connects to the rest of the SecretAI Rails platform through three interaction modes: synchronous, asynchronous, and background. Each has distinct reliability and latency requirements.

graph TB
    subgraph Synchronous ["Critical Path"]
        direction LR
        S1["Agent calls em_query<br/>during conversation"]
        S2["Agent calls em_get_probes<br/>between turns"]
        S3["Agent calls em_get_provenance<br/>when user asks 'why?'"]
    end

    subgraph Asynchronous ["Fire and Forget"]
        direction LR
        A1["Agent reports interaction<br/>via inbound queue"]
        A2["Agent reports user correction<br/>via em_user_correction"]
        A3["Skills report outcomes<br/>via event bus"]
    end

    subgraph Background ["Autonomous"]
        direction LR
        B1["Revision workers<br/>audit graph overnight"]
        B2["Inference workers<br/>discover patterns"]
        B3["Event monitor<br/>polls external sources"]
        B4["Calendar trigger<br/>checks upcoming events"]
    end

    S1 -->|"< 100ms"| REQ["Requirement:<br/>Fast, graceful fallback"]
    A1 -->|"< 10ms enqueue"| REQ2["Requirement:<br/>Buffer, never block"]
    B1 -->|"No latency constraint"| REQ3["Requirement:<br/>Don't starve sync path"]

    style S1 fill:#fdd,stroke:#c00
    style S2 fill:#fdd,stroke:#c00
    style S3 fill:#fdd,stroke:#c00
    style A1 fill:#fff3cd,stroke:#856404
    style A2 fill:#fff3cd,stroke:#856404
    style A3 fill:#fff3cd,stroke:#856404
    style B1 fill:#d4edda,stroke:#155724
    style B2 fill:#d4edda,stroke:#155724
    style B3 fill:#d4edda,stroke:#155724
    style B4 fill:#d4edda,stroke:#155724

2. Agent Runtime ↔ Experience Memory

This is the primary integration surface. The Agent Runtime is the main consumer of Experience Memory.

2.1 Conversation Lifecycle Integration

sequenceDiagram
    participant U as User
    participant AR as Agent Runtime
    participant EM_MCP as EM MCP Server
    participant IQ as Inbound Queue
    participant OQ as Outbound Queue
    participant NEO as Neo4j

    Note over AR: User message arrives

    rect rgb(255, 235, 235)
        Note over AR,EM_MCP: Phase 1: Context Retrieval (Sync, Critical Path)
        AR->>EM_MCP: em_query({entity: "Lena",<br/>min_confidence: 0.5, max_hops: 2})
        EM_MCP->>NEO: Cypher query
        NEO-->>EM_MCP: Subgraph result
        EM_MCP-->>AR: {entities: [...], edges: [...]}
        Note over AR: Agent now has context<br/>about Lena for this turn
    end

    rect rgb(235, 255, 235)
        Note over AR: Phase 2: Generate Response
        AR->>AR: LLM call with graph context<br/>injected into system prompt
        AR->>U: Response delivered
    end

    rect rgb(255, 248, 220)
        Note over AR,IQ: Phase 3: Report Interaction (Async, Non-blocking)
        AR->>IQ: em_report_interaction({<br/>text: "...",<br/>entities_mentioned: ["Lena", "wine"],<br/>session_id: "...",<br/>timestamp: "..."})
        Note over IQ: Queued for extraction.<br/>Agent does NOT wait.
    end

    rect rgb(220, 235, 255)
        Note over AR,OQ: Phase 4: Check for Probes (Sync, Optional)
        AR->>OQ: em_get_probes({<br/>active_topics: ["wine"],<br/>entities_in_scope: ["Lena"]})
        OQ-->>AR: Probe: "Lena's birthday..."<br/>or empty []

        alt Probe available and context fits
            AR->>U: "By the way — Lena's birthday..."
        else No probe or poor fit
            Note over AR: Skip, continue normally
        end
    end
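
The `em_query` call in Phase 1 is ultimately answered by a Cypher query. A minimal sketch of a parameterized query builder follows. It assumes a hypothetical schema in which entities are `(:Entity {name})` nodes and every relationship carries a `confidence` property; the label, the function name `build_entity_query`, and the hop cap are illustrative and do not come from this specification. Caller-supplied values travel as query parameters instead of being spliced into the query text. Cypher has historically not accepted a parameter for the hop bound of a variable-length pattern, so `max_hops` is validated as a small integer before it is formatted in.

```python
from typing import Any

MAX_ALLOWED_HOPS = 5  # illustrative safety cap, not taken from the spec


def build_entity_query(
    name: str, min_confidence: float = 0.5, max_hops: int = 2
) -> tuple[str, dict[str, Any]]:
    """Return (cypher, params) for a confidence-filtered neighbourhood query."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_hops, bool) or not isinstance(max_hops, int):
        raise TypeError("max_hops must be an int")
    if not 1 <= max_hops <= MAX_ALLOWED_HOPS:
        raise ValueError(f"max_hops must be between 1 and {MAX_ALLOWED_HOPS}")
    if not 0.0 <= min_confidence <= 1.0:
        raise ValueError("min_confidence must be within [0, 1]")

    # Only the validated integer is interpolated; name and threshold stay parameters.
    cypher = (
        f"MATCH path = (e:Entity {{name: $name}})-[*1..{max_hops}]-(n) "
        "WHERE all(r IN relationships(path) WHERE r.confidence >= $min_confidence) "
        "RETURN path"
    )
    return cypher, {"name": name, "min_confidence": min_confidence}
```

Because the entity name never reaches the query string, a name containing quotes or Cypher keywords cannot change the shape of the query.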

2.2 Failure Modes and Fallbacks

Every integration point must have a defined failure mode. Experience Memory is an enhancement, not a dependency — the agent must always be able to respond even if EM is completely down.
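
As a rough illustration of "enhancement, not a dependency", the sketch below wraps a synchronous-path query in a time budget and falls back to an empty context on failure. The 100 ms default mirrors the latency figure in the integration overview; the wrapper name `query_with_fallback`, the `(context, degraded)` return shape, and the injected `em_query` coroutine are assumptions made for the example.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

log = logging.getLogger("em.client")


async def query_with_fallback(
    em_query: Callable[[dict[str, Any]], Awaitable[dict[str, Any]]],
    params: dict[str, Any],
    timeout_s: float = 0.1,
) -> tuple[dict[str, Any], bool]:
    """Return (context, degraded). EM failures never propagate to the caller."""
    try:
        result = await asyncio.wait_for(em_query(params), timeout=timeout_s)
        return result, False
    except (asyncio.TimeoutError, OSError) as exc:
        # Log the missed query and continue with conversation history only.
        log.warning("em_query unavailable (%s); responding without graph context",
                    type(exc).__name__)
        return {"entities": [], "edges": []}, True
```

The `degraded` flag lets the caller record that the turn ran without personalization while still answering the user.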

graph TB
    subgraph em_query Failure
        F1_TRIGGER["Neo4j timeout<br/>or EM process down"]
        F1_ACTION["Fallback: Agent responds<br/>without graph context.<br/>Use conversation history only."]
        F1_RECOVER["Log missed query.<br/>EM restarts automatically<br/>via supervisor."]
    end

    subgraph em_report_interaction Failure
        F2_TRIGGER["Redis queue full<br/>or EM ingestion stalled"]
        F2_ACTION["Fallback: Buffer in Agent's<br/>local memory (last 100 events).<br/>Retry on reconnect."]
        F2_RECOVER["Agent drains buffer<br/>to queue when EM recovers."]
    end

    subgraph em_get_probes Failure
        F3_TRIGGER["Outbound queue<br/>unreachable"]
        F3_ACTION["Fallback: Skip probing.<br/>Conversation continues normally.<br/>User never notices."]
        F3_RECOVER["Probes remain in queue.<br/>Delivered on next successful check."]
    end

    subgraph Background Worker Failure
        F4_TRIGGER["Revision or inference<br/>worker crashes"]
        F4_ACTION["Fallback: No immediate impact.<br/>Knowledge graph is stale<br/>but still functional."]
        F4_RECOVER["Supervisor restarts worker.<br/>Missed batch runs at next<br/>scheduled window."]
    end

    style F1_TRIGGER fill:#fdd,stroke:#c00
    style F2_TRIGGER fill:#fdd,stroke:#c00
    style F3_TRIGGER fill:#fdd,stroke:#c00
    style F4_TRIGGER fill:#fdd,stroke:#c00
    style F1_ACTION fill:#fff3cd,stroke:#856404
    style F2_ACTION fill:#fff3cd,stroke:#856404
    style F3_ACTION fill:#fff3cd,stroke:#856404
    style F4_ACTION fill:#fff3cd,stroke:#856404
    style F1_RECOVER fill:#d4edda,stroke:#155724
    style F2_RECOVER fill:#d4edda,stroke:#155724
    style F3_RECOVER fill:#d4edda,stroke:#155724
    style F4_RECOVER fill:#d4edda,stroke:#155724

| Integration Point | Failure Impact | User Visible? | Recovery |
| --- | --- | --- | --- |
| em_query timeout | Agent responds without graph context | Subtle: less personalized response | Auto-retry next turn |
| em_query returns empty | Agent has no knowledge about topic | Subtle: behaves like new agent | Normal: graph may not have data yet |
| Inbound queue full | Interactions not processed, graph stops learning | No | Buffer in agent, drain on recovery |
| Inbound queue slow | Extraction delayed, probes arrive late | No | Queue catches up during idle |
| Outbound queue unreachable | No probes or starters delivered | No: agent just doesn't probe | Probes accumulate, delivered on recovery |
| Neo4j down | All graph queries fail | Yes: no personalization | Supervisor restarts, agent uses fallback |
| Small LLM down | Extraction pipeline stalls | No (async) | Queue buffers, processes on recovery |
| Large LLM down | Inference and synthesis stall | No (background) | Retries at next scheduled window |
| Redis down | All queues fail | Partial: no probes, interactions buffer | Supervisor restarts, agents use local buffer |
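
The "buffer in agent, drain on recovery" behaviour could look roughly like the sketch below. It keeps the last 100 events in a bounded deque, as the failure diagram describes, and drains them in order once the queue accepts writes again. The class name `InteractionBuffer` and the injected `enqueue` callable are hypothetical; a real client would pass in whatever function pushes to the inbound queue.

```python
from collections import deque
from typing import Any, Callable


class InteractionBuffer:
    """Bounded local buffer in front of the inbound queue. report() never raises."""

    def __init__(self, enqueue: Callable[[dict[str, Any]], None], capacity: int = 100):
        self._enqueue = enqueue
        # When full, deque(maxlen=...) silently drops the oldest event.
        self._pending: deque[dict[str, Any]] = deque(maxlen=capacity)

    def __len__(self) -> int:
        return len(self._pending)

    def report(self, event: dict[str, Any]) -> None:
        # Queue behind anything already buffered so ordering is preserved.
        self._pending.append(event)
        self.drain()

    def drain(self) -> int:
        """Send buffered events oldest-first; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._enqueue(self._pending[0])
            except (ConnectionError, TimeoutError):
                break
            self._pending.popleft()
            sent += 1
        return sent
```

An event is removed from the buffer only after the enqueue call returns, so a failure mid-drain leaves the remaining events in place for the next attempt.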

2.3 Context Injection into Agent's LLM Prompt

When the Agent Runtime queries Experience Memory, the results need to be formatted and injected into the LLM's context. This is where the graph becomes actionable.

graph TB
    subgraph Agent builds LLM prompt
        SYS["System Prompt<br/>(agent personality, rules)"]
        EM_CTX["Experience Memory Context Block<br/>(injected from em_query results)"]
        CONV["Conversation History<br/>(recent turns)"]
        USER_MSG["Current User Message"]
    end

    SYS --> PROMPT["Assembled Prompt"]
    EM_CTX --> PROMPT
    CONV --> PROMPT
    USER_MSG --> PROMPT
    PROMPT --> LLM["LLM"]

    style EM_CTX fill:#d4edda,stroke:#155724

Context block format — structured, concise, confidence-annotated:

<experience_context>
  <entity name="Lena" relation="wife">
    <fact confidence="0.90" type="trait">Likes flowers</fact>
    <fact confidence="0.90" type="trait">Likes postcards</fact>
    <fact confidence="0.85" type="trait">Prefers red wine, especially Malbec</fact>
    <fact confidence="0.55" type="wish" expires="2026-12">May want a new kitchen chair set</fact>
    <fact confidence="0.99" type="state">Birthday: December 2</fact>
  </entity>
  <pending_probe topic="wine" priority="0.85">
    Lena's birthday is ~1 month away. Knowledge gap: does she enjoy wine subscriptions?
  </pending_probe>
  <active_reminders>
    Birthday reminder for Lena: fires November 25
  </active_reminders>
</experience_context>

The agent's system prompt includes instructions on how to use this context: prefer high-confidence facts, hedge when using low-confidence facts, never reveal raw confidence scores to the user, and integrate probes naturally into conversation when context fits.
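
A small sketch of how query results might be rendered into the context block shown above. The input shape (a list of entity dictionaries with a `facts` list) and the function name `build_context_block` are assumptions for illustration; the element and attribute names follow the example block. Facts under a confidence floor are left out, using the 0.5 value from the `em_query` example as an assumed default.

```python
import xml.etree.ElementTree as ET


def build_context_block(entities: list[dict], min_confidence: float = 0.5) -> str:
    """Render entity facts as an <experience_context> XML fragment."""
    root = ET.Element("experience_context")
    for ent in entities:
        node = ET.SubElement(root, "entity", name=ent["name"], relation=ent["relation"])
        # Highest-confidence facts first, so truncation drops the weakest ones.
        facts = sorted(ent["facts"], key=lambda f: f["confidence"], reverse=True)
        for fact in facts:
            if fact["confidence"] < min_confidence:
                continue
            attrs = {"confidence": f"{fact['confidence']:.2f}", "type": fact["type"]}
            if "expires" in fact:
                attrs["expires"] = fact["expires"]
            ET.SubElement(node, "fact", attrs).text = fact["text"]
    ET.indent(root)  # available from Python 3.9
    return ET.tostring(root, encoding="unicode")
```

ElementTree escapes characters such as `<` and `&` in fact text, so user-derived content cannot break the structure of the block.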


3. Session Manager ↔ Experience Memory

The Session Manager routes messages from channel adapters to the Agent Runtime. It also provides session metadata that Experience Memory needs for context.

sequenceDiagram
    participant CH as Channel Adapter
    participant SM as Session Manager
    participant AR as Agent Runtime
    participant EM as Experience Memory

    CH->>SM: Incoming message from Telegram
    SM->>SM: Resolve session:<br/>user_id, channel, session_id,<br/>conversation_start, turn_count

    SM->>AR: Deliver message + session metadata

    AR->>EM: em_report_interaction({<br/>...,<br/>channel: "telegram",<br/>session_id: "sess_abc123",<br/>turn_number: 7,<br/>conversation_topic: "wine"})

    Note over EM: Channel and session metadata<br/>help EM understand context:<br/>- Telegram = casual channel<br/>- Turn 7 = mid-conversation<br/>- Topic = wine

| Session Metadata | Used By EM For |
| --- | --- |
| channel | Adjusting probe formality (Slack = professional, Telegram = casual) |
| session_id | Grouping interactions into episodes |
| turn_number | Determining if now is appropriate for probing (not turn 1) |
| conversation_start | Calculating conversation duration for episode records |
| user_timezone | Scheduling starters and revision delivery |
| user_locale | Language and cultural context for probes |

4. Skill Orchestrator ↔ Experience Memory

Skills (MCP servers) don't talk to Experience Memory directly — they're sandboxed. But the Skill Orchestrator reports skill outcomes to EM, and EM can influence which skills get invoked and how.

sequenceDiagram
    participant AR as Agent Runtime
    participant SO as Skill Orchestrator
    participant SK as MCP Skill:<br/>Deal Monitor
    participant EM as Experience Memory
    participant IQ as Inbound Queue

    AR->>SO: Invoke Deal Monitor skill<br/>for kitchen chair deals

    SO->>SK: Execute: monitor({<br/>query: "kitchen chair set",<br/>budget: "under $500",<br/>notify_channel: "telegram"})

    SK-->>SO: Acknowledged: monitoring started

    Note over SK: Days later...

    SK->>SO: Result: Deal found!<br/>$299 at Wayfair, 40% off

    SO->>AR: Skill result: deal found
    SO->>IQ: Skill outcome event:{<br/>skill: "deal_monitor",<br/>task: "kitchen chairs for Lena",<br/>outcome: "success",<br/>result: {price: 299, store: "Wayfair"}}

    IQ->>EM: Process skill outcome
    EM->>EM: Update procedure:<br/>"Deal monitoring for gifts"<br/>success_rate += 1<br/>Link to episode

    AR->>AR: Format and deliver to user

Experience Memory Informing Skill Invocation

sequenceDiagram
    participant U as User
    participant AR as Agent Runtime
    participant EM as EM MCP Server
    participant SO as Skill Orchestrator
    participant SK as MCP Skill

    U->>AR: "Book me a flight to Tokyo"

    AR->>EM: em_query({entity: "User",<br/>relation: "travel_preferences"})
    EM-->>AR: {<br/>prefers: "direct flights",<br/>budget_style: "budget-conscious",<br/>seat_preference: "aisle",<br/>airline_loyalty: "ANA",<br/>confidence: 0.80}

    AR->>SO: Invoke flight search skill<br/>with EM-enriched parameters:<br/>{destination: "Tokyo",<br/>prefer_direct: true,<br/>prefer_airline: "ANA",<br/>seat: "aisle",<br/>sort_by: "price"}

    Note over AR: Without EM, agent would ask<br/>user all these questions.<br/>With EM, it already knows.

5. Control UI and Studio ↔ Experience Memory

The Control UI provides the user-facing interface for graph inspection, and Studio needs EM context for workflow design.

graph TB
    subgraph Control UI
        GRAPH_VIZ["Graph Visualizer<br/>Interactive node/edge explorer"]
        FACT_MGR["Fact Manager<br/>View, edit, delete facts"]
        CONF_PANEL["Confidence Dashboard<br/>Edge confidence distribution"]
        ACTIVITY["Activity Log<br/>Recent extractions,<br/>probes delivered"]
        SETTINGS["EM Settings<br/>Configuration knobs"]
    end

    subgraph Studio
        WF_CTX["Workflow Context<br/>Inject EM knowledge<br/>into workflow conditions"]
        WF_TRIGGER["Workflow Triggers<br/>Fire workflow when<br/>graph state changes"]
    end

    subgraph EM API
        MCP_Q["em_query"]
        MCP_SNAP["em_graph_snapshot"]
        MCP_DEL["em_delete_node/edge"]
        MCP_PROV["em_get_provenance"]
        EVENTS["Event Stream<br/>(WebSocket)"]
    end

    GRAPH_VIZ -->|"Read"| MCP_SNAP
    FACT_MGR -->|"Read/Delete"| MCP_DEL
    FACT_MGR -->|"Why?"| MCP_PROV
    CONF_PANEL -->|"Stats"| MCP_SNAP
    ACTIVITY -->|"Stream"| EVENTS
    SETTINGS -->|"Config API"| EM_CONFIG["em_update_config"]

    WF_CTX -->|"Read"| MCP_Q
    WF_TRIGGER -->|"Subscribe"| EVENTS

    style GRAPH_VIZ fill:#d4edda,stroke:#155724
    style SETTINGS fill:#fff3cd,stroke:#856404
    style WF_TRIGGER fill:#cce5ff,stroke:#004085

Studio Workflow Trigger Example

A user designs a workflow in Studio: "When someone I know has a birthday within 14 days AND I don't have a gift planned, start the gift research procedure."

graph LR
    TRIGGER["Graph Trigger:<br/>(:Person)-[:BIRTHDAY_WITHIN]->(14 days)<br/>AND NOT (:Person)-[:HAS_GIFT_PLAN]->()"]

    TRIGGER --> COND["Condition:<br/>Person is in user's<br/>inner circle<br/>(confidence > 0.8)"]

    COND --> ACTION["Action:<br/>1. Query EM for person's likes<br/>2. Spawn deal monitor skill<br/>3. Create reminder 7 days before"]

    style TRIGGER fill:#fff3cd,stroke:#856404
    style ACTION fill:#d4edda,stroke:#155724

6. Voice Pipeline ↔ Experience Memory

The STT-TTS pipeline introduces unique challenges: speech is noisier than text, and voice conversations tend to be more casual and faster-paced.

sequenceDiagram
    participant U as User (Voice)
    participant STT as STT Engine
    participant SM as Session Manager
    participant AR as Agent Runtime
    participant EM as Experience Memory
    participant TTS as TTS Engine

    U->>STT: [Speech audio]
    STT->>SM: Transcript: "Hey can you remind me<br/>to grab flowers for Lena<br/>on my way home?"
    STT->>SM: Metadata: {<br/>confidence: 0.92,<br/>language: "en",<br/>noise_level: "low",<br/>speaking_rate: "fast"}

    SM->>AR: Deliver + metadata

    AR->>EM: em_query({entity: "Lena"})
    EM-->>AR: {relation: "wife",<br/>likes: ["flowers"], ...}

    AR->>AR: Generate response
    AR->>TTS: "Got it — I'll remind you<br/>to pick up flowers for Lena<br/>when you're heading home.<br/>Any particular kind she likes?"

    TTS->>U: [Speech audio]

    AR->>EM: em_report_interaction({<br/>...,<br/>channel: "voice",<br/>stt_confidence: 0.92,<br/>speaking_rate: "fast"})

    Note over EM: Voice metadata affects extraction:<br/>- Lower STT confidence → lower fact confidence<br/>- Fast speaking rate → more likely to contain<br/>  casual/imprecise statements<br/>- Voice channel → adjust probe tone to conversational

| Voice-Specific Consideration | How EM Handles It |
| --- | --- |
| STT transcription errors | Reduce extraction confidence proportionally to STT confidence score |
| Casual/imprecise language | Widen hedging detection thresholds |
| Faster conversation pace | Reduce probe frequency (1 per 5 turns instead of 1 per conversation) |
| No visual UI available | Probes must be short and conversational |
| Ambient context (driving, cooking) | Adjust starter urgency thresholds based on detected activity |
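
One plausible reading of "reduce extraction confidence proportionally to STT confidence score" is a plain product of the two scores. The sketch below takes that reading as an assumption; the specification does not state the exact formula, and the function name is illustrative.

```python
def voice_adjusted_confidence(extraction_confidence: float, stt_confidence: float) -> float:
    """Scale a fact's confidence by the transcript's STT confidence (assumed: product)."""
    for label, value in (
        ("extraction_confidence", extraction_confidence),
        ("stt_confidence", stt_confidence),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{label} must be within [0, 1], got {value}")
    return extraction_confidence * stt_confidence
```

Under this reading, a fact extracted at 0.90 from a transcript with STT confidence 0.92 would be stored at roughly 0.83.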

7. Event Sources ↔ Experience Memory

The Event Monitor polls external sources and cross-references against the knowledge graph. Integration points:

| Event Source | Polling Method | Data Format | Relevance Filter |
| --- | --- | --- | --- |
| Weather API | REST poll every 30 min | JSON (location, severity, timeframe) | Match against user's lives_in, planning |
| News API | REST poll every 15 min | JSON (headlines, topics, entities) | Match against user's interested_in, works_at |
| Market data | WebSocket stream (if enabled) | JSON (ticker, price, change) | Match against user's invested_in, tracks |
| Calendar (user's own) | CalDAV sync every 5 min | iCal events | Upcoming events within N-day lookahead |
| Email digest (if email skill enabled) | Skill reports summaries | Structured JSON | Match against known entities and active tasks |

graph TB
    subgraph External Sources
        WEATHER["Weather API<br/>Poll: 30 min"]
        NEWS["News API<br/>Poll: 15 min"]
        MARKET["Market Data<br/>WebSocket stream"]
        CAL["Calendar<br/>CalDAV: 5 min"]
    end

    subgraph Event Monitor
        POLL["Poller / Subscriber"]
        NORM["Event Normalizer<br/>Canonical event schema"]
        REL["Relevance Filter<br/>Graph-backed scoring"]
    end

    subgraph Proactive Engine
        PE["Trigger Evaluation"]
        OQ["Outbound Queue"]
    end

    WEATHER --> POLL
    NEWS --> POLL
    MARKET --> POLL
    CAL --> POLL

    POLL --> NORM --> REL
    REL -->|"Score > 0.50"| PE
    REL -->|"Score < 0.50"| DISCARD["Discard"]
    PE --> OQ

    style REL fill:#fff3cd,stroke:#856404
    style PE fill:#d4edda,stroke:#155724

Part II: Testing Strategy


8. Testing Pyramid

Experience Memory requires testing at five levels, each catching different classes of bugs.

graph TB
    subgraph Level 5: Experience Quality Evaluation
        L5["Does the agent feel like it<br/>knows the user after 50 conversations?<br/>Human evaluation + LLM-as-judge"]
    end

    subgraph Level 4: Scenario Simulation
        L4["Multi-conversation scenarios<br/>running compressed timelines.<br/>Verify emergent behavior."]
    end

    subgraph Level 3: Integration Tests
        L3["Component-to-component:<br/>Agent → EM → Neo4j → Outbound Queue<br/>Full data flow verification."]
    end

    subgraph Level 2: Component Tests
        L2["Extraction Pipeline, Graph Diff Engine,<br/>Proactive Engine, Background Workers<br/>tested in isolation with mocked deps."]
    end

    subgraph Level 1: Unit Tests
        L1["Individual functions:<br/>confidence scoring formula,<br/>hedge detection, temporal parsing,<br/>Cypher query builders."]
    end

    L1 --> L2 --> L3 --> L4 --> L5

    style L1 fill:#d4edda,stroke:#155724
    style L2 fill:#d4edda,stroke:#155724
    style L3 fill:#fff3cd,stroke:#856404
    style L4 fill:#cce5ff,stroke:#004085
    style L5 fill:#e8daef,stroke:#6c3483

9. Level 1 — Unit Tests

| Component | Test Focus | Example Test Cases |
| --- | --- | --- |
| Confidence scorer | Formula correctness | explicit + no_hedge + strong_sentiment = 0.90 × 1.0 × 1.0 = 0.90 |
| | | inferred + moderate_hedge + neutral = 0.45 × 0.65 × 0.80 = 0.234 |
| | Boundary conditions | confidence never exceeds 1.0 after reinforcement |
| | | confidence never goes below 0.0 after decay |
| Hedge detector | Keyword classification | "loves" → no_hedge, "may like" → moderate_hedge |
| | Edge cases | "I don't think she hates it" → complex negation handling |
| Temporal parser | Type classification | "this year" → wish, expires: 2026-12-31 |
| | | "always" → trait, no expiry |
| | | "last Tuesday" → episode, occurred_at: 2026-02-10 |
| | Relative dates | "next month" → correct absolute date |
| Confidence decay | Math correctness | 0.85 × (1 - 0.01) = 0.8415 after 1 cycle |
| | | Edge archived when confidence < 0.15 |
| Cypher builders | Query correctness | query_entity("Lena", min_confidence=0.5) → valid Cypher |
| | Injection safety | entity name with quotes/special chars → escaped |

Test count target: ~200 unit tests covering all formula paths and edge cases.
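
To make the scorer and decay cases concrete, here is a minimal sketch that reproduces the arithmetic in the example test cases. The multipliers are read directly from those cases; treating the formula as a straight product, compounding decay over several cycles, and the function names are assumptions, since the formula itself is defined elsewhere in the specification.

```python
BASE = {"explicit": 0.90, "inferred": 0.45}
HEDGE = {"no_hedge": 1.0, "moderate_hedge": 0.65}
SENTIMENT = {"strong_sentiment": 1.0, "neutral": 0.80}
DECAY_RATE = 0.01          # per cycle, from the decay test case
ARCHIVE_THRESHOLD = 0.15   # from the archival test case


def clamp(value: float) -> float:
    """Keep confidence inside [0.0, 1.0]."""
    return max(0.0, min(1.0, value))


def score(source: str, hedge: str, sentiment: str) -> float:
    """Confidence as the product of source, hedge and sentiment factors."""
    return clamp(BASE[source] * HEDGE[hedge] * SENTIMENT[sentiment])


def decay(confidence: float, cycles: int = 1) -> float:
    """Apply multiplicative decay; compounding across cycles is an assumption."""
    return clamp(confidence * (1 - DECAY_RATE) ** cycles)


def should_archive(confidence: float) -> bool:
    return confidence < ARCHIVE_THRESHOLD
```

Unit tests at this level can then assert each row of the table with a tolerance, for example `math.isclose(score("inferred", "moderate_hedge", "neutral"), 0.234)`.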


10. Level 2 — Component Tests

Each major component tested in isolation with mocked dependencies.

10.1 Extraction Pipeline Tests

graph LR
    INPUT["Test Input:<br/>Curated conversation<br/>transcripts with<br/>known entities"] --> EP["Extraction<br/>Pipeline"]

    MOCK_LLM["Mock Small LLM<br/>Returns predetermined<br/>structured JSON"] --> EP

    EP --> OUTPUT["Pipeline Output:<br/>Extracted edges<br/>with scores"]

    OUTPUT --> ASSERT["Assertions:<br/>- Correct entities found<br/>- Relations properly typed<br/>- Confidence scores in range<br/>- Temporal types assigned<br/>- Hedges detected"]

    style INPUT fill:#cce5ff,stroke:#004085
    style MOCK_LLM fill:#e8daef,stroke:#6c3483
    style ASSERT fill:#d4edda,stroke:#155724

Test corpus structure:

| Category | Input Example | Expected Extraction |
| --- | --- | --- |
| Explicit fact | "My wife's name is Lena" | Person(Lena) →wife_of→ User, confidence: 0.90 |
| Hedged preference | "She may like kitchen chairs" | Lena →may_want→ Kitchen chairs, confidence: 0.55, type: wish |
| Strong preference | "She absolutely loves Malbec" | Lena →loves→ Malbec, confidence: 0.95, type: trait |
| Negation | "She doesn't drink beer" | Lena →dislikes→ Beer, confidence: 0.85, type: trait |
| Indirect inference | "Pick up the kids from school" | User →has_children→ Children, confidence: 0.85, source: inference |
| Temporal | "We're going to Tokyo in March" | User →planning→ Tokyo trip, type: wish, dates: March 2026 |
| Contradiction | "Actually she's turning 46, not 47" | Correction: Lena →age→ 46, revise previous |
| No new info | "Thanks, that's helpful" | No graph mutation |
| Complex sentence | "My wife would kill me if I forgot again, she's turning 47 in December" | Multiple: relationship confirmed, negative episode, age inferred, birthday month |

Minimum test corpus: 100 labeled examples across all categories, with at least 10 per category.

10.2 Graph Diff Engine Tests

graph LR
    subgraph Test Setup
        SEED["Seed Neo4j<br/>with known graph state"]
        INPUT_E["Input edges<br/>from extraction pipeline"]
    end

    subgraph Execution
        GDE["Graph Diff Engine"]
    end

    subgraph Assertions
        A1["INSERT: new edge created<br/>with correct properties"]
        A2["REINFORCE: confidence boosted,<br/>last_reinforced updated"]
        A3["CONTRADICT: conflict flagged,<br/>resolution correct"]
        A4["SKIP: no graph mutation<br/>when no new info"]
        A5["MERGE: existing edge enriched<br/>without duplication"]
    end

    SEED --> GDE
    INPUT_E --> GDE
    GDE --> A1
    GDE --> A2
    GDE --> A3
    GDE --> A4
    GDE --> A5

    style SEED fill:#cce5ff,stroke:#004085
    style GDE fill:#fff3cd,stroke:#856404
    style A1 fill:#d4edda,stroke:#155724

| Scenario | Seed State | Input | Expected Operation |
| --- | --- | --- | --- |
| Brand new entity | Empty graph | Lena →wife_of→ User | INSERT both nodes + edge |
| Known entity, new relation | Lena exists | Lena →likes→ Malbec | INSERT edge only |
| Reinforcement | Lena →likes→ Flowers (0.70) | Lena →likes→ Flowers (0.90) | REINFORCE: confidence → 0.82 |
| Contradiction | Lena →age→ 47 (0.80) | Lena →age→ 46 (0.90) | CONTRADICT → REVISE to 46 |
| More specific | Lena →likes→ Wine (0.85) | Lena →prefers→ Malbec (0.85) | MERGE: add specific edge, keep general |
| Duplicate | Lena →wife_of→ User (0.95) | Lena →wife_of→ User (0.90) | REINFORCE (not duplicate) |
| Expired wish | Kitchen chairs (expired 2025-12) | Kitchen chairs mentioned again | INSERT new wish with new expiry |
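
The operation selection exercised by these scenarios can be sketched as a small classifier. This is only an illustration under stated assumptions: the `Edge` shape, the `SINGLE_VALUED` and `BROADER_TERM` lookup tables, and the check order are hypothetical, the reinforcement arithmetic is deliberately left out because its formula is not given here, and the expired-wish case is not modelled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Op(Enum):
    INSERT = "insert"
    REINFORCE = "reinforce"
    CONTRADICT = "contradict"
    MERGE = "merge"
    SKIP = "skip"


@dataclass(frozen=True)
class Edge:
    subject: str
    relation: str
    obj: str
    confidence: float


# Hypothetical lookups; a real engine would derive these from the graph schema.
SINGLE_VALUED = {"age"}            # relations that can hold one value at a time
BROADER_TERM = {"Malbec": "Wine"}  # specific object -> more general object


def classify(incoming: Optional[Edge], existing: list[Edge]) -> Op:
    """Pick the diff operation for one incoming edge against the seeded edges."""
    if incoming is None:
        return Op.SKIP
    same_subject = [e for e in existing if e.subject == incoming.subject]
    for e in same_subject:
        if e.relation == incoming.relation and e.obj == incoming.obj:
            return Op.REINFORCE
    for e in same_subject:
        if e.relation == incoming.relation and e.relation in SINGLE_VALUED:
            return Op.CONTRADICT
    for e in same_subject:
        if BROADER_TERM.get(incoming.obj) == e.obj:
            return Op.MERGE
    return Op.INSERT
```

Checking for an exact match first is what makes the "Duplicate" scenario resolve to REINFORCE instead of creating a second edge.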

10.3 Proactive Engine Tests

| Scenario | Input Trigger | Graph State | Expected Output |
| --- | --- | --- | --- |
| Auto-execute | DOB discovered | Lena DOB: Dec 2 (0.99) | Auto-create birthday reminder |
| Suggest, high confidence | Deal found for known interest | Lena →likes→ Malbec (0.90) + deal | Suggestion in outbound queue |
| Suggest, low confidence | Deal found for hedged wish | Lena →may_want→ Chairs (0.55) + deal | Casual suggestion in queue |
| Defer | Possible gift idea, low confidence | Lena →may_want→ something (0.30) | Defer, queue for reinforcement |
| Probe generated | Knowledge gap + matching context | wine conversation + no wife wine pref | Probe in outbound queue |
| Probe suppressed | Knowledge gap + no matching context | coding conversation + no wife wine pref | Probe stays in queue |
| Frequency limit | 3 probes already delivered this week | Any trigger | Suppress, respect limit |
| Timing suppressed | Good probe, but 2am user time | Any trigger | Queue with earliest_delivery constraint |
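
The last four scenarios describe a delivery gate. A hedged sketch of such a gate is shown below; the weekly limit of 3 comes from the "Frequency limit" row, while the rolling 7-day window, the 22:00 to 08:00 quiet hours, the action names, and the function signature are assumptions for illustration. Times are expected in the user's local timezone.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta
from typing import Optional

MAX_PROBES_PER_WEEK = 3     # from the "Frequency limit" scenario
QUIET_START = time(22, 0)   # assumed quiet-hours window
QUIET_END = time(8, 0)


@dataclass(frozen=True)
class Decision:
    action: str  # "deliver", "hold", "suppress" or "delay"
    earliest_delivery: Optional[datetime] = None


def _quiet_end_after(now_local: datetime) -> datetime:
    """First moment after now_local at which quiet hours are over."""
    end = now_local.replace(hour=QUIET_END.hour, minute=QUIET_END.minute,
                            second=0, microsecond=0)
    # Late evening: quiet hours end on the following morning.
    return end + timedelta(days=1) if now_local.time() >= QUIET_START else end


def decide(probe_topics: set[str], active_topics: set[str],
           delivered_at: list[datetime], now_local: datetime) -> Decision:
    recent = [t for t in delivered_at if now_local - t < timedelta(days=7)]
    if len(recent) >= MAX_PROBES_PER_WEEK:
        return Decision("suppress")
    if not probe_topics & active_topics:
        return Decision("hold")  # probe stays in the queue for a better moment
    local_time = now_local.time()
    if local_time >= QUIET_START or local_time < QUIET_END:
        return Decision("delay", _quiet_end_after(now_local))
    return Decision("deliver")
```

For example, a wine probe evaluated at 02:00 during a wine conversation yields `delay` with `earliest_delivery` set to 08:00 the same day.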

11. Level 3 — Integration Tests

Full data flow tests with real Neo4j (test instance) and real Redis, but mocked LLMs.

graph TB
    subgraph Test Harness
        DRIVER["Test Driver<br/>Simulates agent interactions"]
        MOCK_AG["Mock Agent Runtime<br/>Sends interactions,<br/>pulls probes"]
    end

    subgraph RealServices ["Real Services (Dockerized)"]
        EM["Experience Memory<br/>Service"]
        NEO_T["Neo4j<br/>(test instance)"]
        REDIS_T["Redis<br/>(test instance)"]
        VEC_T["Vector Store<br/>(test instance)"]
    end

    subgraph Mocked Services
        LLM_MOCK["LLM Mock Server<br/>Deterministic responses<br/>from test corpus"]
    end

    DRIVER --> MOCK_AG
    MOCK_AG --> EM
    EM --> NEO_T
    EM --> REDIS_T
    EM --> VEC_T
    EM --> LLM_MOCK

    DRIVER -->|"Assert graph state"| NEO_T
    DRIVER -->|"Assert queue contents"| REDIS_T

    style DRIVER fill:#cce5ff,stroke:#004085
    style NEO_T fill:#d4edda,stroke:#155724
    style REDIS_T fill:#d4edda,stroke:#155724
    style LLM_MOCK fill:#e8daef,stroke:#6c3483

Integration Test Scenarios

| Test Name | Steps | Assertions |
| --- | --- | --- |
| Full extraction flow | Send 1 interaction → wait for extraction → check graph | Correct nodes/edges in Neo4j |
| Reinforcement over 3 turns | Send 3 interactions mentioning same preference | Edge confidence increases monotonically |
| Contradiction and revision | Send fact, then contradicting fact | Old edge revised, new edge created with higher confidence |
| Probe generation and delivery | Create knowledge gap, send matching-context interaction | Probe appears in outbound queue with correct context tags |
| Probe delivery timing | Create probe, send 5 interactions with wrong context | Probe NOT delivered until matching context |
| Background decay | Seed edges, trigger decay worker | Confidence decreased by correct amount |
| Background archival | Seed low-confidence edges, trigger archival | Edges below threshold removed |
| Inference chain | Seed multi-hop graph, trigger inference worker | New low-confidence edges created connecting distant nodes |
| Event-driven starter | Seed graph with location + event, inject weather alert | Starter in outbound queue with correct timing constraints |
| Skill outcome reporting | Simulate skill completion event | Episode created, procedure success_rate updated |
| Failure recovery | Kill Neo4j mid-extraction, restart | Extraction retries, no data loss, agent buffer drained |

Test infrastructure: Docker Compose with Neo4j, Redis, Vector Store, LLM mock server. Tests run in CI on every PR.


12. Level 4 — Scenario Simulation

This is the critical testing level. We need to verify that Experience Memory produces the right emergent behavior over time, not just correct individual operations.

12.1 Scenario Simulator

graph TB
    subgraph Scenario Definition
        PERSONA["User Persona<br/>Predefined personality,<br/>preferences, life situation"]
        SCRIPT["Conversation Script<br/>50-100 interactions<br/>spanning simulated weeks"]
        EXPECTED["Expected Graph State<br/>What EM should know<br/>after all interactions"]
    end

    subgraph Simulator
        SIM["Scenario Simulator<br/>Feeds interactions with<br/>simulated timestamps"]
        TIME["Time Warp<br/>Compress weeks<br/>into minutes.<br/>Trigger decay/revision<br/>between simulated days"]
    end

    subgraph Evaluation
        GRAPH_CMP["Graph Comparison<br/>Expected vs actual<br/>nodes, edges, confidence"]
        PROBE_EVAL["Probe Evaluation<br/>Were probes generated<br/>at the right moments?"]
        STARTER_EVAL["Starter Evaluation<br/>Were starters relevant<br/>and well-timed?"]
        FALSE_POS["False Positive Check<br/>Any incorrect facts<br/>stored with high confidence?"]
    end

    PERSONA --> SIM
    SCRIPT --> SIM
    SIM --> TIME
    TIME --> GRAPH_CMP
    TIME --> PROBE_EVAL
    TIME --> STARTER_EVAL
    TIME --> FALSE_POS
    EXPECTED --> GRAPH_CMP

    style SIM fill:#cce5ff,stroke:#004085
    style TIME fill:#fff3cd,stroke:#856404
    style FALSE_POS fill:#fdd,stroke:#c00

12.2 Test Personas

| Persona | Profile | Key Test Focus |
| --- | --- | --- |
| Alex the CTO | Married (Lena), 2 kids, works at FinTech startup, codes in Python, travels quarterly | Full lifecycle: family, work, coding preferences, travel patterns |
| Maya the Freelancer | Single, graphic designer, multiple clients, irregular schedule, budget-conscious | Temporal patterns with irregular hours, multi-client context switching |
| James the Retiree | Widower, hobby gardener, follows markets, has grandchildren, less tech-savvy | Gentle probing, simpler language, health/wellness sensitivity |
| Sara the Student | University, part-time job, roommates, limited budget, social media active | Fast-changing interests, budget constraints, social context |

12.3 Scenario Script Example — "Alex the CTO"

# Week 1
Turn 1: "Help me write a Python script to parse CSV files"
  → EM should extract: User prefers Python (explicit)
Turn 2: "Use type hints please, I hate untyped code"
  → EM should extract: strong preference for type hints (explicit, high confidence)
Turn 3: "Can you move my Thursday standup to Friday?"
  → EM should infer: has recurring standup (indirect inference)

# Week 2
Turn 4: "I want to buy a present for my wife"
  → EM should detect: knowledge gap about wife
Turn 5: "Her name is Lena. Birthday is December 2, 1979"
  → EM should create: Person(Lena), wife relation, DOB
  → EM should auto-suggest: birthday reminder
Turn 6: "She likes flowers and postcards. Maybe kitchen chairs this year"
  → EM should create: traits (flowers, postcards) + wish (chairs)
  → EM should spawn: deal monitoring if skill available

# Week 3
Turn 7: [wine conversation]
  → EM should surface probe: "Does Lena enjoy wine?"
Turn 8: "She loves reds, especially Malbec"
  → EM should create: wine/Malbec preferences for Lena

# Week 4 (simulated) — Background revision runs
  → EM should verify: Acme Corp still exists (public fact)
  → EM should run inference: Malbec → Argentina → user's travel interest?
  → EM should decay: any edges not reinforced

# Week 5
Turn 9: "We're planning a trip to Tokyo in March"
  → EM should create: Tokyo trip, March dates
  → EM should enrich: travel preferences from past patterns
Turn 10: "Book direct flights, I always prefer direct"
  → EM should reinforce: direct flight preference

12.4 Evaluation Metrics

| Metric | Target | Measurement |
| --- | --- | --- |
| Precision: facts stored are correct | > 95% | Compare graph to ground truth persona |
| Recall: facts mentioned are captured | > 85% | Count of expected facts found in graph |
| Confidence calibration: confidence scores match actual correctness | Calibration error < 0.10 | Compare confidence to binary correct/incorrect |
| False positive rate: incorrect facts stored at high confidence | < 2% | Count of wrong facts with confidence > 0.70 |
| Probe relevance: probes generated at appropriate moments | > 80% relevant | Human evaluation of probe timing and context fit |
| Probe naturalness: probes feel conversational, not interrogative | > 85% natural | Human evaluation / LLM-as-judge |
| Inference accuracy: overnight inferences are reasonable | > 60% useful | Human evaluation of inferred connections |
| Contradiction handling: corrections properly applied | 100% | All explicit corrections reflected in graph |
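
The automatable metrics above can be computed from two inputs: the facts in the graph with their confidence, and the persona's ground-truth facts. The sketch below is one way to do that. Representing facts as string keys, dividing the false-positive count by the number of stored facts, and measuring calibration as a binned expected calibration error are all assumptions; the specification only says to compare confidence with binary correctness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    precision: float
    recall: float
    high_conf_false_positive_rate: float
    calibration_error: float


def evaluate(stored: dict[str, float], truth: set[str],
             high_conf: float = 0.70, n_bins: int = 10) -> Metrics:
    """stored maps fact key -> confidence; truth is the set of correct fact keys."""
    if not stored or not truth:
        raise ValueError("need at least one stored fact and one ground-truth fact")
    correct = [fact for fact in stored if fact in truth]
    wrong_high = [f for f, c in stored.items() if f not in truth and c > high_conf]

    # Group facts into equal-width confidence bins.
    bins: dict[int, list[tuple[float, bool]]] = {}
    for fact, conf in stored.items():
        index = min(int(conf * n_bins), n_bins - 1)
        bins.setdefault(index, []).append((conf, fact in truth))

    # Weighted gap between mean confidence and observed accuracy per bin.
    ece = 0.0
    for members in bins.values():
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / len(stored) * abs(mean_conf - accuracy)

    return Metrics(
        precision=len(correct) / len(stored),
        recall=len(correct) / len(truth),
        high_conf_false_positive_rate=len(wrong_high) / len(stored),
        calibration_error=ece,
    )
```

A scenario run would call `evaluate` once per persona and compare each field against the targets in the table.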

13. Level 5 — Experience Quality Evaluation

The highest-level test: does the agent feel like it knows the user?

13.1 LLM-as-Judge Evaluation

sequenceDiagram
    participant SIM as Scenario<br/>Simulator
    participant EM as Experience<br/>Memory
    participant AR as Agent Runtime<br/>(with EM)
    participant AR_B as Agent Runtime<br/>(without EM, baseline)
    participant JUDGE as LLM Judge<br/>(evaluator)

    SIM->>EM: Run 50-interaction scenario
    SIM->>AR: Present test prompt
    AR->>AR: Generate response with EM context
    SIM->>AR_B: Present same test prompt
    AR_B->>AR_B: Generate response without EM

    SIM->>JUDGE: "Here is a user profile.<br/>Here are two agent responses<br/>to the same question.<br/>Which response demonstrates<br/>better understanding of the user?<br/>Rate on 5 dimensions."

    JUDGE-->>SIM: Evaluation scores

13.2 Evaluation Dimensions

| Dimension | What It Measures | Example |
| --- | --- | --- |
| Personalization | Does the response reflect user's known preferences? | Agent suggests Python with type hints without being asked |
| Anticipation | Does the agent predict needs before being asked? | Agent mentions upcoming birthday during gift conversation |
| Consistency | Does the agent maintain coherent knowledge across topics? | Wife's name is always Lena, never confused |
| Naturalness | Do probes and suggestions feel organic? | Wine → birthday probe feels like a thoughtful friend |
| Restraint | Does the agent avoid overstepping or being creepy? | Agent doesn't volunteer private info unprompted |

13.3 A/B Comparison Framework

| Test | Agent A (with EM) | Agent B (baseline) | Evaluator |
|---|---|---|---|
| Same-session personalization | Uses graph context | Uses conversation history only | LLM judge |
| Cross-session knowledge | Remembers facts from weeks ago | Each session starts fresh | LLM judge |
| Proactive suggestions | Offers birthday reminder, deal alerts | Only responds to explicit requests | Human evaluation |
| Preference-aware skill invocation | Passes EM preferences to skills | Asks user every time | Task completion speed |
| Error recovery after correction | Updates graph, never repeats mistake | May repeat error in future sessions | Correctness tracking |
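
For the judge-scored rows, the per-prompt scores have to be rolled up into a comparison between the two agents. The following sketch shows one way to do that. It assumes the judge returns an integer score per dimension for each response (the specification says only "Rate on 5 dimensions", so the scale is an assumption), and the function names are illustrative.

```python
from statistics import mean

DIMENSIONS = ("personalization", "anticipation", "consistency", "naturalness", "restraint")

Scores = dict[str, int]  # one judge evaluation: dimension name -> score


def mean_by_dimension(evaluations: list[Scores]) -> dict[str, float]:
    """Average each dimension over all test prompts."""
    return {d: mean(e[d] for e in evaluations) for d in DIMENSIONS}


def uplift(with_em: list[Scores], baseline: list[Scores]) -> dict[str, float]:
    """Per-dimension difference: EM agent mean minus baseline mean."""
    a, b = mean_by_dimension(with_em), mean_by_dimension(baseline)
    return {d: round(a[d] - b[d], 3) for d in DIMENSIONS}


def win_rate(with_em: list[Scores], baseline: list[Scores]) -> float:
    """Fraction of prompts where the EM agent's total score beats the baseline's (ties are not wins)."""
    wins = 0
    for a, b in zip(with_em, baseline):
        if sum(a[d] for d in DIMENSIONS) > sum(b[d] for d in DIMENSIONS):
            wins += 1
    return wins / len(with_em)
```

Reporting both numbers is deliberate: the uplift shows which dimensions Experience Memory improves, while the win rate guards against a few very high scores masking a typical tie.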

Part III: Configuration


14. Configuration Philosophy

Sensible defaults. Minimal required configuration. Expert overrides available.

The configuration is organized into three tiers:

| Tier | Audience | Changed When | Examples |
|---|---|---|---|
| Defaults | Nobody (baked in) | Never by user, only by SecretAI engineering | Confidence formula weights, graph schema |
| Profile | Every user (on setup) | During onboarding or in Settings | Timezone, language, preferred channels, proactivity level |
| Expert | Power users / developers | Via Control UI or config file | Decay rates, probe frequency, inference schedule |
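
One way to realize the three tiers is a layered merge in which each tier overrides only the keys it sets. The sketch below assumes each tier is loaded as a nested dict and that precedence runs Defaults, then Profile, then Expert. The precedence order and both function names are assumptions, not part of the specification.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where values in `override` win, merging nested dicts key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_config(defaults: dict, profile: dict, expert: dict) -> dict:
    """Apply the tiers in order of increasing precedence."""
    return deep_merge(deep_merge(defaults, profile), expert)
```

Because the merge works key by key, an expert who changes only `decay.default_rate` keeps every other default in the `decay` section.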

15. Configuration Schema

# experience_memory.yaml
# SecretAI Rails — Experience Memory Configuration

# ============================================================
# PROFILE TIER — User-facing settings
# ============================================================
profile:
  # User timezone — affects scheduling of starters, revision, probes
  timezone: "America/Los_Angeles"

  # Primary language for probe generation
  language: "en"

  # How proactive should the agent be?
  # conservative: Only suggest when explicitly relevant, minimal probing
  # balanced: Contextual probes, moderate starters (DEFAULT)
  # proactive: Frequent probes, more conversation starters, anticipatory
  proactivity_level: "balanced"

  # Channels where agent may initiate conversation (starters)
  # Agent will never initiate on channels not listed here
  starter_channels:
    - "telegram"
    - "whatsapp"

  # Do Not Disturb hours — no agent-initiated contact
  quiet_hours:
    start: "22:00"
    end: "07:00"

  # Should the agent explain why it knows something when asked?
  # If false, agent says "I remember from our conversations"
  # If true, agent traces provenance chain
  explain_knowledge: true

# ============================================================
# EXTRACTION — Controls how facts are extracted from conversation
# ============================================================
extraction:
  # Enable/disable indirect inference (Pattern 4)
  # Some users may prefer only explicit fact storage
  enable_indirect_inference: true

  # Minimum confidence threshold for storing an extracted fact
  # Below this, the extraction is discarded
  min_storage_confidence: 0.25

  # Maximum entities to extract per message
  # Prevents runaway extraction on very long messages
  max_entities_per_message: 20

  # Maximum relations to extract per message
  max_relations_per_message: 30

  # Enable/disable sentiment analysis on extractions
  enable_sentiment: true

  # STT confidence multiplier — reduce extraction confidence
  # when speech-to-text confidence is low
  stt_confidence_floor: 0.70  # Below this STT score, skip extraction entirely
  stt_confidence_scale: true  # Scale extraction confidence by STT confidence

# ============================================================
# CONFIDENCE — Controls the confidence scoring model
# ============================================================
confidence:
  # Base confidence scores by acquisition mechanism
  base_scores:
    explicit: 0.90    # User directly states a fact
    observational: 0.65 # Pattern detected from behavior
    inferential: 0.45   # Cross-context inference
    reflective: 0.50    # Default for outcome-based learning

  # Hedge multipliers
  hedge_multipliers:
    none: 1.00          # "loves", "always"
    mild: 0.90          # "likes", "usually"
    moderate: 0.65       # "may", "might"
    strong: 0.50         # "I think", "maybe", "possibly"

  # Reinforcement boost per additional episode
  reinforcement_boost: 0.08

  # Maximum confidence (hard cap)
  max_confidence: 0.99

  # Minimum confidence before archival
  archive_threshold: 0.15

  # Trait edges have a slower decay (protected)
  trait_decay_protection: 0.5  # Multiply decay rate by this for traits

# ============================================================
# DECAY — Controls how knowledge ages
# ============================================================
decay:
  # Default decay rate, expressed per month (cycle_frequency below sets how often decay runs)
  default_rate: 0.02  # 2% per month

  # Decay rates by temporal type (override default)
  rates_by_type:
    trait: 0.005        # Very slow — near-permanent
    state: 0.00         # No time decay, only contradictions
    wish: 0.04          # Faster decay — time-bounded desires
    episode: 0.08       # Fast decay — one-time events

  # Decay cycle frequency
  cycle_frequency: "weekly"  # "daily" | "weekly" | "monthly"

  # Grace period — no decay for N days after last reinforcement
  grace_period_days: 30

# ============================================================
# PROBING — Controls contextual probing behavior
# ============================================================
probing:
  # Maximum probes per conversation
  max_probes_per_conversation: 1

  # Maximum probes per day
  max_probes_per_day: 3

  # Maximum probes per week
  max_probes_per_week: 10

  # Minimum conversation turn before probing
  # Don't probe in the first N turns of a conversation
  min_turn_for_probe: 3

  # Minimum context-fit score to deliver a probe
  min_context_fit: 0.70

  # Retry limit — after N failed context matches, lower priority
  max_probe_attempts: 5

  # Cooldown after user ignores a probe (days)
  ignore_cooldown_days: 7

  # Cooldown after user deflects a probe (days)
  deflect_cooldown_days: 14

# ============================================================
# STARTERS — Controls agent-initiated conversations
# ============================================================
starters:
  # Enable/disable conversation starters entirely
  enabled: true

  # Maximum starters per day (across all types)
  max_per_day: 3

  # Maximum starters per week
  max_per_week: 10

  # Minimum relevance score to initiate contact
  min_relevance: 0.50

  # Per-type limits and settings
  types:
    alert:
      enabled: true
      max_per_day: 5          # Alerts can exceed normal limits
      min_relevance: 0.40     # Lower threshold for safety-related
      override_quiet_hours: true  # Weather warnings during DND

    opportunity:
      enabled: true
      max_per_day: 2
      min_relevance: 0.60
      override_quiet_hours: false

    revision:
      enabled: true
      max_per_day: 1
      min_relevance: 0.50
      override_quiet_hours: false

    insight:
      enabled: true
      max_per_week: 2         # Weekly limit, not daily
      min_relevance: 0.70
      override_quiet_hours: false

    anticipation:
      enabled: true
      lookahead_days: 10      # How far ahead to check
      min_relevance: 0.50
      override_quiet_hours: false

# ============================================================
# RISK MODEL — Controls proactive suggestion thresholds
# ============================================================
risk_model:
  # Auto-execute threshold: high confidence AND low cost
  auto_execute:
    min_confidence: 0.90
    max_cost_category: "none"  # "none" | "low" | "medium" | "high"

  # Suggest threshold: moderate+ confidence AND low-medium cost
  suggest:
    min_confidence: 0.50
    max_cost_category: "medium"

  # Casual mention threshold: any confidence, low cost
  casual_mention:
    min_confidence: 0.30
    max_cost_category: "low"

  # Defer threshold: everything else
  # (implicit — anything not matching above is deferred)

  # Cost categories for action types
  action_costs:
    create_reminder: "none"
    suggest_product: "low"
    spawn_monitoring_task: "low"
    send_message_on_behalf: "high"
    make_purchase: "high"
    modify_calendar: "medium"
    share_information: "medium"

# ============================================================
# BACKGROUND — Controls revision and inference workers
# ============================================================
background:
  # Revision schedule (cron syntax, in user's timezone)
  revision_schedule: "0 2 * * *"    # 2:00 AM daily

  # Inference schedule
  inference_schedule: "0 3 * * *"   # 3:00 AM daily

  # Episode clustering schedule
  clustering_schedule: "0 4 * * 0"  # 4:00 AM Sundays

  # Maximum revision batch size (edges per run)
  revision_batch_size: 100

  # Maximum inference chain depth (hops)
  max_inference_depth: 3

  # Maximum inferred edges per run
  max_inferred_edges: 20

  # Public fact verification — use web search to verify
  enable_public_fact_verification: true

  # Maximum web searches per revision cycle
  max_verification_searches: 10

# ============================================================
# EVENT MONITOR — Controls external event monitoring
# ============================================================
event_monitor:
  # Enable/disable entire event monitoring
  enabled: true

  # Source-specific settings
  sources:
    weather:
      enabled: true
      poll_interval_minutes: 30
      severity_threshold: "warning"  # "watch" | "warning" | "emergency"

    news:
      enabled: true
      poll_interval_minutes: 15
      max_articles_per_poll: 20

    market:
      enabled: false             # Opt-in only
      # poll_interval_minutes: 5
      # change_threshold_percent: 2.0

    calendar:
      enabled: true
      sync_interval_minutes: 5
      lookahead_days: 14

# ============================================================
# LLM — Controls LLM usage for extraction and inference
# ============================================================
llm:
  # Small LLM for extraction pipeline
  small:
    provider: "anthropic"       # "anthropic" | "openai" | "local"
    model: "claude-haiku-4-5-20251001"
    max_tokens: 1024
    temperature: 0.1            # Low for structured extraction
    timeout_ms: 2000
    retry_count: 2
    fallback: "skip"            # "skip" | "queue_retry" — what to do on failure

  # Large LLM for inference and synthesis
  large:
    provider: "anthropic"
    model: "claude-sonnet-4-5-20250929"
    max_tokens: 4096
    temperature: 0.3            # Slightly higher for creative inference
    timeout_ms: 30000
    retry_count: 1
    fallback: "queue_retry"

  # Token budget limits (per day)
  daily_token_budget:
    small: 500000              # ~500K tokens/day for extraction
    large: 100000              # ~100K tokens/day for inference

# ============================================================
# STORAGE — Controls graph database and vector store
# ============================================================
storage:
  neo4j:
    uri: "bolt://localhost:7687"
    database: "experience_memory"
    max_connections: 20
    query_timeout_ms: 5000

  vector_store:
    provider: "qdrant"          # "qdrant" | "chroma"
    uri: "http://localhost:6333"
    collection: "episodes"
    embedding_model: "text-embedding-3-small"
    embedding_dimensions: 1536

  queue:
    provider: "redis"           # "redis" | "nats"
    uri: "redis://localhost:6379"
    inbound_queue: "em:inbound"
    outbound_queue: "em:outbound"
    max_queue_size: 10000
    ttl_hours: 72               # Messages expire after 72h

# ============================================================
# PRIVACY — Controls data classification and sharing
# ============================================================
privacy:
  # Enable/disable experience sharing with other agents
  sharing_enabled: false        # Opt-in only

  # Minimum privacy level for shared data
  sharing_min_level: "L1"       # Only L0 and L1 shared

  # Enable/disable differential privacy on exports
  differential_privacy: true

  # Epsilon for differential privacy (lower = more private)
  dp_epsilon: 1.0

  # Auto-classify PII in extracted entities
  auto_pii_detection: true

  # Retention period for archived (low-confidence) edges
  archive_retention_days: 365

# ============================================================
# OBSERVABILITY — Controls metrics and logging
# ============================================================
observability:
  # Metrics export
  metrics:
    enabled: true
    export_interval_seconds: 60
    endpoint: "http://localhost:9090/metrics"  # Prometheus

  # Audit logging
  audit_log:
    enabled: true
    log_extractions: true       # Log every extraction operation
    log_graph_mutations: true   # Log every INSERT/REINFORCE/CONTRADICT
    log_probes_delivered: true  # Log every probe sent to user
    log_starters_delivered: true
    retention_days: 90

  # Health check
  health_check:
    interval_seconds: 30
    neo4j_timeout_ms: 1000
    redis_timeout_ms: 500
    vector_store_timeout_ms: 1000
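
The `risk_model` section is the part of this schema whose behaviour is easiest to misread, so a short sketch may help. It assumes the tiers are checked in the order auto_execute, suggest, casual_mention, with the first match winning and everything else deferred. It also assumes that an action missing from `action_costs` is treated as "high" cost, a fail-safe choice the specification does not state. The `decide` function is hypothetical.

```python
COST_ORDER = ["none", "low", "medium", "high"]

# Values copied from the risk_model section of the schema above.
RISK_MODEL = {
    "auto_execute": {"min_confidence": 0.90, "max_cost_category": "none"},
    "suggest": {"min_confidence": 0.50, "max_cost_category": "medium"},
    "casual_mention": {"min_confidence": 0.30, "max_cost_category": "low"},
}

ACTION_COSTS = {
    "create_reminder": "none",
    "suggest_product": "low",
    "spawn_monitoring_task": "low",
    "send_message_on_behalf": "high",
    "make_purchase": "high",
    "modify_calendar": "medium",
    "share_information": "medium",
}


def decide(action: str, confidence: float) -> str:
    """Return the first tier whose confidence floor and cost ceiling both hold, else 'defer'."""
    # Unknown actions default to "high" cost (assumption: fail safe).
    cost_rank = COST_ORDER.index(ACTION_COSTS.get(action, "high"))
    for tier in ("auto_execute", "suggest", "casual_mention"):
        rule = RISK_MODEL[tier]
        cost_ok = cost_rank <= COST_ORDER.index(rule["max_cost_category"])
        if confidence >= rule["min_confidence"] and cost_ok:
            return tier
    return "defer"
```

Under these assumptions a high-cost action such as `make_purchase` is always deferred, whatever the confidence, because no tier allows a cost above "medium".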

16. Configuration Profiles — Proactivity Presets

The proactivity_level setting maps to a coherent set of overrides:

| Setting | Conservative | Balanced (Default) | Proactive |
|---|---|---|---|
| probing.max_probes_per_conversation | 0 | 1 | 2 |
| probing.max_probes_per_week | 3 | 10 | 20 |
| probing.min_context_fit | 0.90 | 0.70 | 0.50 |
| probing.min_turn_for_probe | 5 | 3 | 2 |
| starters.enabled | false | true | true |
| starters.max_per_day | 0 | 3 | 5 |
| starters.max_per_week | 0 | 10 | 25 |
| starters.min_relevance | N/A | 0.50 | 0.35 |
| extraction.enable_indirect_inference | false | true | true |
| background.max_inferred_edges | 0 | 20 | 50 |
| risk_model.auto_execute.min_confidence | 0.99 | 0.90 | 0.80 |

Users select a preset during onboarding. Power users can override individual settings.

graph LR
    ONBOARD["Onboarding:<br/>'How proactive should<br/>your agent be?'"]

    ONBOARD -->|"Conservative"| C["Minimal probing<br/>No starters<br/>No inference<br/>Only explicit facts"]
    ONBOARD -->|"Balanced"| B["Contextual probing<br/>Moderate starters<br/>Inference enabled<br/>Standard thresholds"]
    ONBOARD -->|"Proactive"| P["Frequent probing<br/>Active starters<br/>Deep inference<br/>Lower thresholds"]

    C --> OVERRIDE["User can override<br/>any individual setting<br/>in Control UI"]
    B --> OVERRIDE
    P --> OVERRIDE

    style B fill:#d4edda,stroke:#155724
    style OVERRIDE fill:#fff3cd,stroke:#856404

17. Runtime Configuration Hot-Reload

Some configuration changes should take effect immediately without restarting the EM service. Others require a restart.

| Config Section | Hot-Reloadable | Why |
|---|---|---|
| profile.* | ✅ Yes | User preferences change frequently |
| probing.* | ✅ Yes | User may want to adjust probing mid-conversation |
| starters.* | ✅ Yes | User may want to mute starters temporarily |
| risk_model.* | ✅ Yes | Threshold tuning shouldn't require restart |
| confidence.base_scores | ✅ Yes | Tuning during evaluation |
| decay.* | ✅ Yes | Affects next decay cycle |
| extraction.* | ⚠️ Partial | enable_indirect_inference is hot-reloadable; max_entities_per_message requires a pipeline restart |
| background.*_schedule | ❌ No | Cron schedules are registered at startup |
| llm.* | ❌ No | Model changes require reconnection |
| storage.* | ❌ No | Database connections are established at startup |

sequenceDiagram
    participant UI as Control UI
    participant API as EM Config API
    participant EM as EM Service
    participant CACHE as Config Cache

    UI->>API: PUT /config {probing: {max_probes_per_day: 5}}
    API->>API: Validate against schema
    API->>CACHE: Update in-memory config
    API->>EM: Signal: CONFIG_CHANGED
    EM->>CACHE: Reload affected sections
    EM-->>API: ACK: config updated

    Note over EM: Next probe check uses<br/>new max_probes_per_day = 5.<br/>No restart needed.
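
Before signalling CONFIG_CHANGED, the Config API has to know which changed keys can be applied live. The sketch below classifies dotted config keys against the hot-reload table using glob patterns. Treating any key the table does not mention as restart-required is an assumption made here to stay on the safe side, and `plan_reload` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Patterns for keys the table marks as hot-reloadable.
HOT_RELOADABLE = (
    "profile.*",
    "probing.*",
    "starters.*",
    "risk_model.*",
    "confidence.base_scores*",
    "decay.*",
    "extraction.enable_indirect_inference",
)


def is_hot_reloadable(key: str) -> bool:
    """True if the dotted key matches any hot-reloadable pattern."""
    return any(fnmatchcase(key, pattern) for pattern in HOT_RELOADABLE)


def plan_reload(changed_keys: set[str]) -> dict[str, list[str]]:
    """Split changed keys into those applied live and those that need a restart."""
    plan: dict[str, list[str]] = {"hot": [], "restart": []}
    for key in sorted(changed_keys):
        plan["hot" if is_hot_reloadable(key) else "restart"].append(key)
    return plan
```

A non-empty `restart` list could be surfaced in the Control UI so the user knows the change is saved but not yet active.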

18. Configuration Validation Rules

| Rule | Scope | Error on Violation |
|---|---|---|
| probing.max_probes_per_day ≤ probing.max_probes_per_week | Cross-field | Config rejected |
| starters.max_per_day ≤ starters.max_per_week | Cross-field | Config rejected |
| confidence.archive_threshold < extraction.min_storage_confidence | Cross-field | Warning (facts stored but immediately at risk of archival) |
| decay.grace_period_days > 0 | Single field | Config rejected |
| llm.small.temperature < 0.5 | Single field | Warning (higher temp reduces extraction reliability) |
| llm.daily_token_budget.small > 0 | Single field | Config rejected |
| privacy.dp_epsilon > 0 | Single field | Config rejected |
| profile.quiet_hours.start ≠ profile.quiet_hours.end | Cross-field | Warning (empty quiet hours) |
| All starters.types.*.min_relevance ≥ 0.0 and ≤ 1.0 | Range | Config rejected |
| background.max_inference_depth ≤ 5 | Range | Warning (deep chains are expensive and speculative) |
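
The rules above separate hard errors, which reject the config, from warnings, which do not. The sketch below covers a subset of them, assuming the config is loaded into nested dicts. It reads the two limit rules as "the daily limit must not exceed the weekly limit" and the quiet-hours rule as "start must differ from end". The `validate` function and its messages are illustrative.

```python
def validate(config: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings); a non-empty errors list means the config is rejected."""
    errors: list[str] = []
    warnings: list[str] = []

    probing, starters = config["probing"], config["starters"]
    if probing["max_probes_per_day"] > probing["max_probes_per_week"]:
        errors.append("probing.max_probes_per_day exceeds probing.max_probes_per_week")
    if starters["max_per_day"] > starters["max_per_week"]:
        errors.append("starters.max_per_day exceeds starters.max_per_week")
    if config["decay"]["grace_period_days"] <= 0:
        errors.append("decay.grace_period_days must be > 0")
    if config["privacy"]["dp_epsilon"] <= 0:
        errors.append("privacy.dp_epsilon must be > 0")

    archive = config["confidence"]["archive_threshold"]
    if archive >= config["extraction"]["min_storage_confidence"]:
        warnings.append("stored facts are immediately at risk of archival")
    if config["llm"]["small"]["temperature"] >= 0.5:
        warnings.append("llm.small.temperature may reduce extraction reliability")
    quiet = config["profile"]["quiet_hours"]
    if quiet["start"] == quiet["end"]:
        warnings.append("quiet hours are empty")
    if config["background"]["max_inference_depth"] > 5:
        warnings.append("deep inference chains are expensive and speculative")

    return errors, warnings
```

Returning both lists, instead of raising on the first problem, lets the Control UI show every issue in one pass.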

19. Per-Entity Configuration Overrides

Some entities may need different treatment than the defaults. For example, the user might want aggressive monitoring for work-related entities but conservative handling for personal/family entities.

# Entity-level overrides (stored in Neo4j as node properties)
entity_overrides:
  - entity_match: "Lena"
    overrides:
      probing:
        max_probes_per_week: 2    # Be gentle about probing wife-related topics
      risk_model:
        suggest:
          min_confidence: 0.70     # Higher bar for suggestions involving wife

  - entity_match: "Acme Corp"
    overrides:
      starters:
        types:
          news:
            enabled: true
            min_relevance: 0.30    # Lower threshold — user wants all company news

  - entity_match: "health"
    entity_type: "concept"
    overrides:
      privacy:
        sharing_min_level: "L4"    # Never share health-related knowledge
      extraction:
        enable_indirect_inference: false  # Don't infer health facts

20. Monitoring Configuration Effectiveness

The configuration needs its own feedback loop. SecretAI should track how configuration choices affect user engagement and satisfaction.

graph TB
    subgraph Configuration Metrics
        M1["Probe acceptance rate<br/>% of probes user responds to"]
        M2["Starter engagement rate<br/>% of starters that lead to conversation"]
        M3["Probe annoyance signal<br/>User ignores/deflects/mutes"]
        M4["Graph growth rate<br/>New edges per week"]
        M5["Correction rate<br/>How often user says 'that's wrong'"]
        M6["Token efficiency<br/>Useful extractions per token spent"]
    end

    subgraph Signals
        M1 -->|"Low acceptance"| S1["Consider: probes not relevant enough<br/>→ raise min_context_fit"]
        M2 -->|"Low engagement"| S2["Consider: starters not valuable enough<br/>→ raise min_relevance"]
        M3 -->|"High annoyance"| S3["Consider: too many probes<br/>→ reduce max_probes_per_week"]
        M4 -->|"Graph stagnant"| S4["Consider: extraction too conservative<br/>→ lower min_storage_confidence"]
        M5 -->|"High corrections"| S5["Consider: extraction too aggressive<br/>→ raise confidence thresholds"]
        M6 -->|"Low efficiency"| S6["Consider: wrong LLM or prompts<br/>→ tune extraction prompts"]
    end

    style M3 fill:#fdd,stroke:#c00
    style M5 fill:#fdd,stroke:#c00
    style S3 fill:#fff3cd,stroke:#856404
    style S5 fill:#fff3cd,stroke:#856404

| Metric | Healthy Range | Action if Out of Range |
|---|---|---|
| Probe acceptance rate | 40–70% | Below 40%: raise min_context_fit. Above 70%: min_context_fit can be lowered slightly to learn more |
| Starter engagement rate | 30–60% | Below 30%: raise min_relevance or reduce frequency |
| Probe annoyance signal | < 20% | Above 20%: immediately reduce max_probes_per_week |
| Graph growth rate | 5–20 edges/week | Below 5: lower min_storage_confidence. Above 20: normal for an active user |
| Correction rate | < 5% | Above 5%: extraction quality issue; audit the pipeline |
| False positive rate (high-confidence wrong facts) | < 2% | Above 2%: critical; lower the confidence base scores |
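
A periodic job could compare observed metrics with these healthy ranges and flag the ones that need attention. The sketch below encodes only the ranges and reports the direction of any deviation. The metric keys are illustrative names, rates are expressed as fractions, and treating a value exactly on a boundary as healthy is an assumption.

```python
# (low, high) bounds per metric; None means the range is open on that side.
HEALTHY_RANGES = {
    "probe_acceptance_rate": (0.40, 0.70),
    "starter_engagement_rate": (0.30, 0.60),
    "probe_annoyance_signal": (None, 0.20),
    "graph_growth_edges_per_week": (5, 20),
    "correction_rate": (None, 0.05),
    "false_positive_rate": (None, 0.02),
}


def check_metrics(observed: dict[str, float]) -> dict[str, str]:
    """Map each observed metric to 'below', 'above', or 'ok' relative to its healthy range."""
    report: dict[str, str] = {}
    for name, value in observed.items():
        low, high = HEALTHY_RANGES[name]
        if low is not None and value < low:
            report[name] = "below"
        elif high is not None and value > high:
            report[name] = "above"
        else:
            report[name] = "ok"  # boundary values count as healthy (assumption)
    return report
```

The report deliberately stops at "below" or "above": choosing which setting to adjust is left to the table's guidance and to a human or a tuning policy.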

21. Document Index

This document is part of the SecretAI Rails Experience Memory documentation series:

| Document | Description | Status |
|---|---|---|
| Experience Memory Summary | Technology overview, competitive positioning, roadmap | ✅ Complete |
| Security Architecture Comparison | Confidential VM vs. localhost trust model | ✅ Complete |
| Experience Memory Architecture | System diagrams, data flows, schemas, deployment | ✅ Complete |
| Integration, Testing, and Configuration | This document | ✅ Complete |
| Experience Memory API Reference | MCP tool specifications, gRPC protobuf definitions | 📋 Planned |
| Extraction Pipeline Tuning Guide | Per-stage configuration, LLM prompts, evaluation | 📋 Planned |
| Knowledge Graph Operations Manual | Neo4j migrations, backup/restore, performance | 📋 Planned |

SecretAI — Agents that learn. Memory that compounds. Privacy that's provable.