Experience Memory: Integration, Testing, and Configuration
SecretAI Rails — Engineering Specification v1.0 — February 2026
Part I: System Integration
1. Integration Overview
Experience Memory interacts with the rest of the SecretAI Rails platform through three interaction modes, each with distinct reliability and latency requirements.
graph TB
subgraph Synchronous ["Critical Path"]
direction LR
S1["Agent calls em_query<br/>during conversation"]
S2["Agent calls em_get_probes<br/>between turns"]
S3["Agent calls em_get_provenance<br/>when user asks 'why?'"]
end
subgraph Asynchronous ["Fire and Forget"]
direction LR
A1["Agent reports interaction<br/>via inbound queue"]
A2["Agent reports user correction<br/>via em_user_correction"]
A3["Skills report outcomes<br/>via event bus"]
end
subgraph Background ["Autonomous"]
direction LR
B1["Revision workers<br/>audit graph overnight"]
B2["Inference workers<br/>discover patterns"]
B3["Event monitor<br/>polls external sources"]
B4["Calendar trigger<br/>checks upcoming events"]
end
S1 -->|"< 100ms"| REQ["Requirement:<br/>Fast, graceful fallback"]
A1 -->|"< 10ms enqueue"| REQ2["Requirement:<br/>Buffer, never block"]
B1 -->|"No latency constraint"| REQ3["Requirement:<br/>Don't starve sync path"]
style S1 fill:#fdd,stroke:#c00
style S2 fill:#fdd,stroke:#c00
style S3 fill:#fdd,stroke:#c00
style A1 fill:#fff3cd,stroke:#856404
style A2 fill:#fff3cd,stroke:#856404
style A3 fill:#fff3cd,stroke:#856404
style B1 fill:#d4edda,stroke:#155724
style B2 fill:#d4edda,stroke:#155724
style B3 fill:#d4edda,stroke:#155724
style B4 fill:#d4edda,stroke:#155724
2. Agent Runtime ↔ Experience Memory
This is the primary integration surface. The Agent Runtime is the main consumer of Experience Memory.
2.1 Conversation Lifecycle Integration
sequenceDiagram
participant U as User
participant AR as Agent Runtime
participant EM_MCP as EM MCP Server
participant IQ as Inbound Queue
participant OQ as Outbound Queue
participant NEO as Neo4j
Note over AR: User message arrives
rect rgb(255, 235, 235)
Note over AR,EM_MCP: Phase 1: Context Retrieval (Sync, Critical Path)
AR->>EM_MCP: em_query({entity: "Lena",<br/>min_confidence: 0.5, max_hops: 2})
EM_MCP->>NEO: Cypher query
NEO-->>EM_MCP: Subgraph result
EM_MCP-->>AR: {entities: [...], edges: [...]}
Note over AR: Agent now has context<br/>about Lena for this turn
end
rect rgb(235, 255, 235)
Note over AR: Phase 2: Generate Response
AR->>AR: LLM call with graph context<br/>injected into system prompt
AR->>U: Response delivered
end
rect rgb(255, 248, 220)
Note over AR,IQ: Phase 3: Report Interaction (Async, Non-blocking)
AR->>IQ: em_report_interaction({<br/>text: "...",<br/>entities_mentioned: ["Lena", "wine"],<br/>session_id: "...",<br/>timestamp: "..."})
Note over IQ: Queued for extraction.<br/>Agent does NOT wait.
end
rect rgb(220, 235, 255)
Note over AR,OQ: Phase 4: Check for Probes (Sync, Optional)
AR->>OQ: em_get_probes({<br/>active_topics: ["wine"],<br/>entities_in_scope: ["Lena"]})
OQ-->>AR: Probe: "Lena's birthday..."<br/>or empty []
alt Probe available and context fits
AR->>U: "By the way — Lena's birthday..."
else No probe or poor fit
Note over AR: Skip, continue normally
end
end
2.2 Failure Modes and Fallbacks
Every integration point must have a defined failure mode. Experience Memory is an enhancement, not a dependency — the agent must always be able to respond even if EM is completely down.
graph TB
subgraph em_query Failure
F1_TRIGGER["Neo4j timeout<br/>or EM process down"]
F1_ACTION["Fallback: Agent responds<br/>without graph context.<br/>Use conversation history only."]
F1_RECOVER["Log missed query.<br/>EM restarts automatically<br/>via supervisor."]
end
subgraph em_report_interaction Failure
F2_TRIGGER["Redis queue full<br/>or EM ingestion stalled"]
F2_ACTION["Fallback: Buffer in Agent's<br/>local memory (last 100 events).<br/>Retry on reconnect."]
F2_RECOVER["Agent drains buffer<br/>to queue when EM recovers."]
end
subgraph em_get_probes Failure
F3_TRIGGER["Outbound queue<br/>unreachable"]
F3_ACTION["Fallback: Skip probing.<br/>Conversation continues normally.<br/>User never notices."]
F3_RECOVER["Probes remain in queue.<br/>Delivered on next successful check."]
end
subgraph Background Worker Failure
F4_TRIGGER["Revision or inference<br/>worker crashes"]
F4_ACTION["Fallback: No immediate impact.<br/>Knowledge graph is stale<br/>but still functional."]
F4_RECOVER["Supervisor restarts worker.<br/>Missed batch runs at next<br/>scheduled window."]
end
style F1_TRIGGER fill:#fdd,stroke:#c00
style F2_TRIGGER fill:#fdd,stroke:#c00
style F3_TRIGGER fill:#fdd,stroke:#c00
style F4_TRIGGER fill:#fdd,stroke:#c00
style F1_ACTION fill:#fff3cd,stroke:#856404
style F2_ACTION fill:#fff3cd,stroke:#856404
style F3_ACTION fill:#fff3cd,stroke:#856404
style F4_ACTION fill:#fff3cd,stroke:#856404
style F1_RECOVER fill:#d4edda,stroke:#155724
style F2_RECOVER fill:#d4edda,stroke:#155724
style F3_RECOVER fill:#d4edda,stroke:#155724
style F4_RECOVER fill:#d4edda,stroke:#155724
| Integration Point | Failure Impact | User Visible? | Recovery |
|---|---|---|---|
| em_query timeout | Agent responds without graph context | Subtle: less personalized response | Auto-retry next turn |
| em_query returns empty | Agent has no knowledge about topic | Subtle: behaves like new agent | Normal: graph may not have data yet |
| Inbound queue full | Interactions not processed, graph stops learning | No | Buffer in agent, drain on recovery |
| Inbound queue slow | Extraction delayed, probes arrive late | No | Queue catches up during idle |
| Outbound queue unreachable | No probes or starters delivered | No — agent just doesn't probe | Probes accumulate, delivered on recovery |
| Neo4j down | All graph queries fail | Yes — no personalization | Supervisor restarts, agent uses fallback |
| Small LLM down | Extraction pipeline stalls | No (async) | Queue buffers, processes on recovery |
| Large LLM down | Inference and synthesis stall | No (background) | Retries at next scheduled window |
| Redis down | All queues fail | Partial — no probes, interactions buffer | Supervisor restarts, agents use local buffer |
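The query and reporting fallbacks in the table above can be sketched in a few lines. This is a minimal illustration, not the Agent Runtime's actual client: `ResilientEMClient`, `query_or_empty`, `report`, and `drain` are hypothetical names, and the two injected coroutines stand in for `em_query` and `em_report_interaction`. The 100 ms timeout and the 100-event buffer come from the diagrams in this section.

```python
import asyncio
from collections import deque

QUERY_TIMEOUT_S = 0.1     # the "< 100ms" sync budget from the overview diagram
LOCAL_BUFFER_SIZE = 100   # "last 100 events" from the fallback diagram


class ResilientEMClient:
    """Hypothetical wrapper: the agent keeps working when EM is slow or down."""

    def __init__(self, query, enqueue):
        self._query = query      # coroutine standing in for em_query
        self._enqueue = enqueue  # coroutine standing in for em_report_interaction
        # Bounded buffer: once full, appending drops the oldest event.
        self.buffer = deque(maxlen=LOCAL_BUFFER_SIZE)

    async def query_or_empty(self, params):
        """Return graph context, or an empty context on timeout or outage."""
        try:
            return await asyncio.wait_for(self._query(params), QUERY_TIMEOUT_S)
        except (asyncio.TimeoutError, ConnectionError):
            return {"entities": [], "edges": []}

    async def report(self, event):
        """Buffer the event, then try to flush everything that is pending."""
        self.buffer.append(event)
        await self.drain()

    async def drain(self):
        """Send buffered events in order; stop at the first failure."""
        while self.buffer:
            try:
                await self._enqueue(self.buffer[0])
            except ConnectionError:
                return  # queue still unavailable; retry on the next call
            self.buffer.popleft()
```

Routing every report through the buffer keeps ordering intact: an event is removed only after the queue has accepted it, so a recovery drains older events before newer ones.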
2.3 Context Injection into Agent's LLM Prompt
When the Agent Runtime queries Experience Memory, the results need to be formatted and injected into the LLM's context. This is where the graph becomes actionable.
graph TB
subgraph Agent builds LLM prompt
SYS["System Prompt<br/>(agent personality, rules)"]
EM_CTX["Experience Memory Context Block<br/>(injected from em_query results)"]
CONV["Conversation History<br/>(recent turns)"]
USER_MSG["Current User Message"]
end
SYS --> PROMPT["Assembled Prompt"]
EM_CTX --> PROMPT
CONV --> PROMPT
USER_MSG --> PROMPT
PROMPT --> LLM["LLM"]
style EM_CTX fill:#d4edda,stroke:#155724
Context block format — structured, concise, confidence-annotated:
<experience_context>
<entity name="Lena" relation="wife">
<fact confidence="0.90" type="trait">Likes flowers</fact>
<fact confidence="0.90" type="trait">Likes postcards</fact>
<fact confidence="0.85" type="trait">Prefers red wine, especially Malbec</fact>
<fact confidence="0.55" type="wish" expires="2026-12">May want a new kitchen chair set</fact>
<fact confidence="0.99" type="state">Birthday: December 2</fact>
</entity>
<pending_probe topic="wine" priority="0.85">
Lena's birthday is ~1 month away. Knowledge gap: does she enjoy wine subscriptions?
</pending_probe>
<active_reminders>
Birthday reminder for Lena: fires November 25
</active_reminders>
</experience_context>
The agent's system prompt includes instructions on how to use this context: prefer high-confidence facts, hedge when using low-confidence facts, never reveal raw confidence scores to the user, and integrate probes naturally into conversation when context fits.
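As an illustration of how query results might be turned into the context block shown above, here is a minimal sketch. The input shape (a list of entity dicts, each with a `facts` list) and the `min_confidence` cutoff are assumptions made for the example; the actual `em_query` response schema is not defined in this section.

```python
from xml.sax.saxutils import escape, quoteattr


def render_context(entities, min_confidence=0.5):
    """Render entity facts as an <experience_context> block (hypothetical helper)."""
    lines = ["<experience_context>"]
    for ent in entities:
        lines.append(
            f"  <entity name={quoteattr(ent['name'])} relation={quoteattr(ent['relation'])}>"
        )
        for fact in ent["facts"]:
            if fact["confidence"] < min_confidence:
                continue  # assumed policy: keep very uncertain facts out of the prompt
            attrs = f'confidence="{fact["confidence"]:.2f}" type={quoteattr(fact["type"])}'
            lines.append(f"    <fact {attrs}>{escape(fact['text'])}</fact>")
        lines.append("  </entity>")
    lines.append("</experience_context>")
    return "\n".join(lines)
```

Escaping the fact text and attribute values matters here because the facts originate from user conversation and could otherwise break the block's structure inside the prompt.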
3. Session Manager ↔ Experience Memory
The Session Manager routes messages from channel adapters to the Agent Runtime. It also provides session metadata that Experience Memory needs for context.
sequenceDiagram
participant CH as Channel Adapter
participant SM as Session Manager
participant AR as Agent Runtime
participant EM as Experience Memory
CH->>SM: Incoming message from Telegram
SM->>SM: Resolve session:<br/>user_id, channel, session_id,<br/>conversation_start, turn_count
SM->>AR: Deliver message + session metadata
AR->>EM: em_report_interaction({<br/>...,<br/>channel: "telegram",<br/>session_id: "sess_abc123",<br/>turn_number: 7,<br/>conversation_topic: "wine"})
Note over EM: Channel and session metadata<br/>help EM understand context:<br/>- Telegram = casual channel<br/>- Turn 7 = mid-conversation<br/>- Topic = wine
| Session Metadata | Used By EM For |
|---|---|
| channel | Adjusting probe formality (Slack = professional, Telegram = casual) |
| session_id | Grouping interactions into episodes |
| turn_number | Determining if now is appropriate for probing (not turn 1) |
| conversation_start | Calculating conversation duration for episode records |
| user_timezone | Scheduling starters and revision delivery |
| user_locale | Language and cultural context for probes |
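Two of the rows above (channel formality and the "not turn 1" rule) can be expressed as a small gating function. This is a sketch under stated assumptions: `SessionMeta`, `probe_plan`, and the `"neutral"` fallback tone are hypothetical, and only the Slack and Telegram mappings come from the table.

```python
from dataclasses import dataclass


@dataclass
class SessionMeta:
    channel: str
    session_id: str
    turn_number: int


# Only these two mappings appear in the table; other channels are unspecified.
CHANNEL_TONE = {"slack": "professional", "telegram": "casual"}


def probe_plan(meta, default_tone="neutral"):
    """Return None when probing is inappropriate, else the tone to use."""
    if meta.turn_number <= 1:
        return None  # never probe on the opening turn
    return CHANNEL_TONE.get(meta.channel, default_tone)
```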
4. Skill Orchestrator ↔ Experience Memory
Skills (MCP servers) don't talk to Experience Memory directly — they're sandboxed. But the Skill Orchestrator reports skill outcomes to EM, and EM can influence which skills get invoked and how.
sequenceDiagram
participant AR as Agent Runtime
participant SO as Skill Orchestrator
participant SK as MCP Skill:<br/>Deal Monitor
participant EM as Experience Memory
participant IQ as Inbound Queue
AR->>SO: Invoke Deal Monitor skill<br/>for kitchen chair deals
SO->>SK: Execute: monitor({<br/>query: "kitchen chair set",<br/>budget: "under $500",<br/>notify_channel: "telegram"})
SK-->>SO: Acknowledged: monitoring started
Note over SK: Days later...
SK->>SO: Result: Deal found!<br/>$299 at Wayfair, 40% off
SO->>AR: Skill result: deal found
SO->>IQ: Skill outcome event:{<br/>skill: "deal_monitor",<br/>task: "kitchen chairs for Lena",<br/>outcome: "success",<br/>result: {price: 299, store: "Wayfair"}}
IQ->>EM: Process skill outcome
EM->>EM: Update procedure:<br/>"Deal monitoring for gifts"<br/>success count += 1,<br/>recompute success_rate<br/>Link to episode
AR->>AR: Format and deliver to user
Experience Memory Informing Skill Invocation
sequenceDiagram
participant U as User
participant AR as Agent Runtime
participant EM as EM MCP Server
participant SO as Skill Orchestrator
participant SK as MCP Skill
U->>AR: "Book me a flight to Tokyo"
AR->>EM: em_query({entity: "User",<br/>relation: "travel_preferences"})
EM-->>AR: {<br/>prefers: "direct flights",<br/>budget_style: "budget-conscious",<br/>seat_preference: "aisle",<br/>airline_loyalty: "ANA",<br/>confidence: 0.80}
AR->>SO: Invoke flight search skill<br/>with EM-enriched parameters:<br/>{destination: "Tokyo",<br/>prefer_direct: true,<br/>prefer_airline: "ANA",<br/>seat: "aisle",<br/>sort_by: "price"}
Note over AR: Without EM, agent would ask<br/>user all these questions.<br/>With EM, it already knows.
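The enrichment step in the diagram above could look roughly like the following. The function name, the `min_confidence` gate of 0.7, and the rule that explicit request fields win over remembered preferences are all assumptions for illustration; the field names mirror the diagram.

```python
def enrich_flight_params(request, prefs, min_confidence=0.7):
    """Merge remembered travel preferences into a skill request (hypothetical helper)."""
    params = dict(request)
    if prefs.get("confidence", 0.0) < min_confidence:
        return params  # too uncertain: let the agent ask the user instead
    # setdefault keeps anything the user stated explicitly in this request.
    if prefs.get("prefers") == "direct flights":
        params.setdefault("prefer_direct", True)
    if "airline_loyalty" in prefs:
        params.setdefault("prefer_airline", prefs["airline_loyalty"])
    if "seat_preference" in prefs:
        params.setdefault("seat", prefs["seat_preference"])
    if prefs.get("budget_style") == "budget-conscious":
        params.setdefault("sort_by", "price")
    return params
```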
5. Control UI and Studio ↔ Experience Memory
The Control UI provides the user-facing interface for graph inspection, and Studio needs EM context for workflow design.
graph TB
subgraph Control UI
GRAPH_VIZ["Graph Visualizer<br/>Interactive node/edge explorer"]
FACT_MGR["Fact Manager<br/>View, edit, delete facts"]
CONF_PANEL["Confidence Dashboard<br/>Edge confidence distribution"]
ACTIVITY["Activity Log<br/>Recent extractions,<br/>probes delivered"]
SETTINGS["EM Settings<br/>Configuration knobs"]
end
subgraph Studio
WF_CTX["Workflow Context<br/>Inject EM knowledge<br/>into workflow conditions"]
WF_TRIGGER["Workflow Triggers<br/>Fire workflow when<br/>graph state changes"]
end
subgraph EM API
MCP_Q["em_query"]
MCP_SNAP["em_graph_snapshot"]
MCP_DEL["em_delete_node/edge"]
MCP_PROV["em_get_provenance"]
EVENTS["Event Stream<br/>(WebSocket)"]
end
GRAPH_VIZ -->|"Read"| MCP_SNAP
FACT_MGR -->|"Read/Delete"| MCP_DEL
FACT_MGR -->|"Why?"| MCP_PROV
CONF_PANEL -->|"Stats"| MCP_SNAP
ACTIVITY -->|"Stream"| EVENTS
SETTINGS -->|"Config API"| EM_CONFIG["em_update_config"]
WF_CTX -->|"Read"| MCP_Q
WF_TRIGGER -->|"Subscribe"| EVENTS
style GRAPH_VIZ fill:#d4edda,stroke:#155724
style SETTINGS fill:#fff3cd,stroke:#856404
style WF_TRIGGER fill:#cce5ff,stroke:#004085
Studio Workflow Trigger Example
A user designs a workflow in Studio: "When someone I know has a birthday within 14 days AND I don't have a gift planned, start the gift research procedure."
graph LR
TRIGGER["Graph Trigger:<br/>(:Person)-[:BIRTHDAY_WITHIN]->(14 days)<br/>AND NOT (:Person)-[:HAS_GIFT_PLAN]->()"]
TRIGGER --> COND["Condition:<br/>Person is in user's<br/>inner circle<br/>(confidence > 0.8)"]
COND --> ACTION["Action:<br/>1. Query EM for person's likes<br/>2. Spawn deal monitor skill<br/>3. Create reminder 7 days before"]
style TRIGGER fill:#fff3cd,stroke:#856404
style ACTION fill:#d4edda,stroke:#155724
6. Voice Pipeline ↔ Experience Memory
The STT-TTS pipeline introduces unique challenges: speech is noisier than text, and voice conversations tend to be more casual and faster-paced.
sequenceDiagram
participant U as User (Voice)
participant STT as STT Engine
participant SM as Session Manager
participant AR as Agent Runtime
participant EM as Experience Memory
participant TTS as TTS Engine
U->>STT: [Speech audio]
STT->>SM: Transcript: "Hey can you remind me<br/>to grab flowers for Lena<br/>on my way home?"
STT->>SM: Metadata: {<br/>confidence: 0.92,<br/>language: "en",<br/>noise_level: "low",<br/>speaking_rate: "fast"}
SM->>AR: Deliver + metadata
AR->>EM: em_query({entity: "Lena"})
EM-->>AR: {relation: "wife",<br/>likes: ["flowers"], ...}
AR->>AR: Generate response
AR->>TTS: "Got it — I'll remind you<br/>to pick up flowers for Lena<br/>when you're heading home.<br/>Any particular kind she likes?"
TTS->>U: [Speech audio]
AR->>EM: em_report_interaction({<br/>...,<br/>channel: "voice",<br/>stt_confidence: 0.92,<br/>speaking_rate: "fast"})
Note over EM: Voice metadata affects extraction:<br/>- Lower STT confidence → lower fact confidence<br/>- Fast speaking rate → more likely to contain<br/> casual/imprecise statements<br/>- Voice channel → adjust probe tone to conversational
| Voice-Specific Consideration | How EM Handles It |
|---|---|
| STT transcription errors | Reduce extraction confidence proportionally to STT confidence score |
| Casual/imprecise language | Widen hedging detection thresholds |
| Faster conversation pace | Reduce probe frequency (1 per 5 turns instead of 1 per conversation) |
| No visual UI available | Probes must be short and conversational |
| Ambient context (driving, cooking) | Adjust starter urgency thresholds based on detected activity |
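The first row of the table, scaling extraction confidence by the STT confidence score, reads most directly as a multiplication. The sketch below takes that literal reading; the function name and the range checks are illustrative additions, and the real pipeline may combine the two scores differently.

```python
def voice_adjusted_confidence(extraction_confidence, stt_confidence):
    """Scale a fact's confidence by how reliable the transcript was."""
    for name, value in (("extraction_confidence", extraction_confidence),
                        ("stt_confidence", stt_confidence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return extraction_confidence * stt_confidence
```

With the values from the diagram, an explicit fact scored 0.90 from a transcript with STT confidence 0.92 would be stored at roughly 0.83.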
7. Event Sources ↔ Experience Memory
The Event Monitor polls external sources and cross-references against the knowledge graph. Integration points:
| Event Source | Polling Method | Data Format | Relevance Filter |
|---|---|---|---|
| Weather API | REST poll every 30 min | JSON (location, severity, timeframe) | Match against user's lives_in, planning |
| News API | REST poll every 15 min | JSON (headlines, topics, entities) | Match against user's interested_in, works_at |
| Market data | WebSocket stream (if enabled) | JSON (ticker, price, change) | Match against user's invested_in, tracks |
| Calendar (user's own) | CalDAV sync every 5 min | iCal events | Upcoming events within N-day lookahead |
| Email digest (if email skill enabled) | Skill reports summaries | Structured JSON | Match against known entities and active tasks |
graph TB
subgraph External Sources
WEATHER["Weather API<br/>Poll: 30 min"]
NEWS["News API<br/>Poll: 15 min"]
MARKET["Market Data<br/>WebSocket stream"]
CAL["Calendar<br/>CalDAV: 5 min"]
end
subgraph Event Monitor
POLL["Poller / Subscriber"]
NORM["Event Normalizer<br/>Canonical event schema"]
REL["Relevance Filter<br/>Graph-backed scoring"]
end
subgraph Proactive Engine
PE["Trigger Evaluation"]
OQ["Outbound Queue"]
end
WEATHER --> POLL
NEWS --> POLL
MARKET --> POLL
CAL --> POLL
POLL --> NORM --> REL
REL -->|"Score > 0.50"| PE
REL -->|"Score < 0.50"| DISCARD["Discard"]
PE --> OQ
style REL fill:#fff3cd,stroke:#856404
style PE fill:#d4edda,stroke:#155724
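The relevance filter's routing step can be sketched as follows. The scoring rule used here (the highest confidence among graph entities the event mentions) is only a placeholder for the "graph-backed scoring" named in the diagram, and the event shape is hypothetical. The diagram does not say what happens at exactly 0.50; this sketch discards such events.

```python
RELEVANCE_THRESHOLD = 0.50  # from the diagram: forward when score > 0.50


def score_event(event_entities, graph_confidence):
    """Placeholder scoring: best confidence among entities the graph knows."""
    matched = [graph_confidence[e] for e in event_entities if e in graph_confidence]
    return max(matched, default=0.0)


def route(events, graph_confidence, threshold=RELEVANCE_THRESHOLD):
    """Split normalized events into (forwarded, discarded) lists of (id, score)."""
    forwarded, discarded = [], []
    for event in events:
        score = score_event(event["entities"], graph_confidence)
        target = forwarded if score > threshold else discarded
        target.append((event["id"], score))
    return forwarded, discarded
```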
Part II: Testing Strategy
8. Testing Pyramid
Experience Memory requires testing at five levels, each catching different classes of bugs.
graph TB
subgraph Level 5: Experience Quality Evaluation
L5["Does the agent feel like it<br/>knows the user after 50 conversations?<br/>Human evaluation + LLM-as-judge"]
end
subgraph Level 4: Scenario Simulation
L4["Multi-conversation scenarios<br/>running compressed timelines.<br/>Verify emergent behavior."]
end
subgraph Level 3: Integration Tests
L3["Component-to-component:<br/>Agent → EM → Neo4j → Outbound Queue<br/>Full data flow verification."]
end
subgraph Level 2: Component Tests
L2["Extraction Pipeline, Graph Diff Engine,<br/>Proactive Engine, Background Workers<br/>tested in isolation with mocked deps."]
end
subgraph Level 1: Unit Tests
L1["Individual functions:<br/>confidence scoring formula,<br/>hedge detection, temporal parsing,<br/>Cypher query builders."]
end
L1 --> L2 --> L3 --> L4 --> L5
style L1 fill:#d4edda,stroke:#155724
style L2 fill:#d4edda,stroke:#155724
style L3 fill:#fff3cd,stroke:#856404
style L4 fill:#cce5ff,stroke:#004085
style L5 fill:#e8daef,stroke:#6c3483
9. Level 1: Unit Tests
| Component | Test Focus | Example Test Cases |
|---|---|---|
| Confidence scorer | Formula correctness | explicit + no_hedge + strong_sentiment = 0.90 × 1.0 × 1.0 = 0.90; inferred + moderate_hedge + neutral = 0.45 × 0.65 × 0.80 = 0.234 |
| | Boundary conditions | confidence never exceeds 1.0 after reinforcement; confidence never goes below 0.0 after decay |
| Hedge detector | Keyword classification | "loves" → no_hedge, "may like" → moderate_hedge |
| | Edge cases | "I don't think she hates it" → complex negation handling |
| Temporal parser | Type classification | "this year" → wish, expires: 2026-12-31; "always" → trait, no expiry; "last Tuesday" → episode, occurred_at: 2026-02-10 |
| | Relative dates | "next month" → correct absolute date |
| Confidence decay | Math correctness | 0.85 × (1 - 0.01) = 0.8415 after 1 cycle; edge archived when confidence < 0.15 |
| Cypher builders | Query correctness | query_entity("Lena", min_confidence=0.5) → valid Cypher |
| | Injection safety | entity name with quotes/special chars → escaped |
Test count target: ~200 unit tests covering all formula paths and edge cases.
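A few of the cases above, written as pytest-style tests against a minimal scorer. The multiplicative form and the numeric factors are taken from the example test cases in the table; the function names and lookup tables are hypothetical, and only the factor values that appear in the table are included.

```python
import math

# Factor values that appear in the unit test table; others are not specified.
BASE = {"explicit": 0.90, "inferred": 0.45}
HEDGE = {"no_hedge": 1.0, "moderate_hedge": 0.65}
SENTIMENT = {"strong_sentiment": 1.0, "neutral": 0.80}
DECAY_RATE = 0.01
ARCHIVE_BELOW = 0.15


def score(source, hedge, sentiment):
    """Confidence = base(source) x hedge factor x sentiment factor."""
    return BASE[source] * HEDGE[hedge] * SENTIMENT[sentiment]


def decay(confidence, cycles=1, rate=DECAY_RATE):
    """Apply multiplicative decay once per cycle, never dropping below 0.0."""
    return max(0.0, confidence * (1 - rate) ** cycles)


def should_archive(confidence):
    return confidence < ARCHIVE_BELOW


def test_formula_paths():
    assert math.isclose(score("explicit", "no_hedge", "strong_sentiment"), 0.90)
    assert math.isclose(score("inferred", "moderate_hedge", "neutral"), 0.234)


def test_decay_and_archival():
    assert math.isclose(decay(0.85), 0.8415)
    assert decay(0.85, cycles=10_000) >= 0.0
    assert should_archive(0.149) and not should_archive(0.15)
```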
10. Level 2: Component Tests
Each major component tested in isolation with mocked dependencies.
10.1 Extraction Pipeline Tests
graph LR
INPUT["Test Input:<br/>Curated conversation<br/>transcripts with<br/>known entities"] --> EP["Extraction<br/>Pipeline"]
MOCK_LLM["Mock Small LLM<br/>Returns predetermined<br/>structured JSON"] --> EP
EP --> OUTPUT["Pipeline Output:<br/>Extracted edges<br/>with scores"]
OUTPUT --> ASSERT["Assertions:<br/>- Correct entities found<br/>- Relations properly typed<br/>- Confidence scores in range<br/>- Temporal types assigned<br/>- Hedges detected"]
style INPUT fill:#cce5ff,stroke:#004085
style MOCK_LLM fill:#e8daef,stroke:#6c3483
style ASSERT fill:#d4edda,stroke:#155724
Test corpus structure:
| Category | Input Example | Expected Extraction |
|---|---|---|
| Explicit fact | "My wife's name is Lena" | Person(Lena) →wife_of→ User, confidence: 0.90 |
| Hedged preference | "She may like kitchen chairs" | Lena →may_want→ Kitchen chairs, confidence: 0.55, type: wish |
| Strong preference | "She absolutely loves Malbec" | Lena →loves→ Malbec, confidence: 0.95, type: trait |
| Negation | "She doesn't drink beer" | Lena →dislikes→ Beer, confidence: 0.85, type: trait |
| Indirect inference | "Pick up the kids from school" | User →has_children→ Children, confidence: 0.85, source: inference |
| Temporal | "We're going to Tokyo in March" | User →planning→ Tokyo trip, type: wish, dates: March 2026 |
| Contradiction | "Actually she's turning 46, not 47" | Correction: Lena →age→ 46, revise previous |
| No new info | "Thanks, that's helpful" | No graph mutation |
| Complex sentence | "My wife would kill me if I forgot again, she's turning 47 in December" | Multiple: relationship confirmed, negative episode, age inferred, birthday month |
Minimum test corpus: 100 labeled examples across all categories, with at least 10 per category.
10.2 Graph Diff Engine Tests
graph LR
subgraph Test Setup
SEED["Seed Neo4j<br/>with known graph state"]
INPUT_E["Input edges<br/>from extraction pipeline"]
end
subgraph Execution
GDE["Graph Diff Engine"]
end
subgraph Assertions
A1["INSERT: new edge created<br/>with correct properties"]
A2["REINFORCE: confidence boosted,<br/>last_reinforced updated"]
A3["CONTRADICT: conflict flagged,<br/>resolution correct"]
A4["SKIP: no graph mutation<br/>when no new info"]
A5["MERGE: existing edge enriched<br/>without duplication"]
end
SEED --> GDE
INPUT_E --> GDE
GDE --> A1
GDE --> A2
GDE --> A3
GDE --> A4
GDE --> A5
style SEED fill:#cce5ff,stroke:#004085
style GDE fill:#fff3cd,stroke:#856404
style A1 fill:#d4edda,stroke:#155724
| Scenario | Seed State | Input | Expected Operation |
|---|---|---|---|
| Brand new entity | Empty graph | Lena →wife_of→ User | INSERT both nodes + edge |
| Known entity, new relation | Lena exists | Lena →likes→ Malbec | INSERT edge only |
| Reinforcement | Lena →likes→ Flowers (0.70) | Lena →likes→ Flowers (0.90) | REINFORCE: confidence → 0.82 |
| Contradiction | Lena →age→ 47 (0.80) | Lena →age→ 46 (0.90) | CONTRADICT → REVISE to 46 |
| More specific | Lena →likes→ Wine (0.85) | Lena →prefers→ Malbec (0.85) | MERGE: add specific edge, keep general |
| Duplicate | Lena →wife_of→ User (0.95) | Lena →wife_of→ User (0.90) | REINFORCE (not duplicate) |
| Expired wish | Kitchen chairs (expired 2025-12) | Kitchen chairs mentioned again | INSERT new wish with new expiry |
10.3 Proactive Engine Tests
| Scenario | Input Trigger | Graph State | Expected Output |
|---|---|---|---|
| Auto-execute | DOB discovered | Lena DOB: Dec 2 (0.99) | Auto-create birthday reminder |
| Suggest, high confidence | Deal found for known interest | Lena →likes→ Malbec (0.90) + deal | Suggestion in outbound queue |
| Suggest, low confidence | Deal found for hedged wish | Lena →may_want→ Chairs (0.55) + deal | Casual suggestion in queue |
| Defer | Possible gift idea, low confidence | Lena →may_want→ something (0.30) | Defer, queue for reinforcement |
| Probe generated | Knowledge gap + matching context | wine conversation + no wife wine pref | Probe in outbound queue |
| Probe suppressed | Knowledge gap + no matching context | coding conversation + no wife wine pref | Probe stays in queue |
| Frequency limit | 3 probes already delivered this week | Any trigger | Suppress, respect limit |
| Timing suppressed | Good probe, but 2am user time | Any trigger | Queue with earliest_delivery constraint |
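The last two rows (frequency limit and timing suppression) can be sketched as a single delivery decision. The limit of 3 per week comes from the table and the 22:00 to 07:00 window from the `quiet_hours` defaults in the configuration schema; the function names and the returned tuple shape are assumptions for illustration.

```python
from datetime import datetime, time, timedelta

QUIET_START, QUIET_END = time(22, 0), time(7, 0)  # quiet_hours defaults
WEEKLY_PROBE_LIMIT = 3                            # from the frequency-limit row


def in_quiet_hours(t, start=QUIET_START, end=QUIET_END):
    """True if local time t falls inside the quiet window."""
    if start <= end:
        return start <= t < end
    return t >= start or t < end  # window wraps past midnight


def delivery_decision(now_local, probes_this_week):
    """Return (action, earliest_delivery) for a candidate probe."""
    if probes_this_week >= WEEKLY_PROBE_LIMIT:
        return ("suppress", None)
    if in_quiet_hours(now_local.time()):
        # Next occurrence of the quiet-window end that lies after now.
        earliest = datetime.combine(now_local.date(), QUIET_END)
        if earliest <= now_local:
            earliest += timedelta(days=1)
        return ("queue", earliest)
    return ("deliver", None)
```

`now_local` is expected to already be in the user's timezone; converting from UTC is left out of the sketch.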
11. Level 3: Integration Tests
Full data flow tests with real Neo4j (test instance) and real Redis, but mocked LLMs.
graph TB
subgraph Test Harness
DRIVER["Test Driver<br/>Simulates agent interactions"]
MOCK_AG["Mock Agent Runtime<br/>Sends interactions,<br/>pulls probes"]
end
subgraph Real Services (Dockerized)
EM["Experience Memory<br/>Service"]
NEO_T["Neo4j<br/>(test instance)"]
REDIS_T["Redis<br/>(test instance)"]
VEC_T["Vector Store<br/>(test instance)"]
end
subgraph Mocked Services
LLM_MOCK["LLM Mock Server<br/>Deterministic responses<br/>from test corpus"]
end
DRIVER --> MOCK_AG
MOCK_AG --> EM
EM --> NEO_T
EM --> REDIS_T
EM --> VEC_T
EM --> LLM_MOCK
DRIVER -->|"Assert graph state"| NEO_T
DRIVER -->|"Assert queue contents"| REDIS_T
style DRIVER fill:#cce5ff,stroke:#004085
style NEO_T fill:#d4edda,stroke:#155724
style REDIS_T fill:#d4edda,stroke:#155724
style LLM_MOCK fill:#e8daef,stroke:#6c3483
Integration Test Scenarios
| Test Name | Steps | Assertions |
|---|---|---|
| Full extraction flow | Send 1 interaction → wait for extraction → check graph | Correct nodes/edges in Neo4j |
| Reinforcement over 3 turns | Send 3 interactions mentioning same preference | Edge confidence increases monotonically |
| Contradiction and revision | Send fact, then contradicting fact | Old edge revised, new edge created with higher confidence |
| Probe generation and delivery | Create knowledge gap, send matching-context interaction | Probe appears in outbound queue with correct context tags |
| Probe delivery timing | Create probe, send 5 interactions with wrong context | Probe NOT delivered until matching context |
| Background decay | Seed edges, trigger decay worker | Confidence decreased by correct amount |
| Background archival | Seed low-confidence edges, trigger archival | Edges below the archival threshold (confidence < 0.15) archived |
| Inference chain | Seed multi-hop graph, trigger inference worker | New low-confidence edges created connecting distant nodes |
| Event-driven starter | Seed graph with location + event, inject weather alert | Starter in outbound queue with correct timing constraints |
| Skill outcome reporting | Simulate skill completion event | Episode created, procedure success_rate updated |
| Failure recovery | Kill Neo4j mid-extraction, restart | Extraction retries, no data loss, agent buffer drained |
Test infrastructure: Docker Compose with Neo4j, Redis, Vector Store, LLM mock server. Tests run in CI on every PR.
12. Level 4: Scenario Simulation
This is the critical testing level. We need to verify that Experience Memory produces the right emergent behavior over time, not just correct individual operations.
12.1 Scenario Simulator
graph TB
subgraph Scenario Definition
PERSONA["User Persona<br/>Predefined personality,<br/>preferences, life situation"]
SCRIPT["Conversation Script<br/>50-100 interactions<br/>spanning simulated weeks"]
EXPECTED["Expected Graph State<br/>What EM should know<br/>after all interactions"]
end
subgraph Simulator
SIM["Scenario Simulator<br/>Feeds interactions with<br/>simulated timestamps"]
TIME["Time Warp<br/>Compress weeks<br/>into minutes.<br/>Trigger decay/revision<br/>between simulated days"]
end
subgraph Evaluation
GRAPH_CMP["Graph Comparison<br/>Expected vs actual<br/>nodes, edges, confidence"]
PROBE_EVAL["Probe Evaluation<br/>Were probes generated<br/>at the right moments?"]
STARTER_EVAL["Starter Evaluation<br/>Were starters relevant<br/>and well-timed?"]
FALSE_POS["False Positive Check<br/>Any incorrect facts<br/>stored with high confidence?"]
end
PERSONA --> SIM
SCRIPT --> SIM
SIM --> TIME
TIME --> GRAPH_CMP
TIME --> PROBE_EVAL
TIME --> STARTER_EVAL
TIME --> FALSE_POS
EXPECTED --> GRAPH_CMP
style SIM fill:#cce5ff,stroke:#004085
style TIME fill:#fff3cd,stroke:#856404
style FALSE_POS fill:#fdd,stroke:#c00
12.2 Test Personas
| Persona | Profile | Key Test Focus |
|---|---|---|
| Alex the CTO | Married (Lena), 2 kids, works at FinTech startup, codes in Python, travels quarterly | Full lifecycle: family, work, coding preferences, travel patterns |
| Maya the Freelancer | Single, graphic designer, multiple clients, irregular schedule, budget-conscious | Temporal patterns with irregular hours, multi-client context switching |
| James the Retiree | Widower, hobby gardener, follows markets, has grandchildren, less tech-savvy | Gentle probing, simpler language, health/wellness sensitivity |
| Sara the Student | University, part-time job, roommates, limited budget, social media active | Fast-changing interests, budget constraints, social context |
12.3 Scenario Script Example: "Alex the CTO"
# Week 1
Turn 1: "Help me write a Python script to parse CSV files"
→ EM should extract: User prefers Python (explicit)
Turn 2: "Use type hints please, I hate untyped code"
→ EM should extract: strong preference for type hints (explicit, high confidence)
Turn 3: "Can you move my Thursday standup to Friday?"
→ EM should infer: has recurring standup (indirect inference)
# Week 2
Turn 4: "I want to buy a present for my wife"
→ EM should detect: knowledge gap about wife
Turn 5: "Her name is Lena. Birthday is December 2, 1979"
→ EM should create: Person(Lena), wife relation, DOB
→ EM should auto-suggest: birthday reminder
Turn 6: "She likes flowers and postcards. Maybe kitchen chairs this year"
→ EM should create: traits (flowers, postcards) + wish (chairs)
→ EM should spawn: deal monitoring if skill available
# Week 3
Turn 7: [wine conversation]
→ EM should surface probe: "Does Lena enjoy wine?"
Turn 8: "She loves reds, especially Malbec"
→ EM should create: wine/Malbec preferences for Lena
# Week 4 (simulated) — Background revision runs
→ EM should verify: Acme Corp still exists (public fact)
→ EM should run inference: Malbec → Argentina → user's travel interest?
→ EM should decay: any edges not reinforced
# Week 5
Turn 9: "We're planning a trip to Tokyo in March"
→ EM should create: Tokyo trip, March dates
→ EM should enrich: travel preferences from past patterns
Turn 10: "Book direct flights, I always prefer direct"
→ EM should reinforce: direct flight preference
12.4 Evaluation Metrics
| Metric | Target | Measurement |
|---|---|---|
| Precision — facts stored are correct | > 95% | Compare graph to ground truth persona |
| Recall — facts mentioned are captured | > 85% | Count of expected facts found in graph |
| Confidence calibration — confidence scores match actual correctness | Calibration error < 0.10 | Compare confidence to binary correct/incorrect |
| False positive rate — incorrect facts stored at high confidence | < 2% | Count of wrong facts with confidence > 0.70 |
| Probe relevance — probes generated at appropriate moments | > 80% relevant | Human evaluation of probe timing and context fit |
| Probe naturalness — probes feel conversational, not interrogative | > 85% natural | Human evaluation / LLM-as-judge |
| Inference accuracy — overnight inferences are reasonable | > 60% useful | Human evaluation of inferred connections |
| Contradiction handling — corrections properly applied | 100% | All explicit corrections reflected in graph |
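The automatable metrics in the table (precision, recall, false positive rate, and calibration) can be computed from the final graph and the persona's ground truth. The sketch below makes several assumptions: facts are reduced to hashable keys, the false positive rate is taken over all stored facts, and calibration is simplified to the mean absolute gap between confidence and the 0/1 correctness label rather than a binned calibration error.

```python
def evaluate(graph, truth, high_conf=0.70):
    """graph: {fact_key: confidence}; truth: set of fact keys that are correct."""
    stored = set(graph)
    correct = stored & truth
    precision = len(correct) / len(stored) if stored else 1.0
    recall = len(correct) / len(truth) if truth else 1.0
    # Wrong facts stored above the high-confidence cutoff from the table.
    wrong_high = [f for f, c in graph.items() if f not in truth and c > high_conf]
    false_positive_rate = len(wrong_high) / len(stored) if stored else 0.0
    # Simplified calibration: mean |confidence - correctness| over stored facts.
    gaps = [abs(c - (1.0 if f in truth else 0.0)) for f, c in graph.items()]
    calibration_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "false_positive_rate": false_positive_rate,
        "calibration_gap": calibration_gap,
    }
```

The probe and inference metrics in the table rely on human or LLM-as-judge evaluation and are not covered by this helper.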
13. Level 5: Experience Quality Evaluation
The highest-level test: does the agent feel like it knows the user?
13.1 LLM-as-Judge Evaluation
sequenceDiagram
participant SIM as Scenario<br/>Simulator
participant EM as Experience<br/>Memory
participant AR as Agent Runtime<br/>(with EM)
participant AR_B as Agent Runtime<br/>(without EM, baseline)
participant JUDGE as LLM Judge<br/>(evaluator)
SIM->>EM: Run 50-interaction scenario
SIM->>AR: Present test prompt
AR->>AR: Generate response with EM context
SIM->>AR_B: Present same test prompt
AR_B->>AR_B: Generate response without EM
SIM->>JUDGE: "Here is a user profile.<br/>Here are two agent responses<br/>to the same question.<br/>Which response demonstrates<br/>better understanding of the user?<br/>Rate on 5 dimensions."
JUDGE-->>SIM: Evaluation scores
13.2 Evaluation Dimensions
| Dimension | What It Measures | Example |
|---|---|---|
| Personalization | Does the response reflect user's known preferences? | Agent suggests Python with type hints without being asked |
| Anticipation | Does the agent predict needs before being asked? | Agent mentions upcoming birthday during gift conversation |
| Consistency | Does the agent maintain coherent knowledge across topics? | Wife's name is always Lena, never confused |
| Naturalness | Do probes and suggestions feel organic? | Wine → birthday probe feels like a thoughtful friend |
| Restraint | Does the agent avoid overstepping or being creepy? | Agent doesn't volunteer private info unprompted |
13.3 A/B Comparison Framework
| Test | Agent A (with EM) | Agent B (baseline) | Evaluator |
|---|---|---|---|
| Same-session personalization | Uses graph context | Uses conversation history only | LLM judge |
| Cross-session knowledge | Remembers facts from weeks ago | Each session starts fresh | LLM judge |
| Proactive suggestions | Offers birthday reminder, deal alerts | Only responds to explicit requests | Human evaluation |
| Preference-aware skill invocation | Passes EM preferences to skills | Asks user every time | Task completion speed |
| Error recovery after correction | Updates graph, never repeats mistake | May repeat error in future sessions | Correctness tracking |
Part III: Configuration¶
14. Configuration Philosophy¶
Sensible defaults. Minimal required configuration. Expert overrides available.
The configuration is organized into three tiers:
| Tier | Audience | Changed When | Examples |
|---|---|---|---|
| Defaults | Nobody (baked in) | Never by user, only by SecretAI engineering | Confidence formula weights, graph schema |
| Profile | Every user (on setup) | During onboarding or in Settings | Timezone, language, preferred channels, proactivity level |
| Expert | Power users / developers | Via Control UI or config file | Decay rates, probe frequency, inference schedule |
15. Configuration Schema¶
# experience_memory.yaml
# SecretAI Rails — Experience Memory Configuration
# ============================================================
# PROFILE TIER — User-facing settings
# ============================================================
profile:
  # User timezone: affects scheduling of starters, revision, probes
  timezone: "America/Los_Angeles"
  # Primary language for probe generation
  language: "en"
  # How proactive should the agent be?
  #   conservative: Only suggest when explicitly relevant, minimal probing
  #   balanced: Contextual probes, moderate starters (DEFAULT)
  #   proactive: Frequent probes, more conversation starters, anticipatory
  proactivity_level: "balanced"
  # Channels where agent may initiate conversation (starters)
  # Agent will never initiate on channels not listed here
  starter_channels:
    - "telegram"
    - "whatsapp"
  # Do Not Disturb hours: no agent-initiated contact
  quiet_hours:
    start: "22:00"
    end: "07:00"
  # Should the agent explain why it knows something when asked?
  #   If false, agent says "I remember from our conversations"
  #   If true, agent traces provenance chain
  explain_knowledge: true
# ============================================================
# EXTRACTION — Controls how facts are extracted from conversation
# ============================================================
extraction:
  # Enable/disable indirect inference (Pattern 4)
  # Some users may prefer only explicit fact storage
  enable_indirect_inference: true
  # Minimum confidence threshold for storing an extracted fact
  # Below this, the extraction is discarded
  min_storage_confidence: 0.25
  # Maximum entities to extract per message
  # Prevents runaway extraction on very long messages
  max_entities_per_message: 20
  # Maximum relations to extract per message
  max_relations_per_message: 30
  # Enable/disable sentiment analysis on extractions
  enable_sentiment: true
  # STT confidence multiplier: reduce extraction confidence
  # when speech-to-text confidence is low
  stt_confidence_floor: 0.70  # Below this STT score, skip extraction entirely
  stt_confidence_scale: true  # Scale extraction confidence by STT confidence
# ============================================================
# CONFIDENCE — Controls the confidence scoring model
# ============================================================
confidence:
  # Base confidence scores by acquisition mechanism
  base_scores:
    explicit: 0.90       # User directly states a fact
    observational: 0.65  # Pattern detected from behavior
    inferential: 0.45    # Cross-context inference
    reflective: 0.50     # Default for outcome-based learning
  # Hedge multipliers
  hedge_multipliers:
    none: 1.00      # "loves", "always"
    mild: 0.90      # "likes", "usually"
    moderate: 0.65  # "may", "might"
    strong: 0.50    # "I think", "maybe", "possibly"
  # Reinforcement boost per additional episode
  reinforcement_boost: 0.08
  # Maximum confidence (hard cap)
  max_confidence: 0.99
  # Minimum confidence before archival
  archive_threshold: 0.15
  # Trait edges have a slower decay (protected)
  trait_decay_protection: 0.5  # Multiply decay rate by this for traits
# ============================================================
# DECAY — Controls how knowledge ages
# ============================================================
decay:
  # Default decay rate per cycle (monthly)
  default_rate: 0.02  # 2% per month
  # Decay rates by temporal type (override default)
  rates_by_type:
    trait: 0.005   # Very slow, near-permanent
    state: 0.00    # No time decay, only contradictions
    wish: 0.04     # Faster decay for time-bounded desires
    episode: 0.08  # Fast decay for one-time events
  # Decay cycle frequency
  cycle_frequency: "weekly"  # "daily" | "weekly" | "monthly"
  # Grace period: no decay for N days after last reinforcement
  grace_period_days: 30
# ============================================================
# PROBING — Controls contextual probing behavior
# ============================================================
probing:
  # Maximum probes per conversation
  max_probes_per_conversation: 1
  # Maximum probes per day
  max_probes_per_day: 3
  # Maximum probes per week
  max_probes_per_week: 10
  # Minimum conversation turn before probing
  # Don't probe in the first N turns of a conversation
  min_turn_for_probe: 3
  # Minimum context-fit score to deliver a probe
  min_context_fit: 0.70
  # Retry limit: after N failed context matches, lower priority
  max_probe_attempts: 5
  # Cooldown after user ignores a probe (days)
  ignore_cooldown_days: 7
  # Cooldown after user deflects a probe (days)
  deflect_cooldown_days: 14
# ============================================================
# STARTERS — Controls agent-initiated conversations
# ============================================================
starters:
  # Enable/disable conversation starters entirely
  enabled: true
  # Maximum starters per day (across all types)
  max_per_day: 3
  # Maximum starters per week
  max_per_week: 10
  # Minimum relevance score to initiate contact
  min_relevance: 0.50
  # Per-type limits and settings
  types:
    alert:
      enabled: true
      max_per_day: 5              # Alerts can exceed normal limits
      min_relevance: 0.40         # Lower threshold for safety-related
      override_quiet_hours: true  # Weather warnings during DND
    opportunity:
      enabled: true
      max_per_day: 2
      min_relevance: 0.60
      override_quiet_hours: false
    revision:
      enabled: true
      max_per_day: 1
      min_relevance: 0.50
      override_quiet_hours: false
    insight:
      enabled: true
      max_per_week: 2  # Weekly limit, not daily
      min_relevance: 0.70
      override_quiet_hours: false
    anticipation:
      enabled: true
      lookahead_days: 10  # How far ahead to check
      min_relevance: 0.50
      override_quiet_hours: false
# ============================================================
# RISK MODEL — Controls proactive suggestion thresholds
# ============================================================
risk_model:
  # Auto-execute threshold: high confidence AND low cost
  auto_execute:
    min_confidence: 0.90
    max_cost_category: "none"  # "none" | "low" | "medium" | "high"
  # Suggest threshold: moderate+ confidence AND low-medium cost
  suggest:
    min_confidence: 0.50
    max_cost_category: "medium"
  # Casual mention threshold: any confidence, low cost
  casual_mention:
    min_confidence: 0.30
    max_cost_category: "low"
  # Defer threshold: everything else
  # (implicit: anything not matching above is deferred)
  # Cost categories for action types
  action_costs:
    create_reminder: "none"
    suggest_product: "low"
    spawn_monitoring_task: "low"
    send_message_on_behalf: "high"
    make_purchase: "high"
    modify_calendar: "medium"
    share_information: "medium"
# ============================================================
# BACKGROUND — Controls revision and inference workers
# ============================================================
background:
  # Revision schedule (cron syntax, in user's timezone)
  revision_schedule: "0 2 * * *"  # 2:00 AM daily
  # Inference schedule
  inference_schedule: "0 3 * * *"  # 3:00 AM daily
  # Episode clustering schedule
  clustering_schedule: "0 4 * * 0"  # 4:00 AM Sundays
  # Maximum revision batch size (edges per run)
  revision_batch_size: 100
  # Maximum inference chain depth (hops)
  max_inference_depth: 3
  # Maximum inferred edges per run
  max_inferred_edges: 20
  # Public fact verification: use web search to verify
  enable_public_fact_verification: true
  # Maximum web searches per revision cycle
  max_verification_searches: 10
# ============================================================
# EVENT MONITOR — Controls external event monitoring
# ============================================================
event_monitor:
  # Enable/disable entire event monitoring
  enabled: true
  # Source-specific settings
  sources:
    weather:
      enabled: true
      poll_interval_minutes: 30
      severity_threshold: "warning"  # "watch" | "warning" | "emergency"
    news:
      enabled: true
      poll_interval_minutes: 15
      max_articles_per_poll: 20
    market:
      enabled: false  # Opt-in only
      # poll_interval_minutes: 5
      # change_threshold_percent: 2.0
    calendar:
      enabled: true
      sync_interval_minutes: 5
      lookahead_days: 14
# ============================================================
# LLM — Controls LLM usage for extraction and inference
# ============================================================
llm:
  # Small LLM for extraction pipeline
  small:
    provider: "anthropic"  # "anthropic" | "openai" | "local"
    model: "claude-haiku-4-5-20251001"
    max_tokens: 1024
    temperature: 0.1  # Low for structured extraction
    timeout_ms: 2000
    retry_count: 2
    fallback: "skip"  # "skip" | "queue_retry": what to do on failure
  # Large LLM for inference and synthesis
  large:
    provider: "anthropic"
    model: "claude-sonnet-4-5-20250929"
    max_tokens: 4096
    temperature: 0.3  # Slightly higher for creative inference
    timeout_ms: 30000
    retry_count: 1
    fallback: "queue_retry"
  # Token budget limits (per day)
  daily_token_budget:
    small: 500000  # ~500K tokens/day for extraction
    large: 100000  # ~100K tokens/day for inference
# ============================================================
# STORAGE — Controls graph database and vector store
# ============================================================
storage:
  neo4j:
    uri: "bolt://localhost:7687"
    database: "experience_memory"
    max_connections: 20
    query_timeout_ms: 5000
  vector_store:
    provider: "qdrant"  # "qdrant" | "chroma"
    uri: "http://localhost:6333"
    collection: "episodes"
    embedding_model: "text-embedding-3-small"
    embedding_dimensions: 1536
  queue:
    provider: "redis"  # "redis" | "nats"
    uri: "redis://localhost:6379"
    inbound_queue: "em:inbound"
    outbound_queue: "em:outbound"
    max_queue_size: 10000
    ttl_hours: 72  # Messages expire after 72h
# ============================================================
# PRIVACY — Controls data classification and sharing
# ============================================================
privacy:
  # Enable/disable experience sharing with other agents
  sharing_enabled: false  # Opt-in only
  # Minimum privacy level for shared data
  sharing_min_level: "L1"  # Only L0 and L1 shared
  # Enable/disable differential privacy on exports
  differential_privacy: true
  # Epsilon for differential privacy (lower = more private)
  dp_epsilon: 1.0
  # Auto-classify PII in extracted entities
  auto_pii_detection: true
  # Retention period for archived (low-confidence) edges
  archive_retention_days: 365
# ============================================================
# OBSERVABILITY — Controls metrics and logging
# ============================================================
observability:
  # Metrics export
  metrics:
    enabled: true
    export_interval_seconds: 60
    endpoint: "http://localhost:9090/metrics"  # Prometheus
  # Audit logging
  audit_log:
    enabled: true
    log_extractions: true       # Log every extraction operation
    log_graph_mutations: true   # Log every INSERT/REINFORCE/CONTRADICT
    log_probes_delivered: true  # Log every probe sent to user
    log_starters_delivered: true
    retention_days: 90
  # Health check
  health_check:
    interval_seconds: 30
    neo4j_timeout_ms: 1000
    redis_timeout_ms: 500
    vector_store_timeout_ms: 1000
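The risk_model thresholds in the schema above can be read as an ordered decision: each tier requires a minimum confidence and a maximum cost category, and anything that matches no tier is deferred. The sketch below is one possible reading, not the shipped logic. The `classify_action` name, the tier evaluation order, and the treatment of unknown actions as "high" cost are assumptions.

```python
# Cost categories ordered from cheapest to most expensive.
COST_ORDER = ["none", "low", "medium", "high"]

# (tier, min_confidence, max_cost_category), values copied from risk_model.
# Evaluation order is an assumption: most permissive outcome first.
TIERS = [
    ("auto_execute", 0.90, "none"),
    ("suggest", 0.50, "medium"),
    ("casual_mention", 0.30, "low"),
]

# Copied from risk_model.action_costs.
ACTION_COSTS = {
    "create_reminder": "none",
    "suggest_product": "low",
    "spawn_monitoring_task": "low",
    "send_message_on_behalf": "high",
    "make_purchase": "high",
    "modify_calendar": "medium",
    "share_information": "medium",
}


def classify_action(action: str, confidence: float) -> str:
    """Return the first tier whose confidence and cost limits both hold."""
    # Assumption: actions missing from the table are treated as high cost.
    cost_rank = COST_ORDER.index(ACTION_COSTS.get(action, "high"))
    for tier, min_confidence, max_cost in TIERS:
        if confidence >= min_confidence and cost_rank <= COST_ORDER.index(max_cost):
            return tier
    return "defer"  # implicit tier: nothing above matched
```

Under this reading, `classify_action("make_purchase", 0.99)` still returns `"defer"`, because no tier admits a high-cost action regardless of confidence.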
16. Configuration Profiles — Proactivity Presets¶
The proactivity_level setting maps to a coherent set of overrides:
| Setting | Conservative | Balanced (Default) | Proactive |
|---|---|---|---|
| probing.max_probes_per_conversation | 0 | 1 | 2 |
| probing.max_probes_per_week | 3 | 10 | 20 |
| probing.min_context_fit | 0.90 | 0.70 | 0.50 |
| probing.min_turn_for_probe | 5 | 3 | 2 |
| starters.enabled | false | true | true |
| starters.max_per_day | 0 | 3 | 5 |
| starters.max_per_week | 0 | 10 | 25 |
| starters.min_relevance | N/A | 0.50 | 0.35 |
| extraction.enable_indirect_inference | false | true | true |
| background.max_inferred_edges | 0 | 20 | 50 |
| risk_model.auto_execute.min_confidence | 0.99 | 0.90 | 0.80 |
Users select a preset during onboarding. Power users can override individual settings.
graph LR
ONBOARD["Onboarding:<br/>'How proactive should<br/>your agent be?'"]
ONBOARD -->|"Conservative"| C["Minimal probing<br/>No starters<br/>No inference<br/>Only explicit facts"]
ONBOARD -->|"Balanced"| B["Contextual probing<br/>Moderate starters<br/>Inference enabled<br/>Standard thresholds"]
ONBOARD -->|"Proactive"| P["Frequent probing<br/>Active starters<br/>Deep inference<br/>Lower thresholds"]
C --> OVERRIDE["User can override<br/>any individual setting<br/>in Control UI"]
B --> OVERRIDE
P --> OVERRIDE
style B fill:#d4edda,stroke:#155724
style OVERRIDE fill:#fff3cd,stroke:#856404
17. Runtime Configuration Hot-Reload¶
Some configuration changes should take effect immediately without restarting the EM service. Others require a restart.
| Config Section | Hot-Reloadable | Why |
|---|---|---|
| profile.* | ✅ Yes | User preferences change frequently |
| probing.* | ✅ Yes | User may want to adjust probing mid-conversation |
| starters.* | ✅ Yes | User may want to mute starters temporarily |
| risk_model.* | ✅ Yes | Threshold tuning shouldn't require restart |
| confidence.base_scores | ✅ Yes | Tuning during evaluation |
| decay.* | ✅ Yes | Affects next decay cycle |
| extraction.* | ⚠️ Partial | enable_indirect_inference is hot; max_entities_per_message requires pipeline restart |
| background.*_schedule | ❌ No | Cron schedules registered at startup |
| llm.* | ❌ No | Model changes require reconnection |
| storage.* | ❌ No | Database connections established at startup |
sequenceDiagram
participant UI as Control UI
participant API as EM Config API
participant EM as EM Service
participant CACHE as Config Cache
UI->>API: PUT /config {probing: {max_probes_per_day: 5}}
API->>API: Validate against schema
API->>CACHE: Update in-memory config
API->>EM: Signal: CONFIG_CHANGED
EM->>CACHE: Reload affected sections
EM-->>API: ACK: config updated
Note over EM: Next probe check uses<br/>new max_probes_per_day = 5.<br/>No restart needed.
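Before signalling CONFIG_CHANGED, the Config API needs to know which changed keys can be applied live. A minimal sketch of that split follows, using glob patterns derived from the hot-reload table. The `plan_reload` name is hypothetical, the schedule pattern follows the key names in the schema (revision_schedule, inference_schedule, clustering_schedule), and treating unlisted keys as restart-required is a conservative assumption.

```python
from fnmatch import fnmatchcase

# Patterns derived from the hot-reload table; "*" also matches dots.
RESTART_PATTERNS = [
    "background.*_schedule",
    "llm.*",
    "storage.*",
    "extraction.max_entities_per_message",
]
HOT_PATTERNS = [
    "profile.*",
    "probing.*",
    "starters.*",
    "risk_model.*",
    "confidence.base_scores.*",
    "decay.*",
    "extraction.enable_indirect_inference",
]


def plan_reload(changed_keys: list[str]) -> tuple[list[str], list[str]]:
    """Split changed dotted keys into (hot_reloadable, restart_required)."""
    hot, restart = [], []
    for key in changed_keys:
        if any(fnmatchcase(key, p) for p in RESTART_PATTERNS):
            restart.append(key)
        elif any(fnmatchcase(key, p) for p in HOT_PATTERNS):
            hot.append(key)
        else:
            # Assumption: keys the table does not cover require a restart.
            restart.append(key)
    return hot, restart
```

If the restart list is non-empty, the API could apply the hot keys immediately and report the remainder to the Control UI as pending until the next restart.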
18. Configuration Validation Rules¶
| Rule | Scope | Outcome on Violation |
|---|---|---|
| probing.max_probes_per_day ≤ probing.max_probes_per_week | Cross-field | Config rejected |
| starters.max_per_day ≤ starters.max_per_week | Cross-field | Config rejected |
| confidence.archive_threshold < extraction.min_storage_confidence | Cross-field | Warning (facts stored but immediately at risk of archival) |
| decay.grace_period_days > 0 | Single field | Config rejected |
| llm.small.temperature < 0.5 | Single field | Warning (higher temp reduces extraction reliability) |
| llm.daily_token_budget.small > 0 | Single field | Config rejected |
| privacy.dp_epsilon > 0 | Single field | Config rejected |
| profile.quiet_hours.start ≠ profile.quiet_hours.end | Cross-field | Warning (empty quiet hours) |
| All starters.types.*.min_relevance ≥ 0.0 and ≤ 1.0 | Range | Config rejected |
| background.max_inference_depth ≤ 5 | Range | Warning (deep chains are expensive and speculative) |
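These rules lend themselves to a table-driven validator that separates rejections from warnings. The sketch below encodes a subset of the rules above. The `lookup` and `validate` names are hypothetical, and skipping a rule when one of its fields is absent is an assumption.

```python
from typing import Any


def lookup(config: dict, dotted: str) -> Any:
    """Return the value at a dotted path, or None if any segment is missing."""
    node: Any = config
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


# (rule text, keys, predicate that must hold, severity)
RULES = [
    ("probing.max_probes_per_day <= probing.max_probes_per_week",
     ("probing.max_probes_per_day", "probing.max_probes_per_week"),
     lambda day, week: day <= week, "error"),
    ("starters.max_per_day <= starters.max_per_week",
     ("starters.max_per_day", "starters.max_per_week"),
     lambda day, week: day <= week, "error"),
    ("confidence.archive_threshold < extraction.min_storage_confidence",
     ("confidence.archive_threshold", "extraction.min_storage_confidence"),
     lambda archive, store: archive < store, "warning"),
    ("decay.grace_period_days > 0",
     ("decay.grace_period_days",), lambda days: days > 0, "error"),
    ("privacy.dp_epsilon > 0",
     ("privacy.dp_epsilon",), lambda eps: eps > 0, "error"),
    ("background.max_inference_depth <= 5",
     ("background.max_inference_depth",), lambda depth: depth <= 5, "warning"),
]


def validate(config: dict) -> tuple[list[str], list[str]]:
    """Return (errors, warnings); a non-empty errors list means reject."""
    errors, warnings = [], []
    for text, keys, holds, severity in RULES:
        values = [lookup(config, key) for key in keys]
        if any(value is None for value in values):
            continue  # assumption: rules with missing fields are skipped
        if not holds(*values):
            (errors if severity == "error" else warnings).append(text)
    return errors, warnings
```

Returning both lists lets the Config API reject on errors while still surfacing warnings to the Control UI on an otherwise accepted change.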
19. Per-Entity Configuration Overrides¶
Some entities may need different treatment than the defaults. For example, the user might want aggressive monitoring for work-related entities but conservative handling for personal/family entities.
# Entity-level overrides (stored in Neo4j as node properties)
entity_overrides:
  - entity_match: "Lena"
    overrides:
      probing:
        max_probes_per_week: 2  # Be gentle about probing wife-related topics
      risk_model:
        suggest:
          min_confidence: 0.70  # Higher bar for suggestions involving wife
  - entity_match: "Acme Corp"
    overrides:
      starters:
        types:
          news:
            enabled: true
            min_relevance: 0.30  # Lower threshold: user wants all company news
  - entity_match: "health"
    entity_type: "concept"
    overrides:
      privacy:
        sharing_min_level: "L4"  # Never share health-related knowledge
      extraction:
        enable_indirect_inference: false  # Don't infer health facts
20. Monitoring Configuration Effectiveness¶
The configuration needs its own feedback loop. SecretAI should track how configuration choices affect user engagement and satisfaction.
graph TB
subgraph Configuration Metrics
M1["Probe acceptance rate<br/>% of probes user responds to"]
M2["Starter engagement rate<br/>% of starters that lead to conversation"]
M3["Probe annoyance signal<br/>User ignores/deflects/mutes"]
M4["Graph growth rate<br/>New edges per week"]
M5["Correction rate<br/>How often user says 'that's wrong'"]
M6["Token efficiency<br/>Useful extractions per token spent"]
end
subgraph Signals
M1 -->|"Low acceptance"| S1["Consider: probes not relevant enough<br/>→ raise min_context_fit"]
M2 -->|"Low engagement"| S2["Consider: starters not valuable enough<br/>→ raise min_relevance"]
M3 -->|"High annoyance"| S3["Consider: too many probes<br/>→ reduce max_probes_per_week"]
M4 -->|"Graph stagnant"| S4["Consider: extraction too conservative<br/>→ lower min_storage_confidence"]
M5 -->|"High corrections"| S5["Consider: extraction too aggressive<br/>→ raise confidence thresholds"]
M6 -->|"Low efficiency"| S6["Consider: wrong LLM or prompts<br/>→ tune extraction prompts"]
end
style M3 fill:#fdd,stroke:#c00
style M5 fill:#fdd,stroke:#c00
style S3 fill:#fff3cd,stroke:#856404
style S5 fill:#fff3cd,stroke:#856404
| Metric | Healthy Range | Action if Out of Range |
|---|---|---|
| Probe acceptance rate | 40–70% | Below 40%: raise min_context_fit. Above 70%: min_context_fit can be lowered slightly to learn more |
| Starter engagement rate | 30–60% | Below 30%: raise min_relevance or reduce frequency |
| Probe annoyance signal | < 20% | Above 20%: immediately reduce max_probes_per_week |
| Graph growth rate | 5–20 edges/week | Below 5: lower min_storage_confidence. Above 20: normal for active user |
| Correction rate | < 5% | Above 5%: extraction quality issue — audit pipeline |
| False positive rate (high-confidence wrong facts) | < 2% | Above 2%: critical. Lower confidence base scores or raise confidence thresholds |
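The healthy ranges in the table can be checked mechanically once the metrics are collected. The following is a rough sketch: the metric key names, the `recommend` function, and the wording of the action strings are illustrative assumptions, while the bounds come from the table. Rates are expressed as fractions (0.40 for 40%).

```python
# (metric, lower bound, action if below, upper bound, action if above)
# A bound of None means the table defines no action on that side.
CHECKS = [
    ("probe_acceptance_rate", 0.40, "raise probing.min_context_fit",
     0.70, "probing.min_context_fit can be lowered slightly"),
    ("starter_engagement_rate", 0.30,
     "raise starters.min_relevance or reduce starter frequency", None, None),
    ("probe_annoyance_signal", None, None,
     0.20, "reduce probing.max_probes_per_week"),
    ("graph_growth_edges_per_week", 5,
     "lower extraction.min_storage_confidence", None, None),
    ("correction_rate", None, None, 0.05, "audit the extraction pipeline"),
    ("false_positive_rate", None, None,
     0.02, "critical: review confidence base scores"),
]


def recommend(metrics: dict[str, float]) -> list[str]:
    """Return one recommendation per metric that falls outside its range."""
    actions = []
    for name, low, low_action, high, high_action in CHECKS:
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported in this window
        if low is not None and value < low:
            actions.append(f"{name}: {low_action}")
        elif high is not None and value > high:
            actions.append(f"{name}: {high_action}")
    return actions
```

Graph growth above 20 edges per week yields no recommendation here, matching the table's note that this is normal for an active user.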
21. Document Index¶
This document is part of the SecretAI Rails Experience Memory documentation series:
| Document | Description | Status |
|---|---|---|
| Experience Memory Summary | Technology overview, competitive positioning, roadmap | ✅ Complete |
| Security Architecture Comparison | Confidential VM vs. localhost trust model | ✅ Complete |
| Experience Memory Architecture | System diagrams, data flows, schemas, deployment | ✅ Complete |
| Integration, Testing, and Configuration | This document | ✅ Complete |
| Experience Memory API Reference | MCP tool specifications, gRPC protobuf definitions | 📋 Planned |
| Extraction Pipeline Tuning Guide | Per-stage configuration, LLM prompts, evaluation | 📋 Planned |
| Knowledge Graph Operations Manual | Neo4j migrations, backup/restore, performance | 📋 Planned |
SecretAI — Agents that learn. Memory that compounds. Privacy that's provable.