doc(memory): add documentation for token based memory and summarization#236
doc(memory): add documentation for token based memory and summarization#236sahiltyagiii wants to merge 1 commit into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughThis pull request introduces a comprehensive design plan for token-based conversation memory management. It replaces turn-count logic with token thresholds, implements non-destructive pointer-based context tracking, defines token counting infrastructure with multiple providers, and specifies background summarization workflows with edge-case mitigations. Changes
Sequence DiagramssequenceDiagram
participant App as Application
participant Memory as Session Memory
participant TokenSvc as Token Counter
participant Summarizer as Summarization Service
participant Cache as Cache (5min TTL)
rect rgb(200, 220, 240)
note over App,Cache: Background Summarization Flow
App->>Memory: Check token count
Memory->>TokenSvc: countTokens(messages, model)
TokenSvc->>Cache: Check cached count
alt Cache hit
Cache-->>TokenSvc: Return cached result
else Cache miss
TokenSvc->>TokenSvc: Call provider (API or client)
TokenSvc->>Cache: Store result with TTL
end
TokenSvc-->>Memory: TokenCountResult
Memory->>Memory: threshold exceeded?
alt Yes
Memory->>Summarizer: Split messages & generate summary
Summarizer-->>Memory: Summary message (isSummary=true)
Memory->>Memory: Update summarizedUpToMessageId pointer
Memory->>Memory: Insert summary in message list
end
end
sequenceDiagram
participant App as Application
participant Memory as Session Memory
participant LLM as LLM Provider
rect rgb(220, 240, 200)
note over App,LLM: Context Building with Pointers
App->>Memory: Build context for inference
Memory->>Memory: Get summarizedUpToMessageId pointer
alt Pointer exists
Memory->>Memory: Fetch summary message (isSummary=true)
Memory->>Memory: Collect messages after pointer
else No pointer
Memory->>Memory: Use all messages
end
Memory->>Memory: Validate context within tokenThreshold
Memory-->>App: Complete context list
App->>LLM: Send (system prompt + context + new message)
LLM-->>App: Response
App->>Memory: Append assistant message + user message
Memory->>Memory: Update lastTokenCount, lastCountedAt
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull Request Overview
This PR adds comprehensive documentation for implementing a token-based memory system as an improvement over the current turn-based conversation memory approach. The documentation outlines critical issues with the existing implementation and proposes a detailed solution using token counting, pointer-based context management, and non-destructive message storage.
Key Changes
- Detailed problem statement identifying critical failures in the current turn-based memory system
- Complete architecture design for token-based memory with provider-specific token counting implementations
- Code examples demonstrating message validation, pointer-based context building, and summarization processes
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| enableSummarization: process.env.NEUROLINK_SUMMARIZATION_ENABLED === "true", | ||
| tokenThreshold: Number(process.env.NEUROLINK_TOKEN_THRESHOLD) || DEFAULT_TOKEN_THRESHOLD, | ||
| summarizationProvider: (process.env.NEUROLINK_SUMMARIZATION_PROVIDER as AIProviderName) || "vertex", | ||
| summarizationModel: process.env.NEUROLINK_SUMMARIZATION_MODEL || "gemini-2.5-flash", |
There was a problem hiding this comment.
The default model 'gemini-2.5-flash' is inconsistent with the codebase. The actual default in src/lib/config/conversationMemory.ts uses 'gemini-2.5-flash', but many documentation files reference 'gemini-2.0-flash' models, and the model 'gemini-2.0-flash' (without version suffix) doesn't appear to exist in the codebase. Consider using 'gemini-2.0-flash-001' or clarifying if 'gemini-2.5-flash' is the intended default.
| summarizationModel: process.env.NEUROLINK_SUMMARIZATION_MODEL || "gemini-2.5-flash", | |
| summarizationModel: process.env.NEUROLINK_SUMMARIZATION_MODEL || "gemini-2.0-flash-001", |
There was a problem hiding this comment.
yes, 2.5 flash will be used
| NEUROLINK_SUMMARIZATION_PROVIDER=vertex | ||
| NEUROLINK_SUMMARIZATION_MODEL=gemini-2.5-flash | ||
| ``` | ||
|
|
There was a problem hiding this comment.
This environment variable example uses 'gemini-2.5-flash' as the default, which matches the code implementation. However, ensure this is the intended model since much of the documentation uses 'gemini-2.0-flash' variants. If 'gemini-2.5-flash' is correct, consider adding a note explaining why this newer model is preferred over the 2.0 series.
| > **Note:** The default summarization model is set to `gemini-2.5-flash` (not `gemini-2.0-flash` as referenced in some older documentation). This newer model is preferred due to improved performance, accuracy, and support for larger context windows. Ensure your environment supports `gemini-2.5-flash` before deploying. |
There was a problem hiding this comment.
yes, 2.5 flash will be used
| @@ -0,0 +1,828 @@ | |||
| # Token-Based Conversation Memory Implementation Plan | |||
|
|
|||
| **Date:** 2025-11-17 | |||
There was a problem hiding this comment.
The date '2025-11-17' appears to be in the future (the current date is November 2025, so this would be November 17, 2025). If this document was created earlier, this date might be incorrect. Verify the intended date format and value.
| **Date:** 2025-11-17 | |
| **Date:** 2025-11-01 |
There was a problem hiding this comment.
date is correct
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (5)
memory-bank/TOKEN-BASED-MEMORY-IMPLEMENTATION-PLAN.md (5)
398-422: Token counting and summarization trigger logic is sound, but helper methods are not shown.The flow correctly counts tokens from the pointer onwards and triggers summarization when threshold is exceeded. However, three critical methods are referenced but not defined:
_findMessageIndex(),_findSplitIndexByTokens(), and_generateSummary().Include method signatures and pseudocode/logic for these in the implementation plan, or defer their definition to follow-up issues.
Would you like me to draft method signatures and pseudocode for these three helpers?
544-584: Configuration schema is comprehensive, but environment variable naming lacks consistency.The config schema mixes two naming conventions:
NEUROLINK_TOKEN_THRESHOLD(new style: explicit noun)NEUROLINK_MEMORY_MAX_TURNS_PER_SESSION(old style: prefixed with subsystem)For consistency, recommend:
- New vars:
NEUROLINK_MEMORY_TOKEN_THRESHOLDandNEUROLINK_MEMORY_SESSION_INACTIVITY_DAYS- Or document the naming rationale for reviewers.
Also, line 575 reads
DEFAULT_TOKEN_THRESHOLDfrom env but the constant is defined on line 564. Clarify: is the env var meant to override the constant, or does the constant serve as a fallback?
697-771: Performance projections are realistic but presented as estimates; memory overhead may be underestimated.The latency breakdown (lines 702–717) provides useful context, but the 1–200ms validation range and 2–400ms total blocking time are highly dependent on token counter provider choice and network conditions. Clearly label these as best/average/worst-case scenarios rather than guarantees.
The 80% LLM payload reduction (lines 765–768) is a helpful hypothetical, but note that this assumes a specific message distribution (100 messages → 20 recent). Real-world effectiveness depends on conversation patterns.
Memory overhead underestimated: Line 770 claims "~100 bytes per session," but the new fields include:
summarizedUpToMessageId: 21 bytes (nanoid)tokenThreshold: 4–8 byteslastTokenCount: 4–8 byteslastCountedAt: 8 bytes- Per message:
id(21 bytes) +metadataobject (40–100 bytes minimum)A session with 100 messages could see 4–6 KB additional overhead, not 100 bytes. Revise the estimate.
774-802: Environment variables are well-defined, but naming convention needs consistency check.New variables are clearly documented (lines 778–782), and deprecated vars are marked for backward compatibility (lines 786–791). However, review the naming convention consistency across all config variables:
- New:
NEUROLINK_TOKEN_THRESHOLD(no subsystem prefix)- Old:
NEUROLINK_MEMORY_MAX_TURNS_PER_SESSION(subsystem-prefixed)Recommend either (a) prefixing all new vars with
_MEMORY_, or (b) documenting why the new vars omit the subsystem prefix. This prevents future confusion when developers add more environment variables.
806-827: Success criteria are well-defined and testable, but one criterion may be unachievable with certain token counters.Criteria 1–3, 5–6 are clearly measurable and map to the original problem statements. However:
Criterion 4 (Line 811): "<200ms overhead for token counting (with caching)" — This assumes cache hits or client-side token counters. API-based providers (Google, Anthropic, Bedrock) without cache hits can take 100–200ms per call. On first interaction or after cache expiration, this criterion may not be met.
Recommend revising to: "<200ms overhead for token counting on cache hits; ≤2s for cache misses, with automatic fallback to estimation to prevent exceeding context window."
Also, the related documentation section (lines 817–823) references external docs (
/docs/CONVERSATION-MEMORY.md, etc.) that may need updates or deprecation notices when this implementation is merged. Document this dependency for the reviewer/implementer.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
memory-bank/TOKEN-BASED-MEMORY-IMPLEMENTATION-PLAN.md(1 hunks)
🔇 Additional comments (8)
memory-bank/TOKEN-BASED-MEMORY-IMPLEMENTATION-PLAN.md (8)
8-67: Problem statement is clear and well-motivated; code references need verification.The four issues (single long messages, turn-based ignoring tokens, destructive deletion, inflexible per-session config) are well-articulated with concrete scenarios. However, Line 52's reference to "line 184-187" in the existing implementation cannot be verified from the provided context. Once this document is applied, ensure that these references point to actual implementation locations.
72-101: Architecture design is sound and principles are well-aligned.The pointer-based model with non-destructive storage elegantly addresses the problems stated. New fields on SessionMemory are justified and the principles map cleanly to the identified issues.
225-285: Type definitions are well-structured, but ID generation rationale needs documentation for future reference.The ChatMessage and SessionMemory type updates cleanly capture the new pointer-based design. The nanoid choice over UUID is justified (57% smaller, URL-safe), which is good for storage efficiency. This rationale should be preserved in code comments when implemented.
294-342: Message validation truncation logic is pragmatic, but doesn't guarantee total context fit.Truncating messages at 80% of threshold prevents individual message overflow, but doesn't guarantee that after adding the system prompt, tools, and other messages, the total stays within the window. The validation only checks the individual message; later checks (lines 403–414) catch overall overflow.
Document this as an intentional two-stage check: (1) per-message safety, (2) total-context overflow detection. Ensure downstream context-building also includes system prompts in token counts.
430-465: Pointer-based context building is clear and includes good fallback logic.The function correctly handles missing pointers (returns all messages) and logs warnings for corrupted pointers. The slice-based extraction is efficient.
One note: Ensure system prompts are included when counting tokens for this context, as the TokenCountInput interface (line 190–194) accepts an optional
systemPromptparameter.
468-537: Summarization workflow is well-designed, but lacks post-summarization validation.The split-point logic (keep 30% of threshold as recent context) is reasonable. However, after inserting the summary message (line 525), there is no re-validation that the new total token count (summary + recent messages) stays within threshold. If the summary itself is large, the context could exceed the threshold again.
Recommend adding a token recount after summary insertion and a fallback to truncate the summary if needed.
601-693: Edge case mitigations are comprehensive, but several implementation details are unclear.The six edge cases (single message, API failure, summary failure, summary too long, pointer corruption, cache inconsistency) are realistic and well-motivated. However:
Line 625:
estimateTokens(messages)is called as a fallback, but it's unclear if this function exists in the codebase (mentioned as existing TokenUtils on line 155, but not shown).Line 637: When summarization fails, you log an error but don't clarify whether
session.summarizedUpToMessageIdis updated. If not updated, re-attempts will retry the same work. If updated anyway, the pointer advances without a summary. Specify the recovery behavior.Lines 651–658: Summary truncation uses 20% threshold (line 652), but the summarization split logic uses 30% (line 492). Document whether these percentages are intentionally different.
Lines 685–692: Cache invalidation by clearing
lastCountedAtcould race with in-flight token counting requests. If a request completes after invalidation but before the flag is read, stale data may be used. Document whether serialization/locking is needed.
1-827: Overall: Comprehensive design with strong fundamentals; implementation details need clarification before coding.This is a well-structured plan that addresses all four critical problems with the current turn-based system. The pointer-based, non-destructive approach is sound, and the token-counting infrastructure is thoroughly thought through.
Key strengths:
- Clear problem statements with real-world scenarios
- Non-destructive storage design eliminates audit concerns
- Per-session configurability addresses flexibility gaps
- Comprehensive edge-case mitigations
Key gaps before implementation:
- Undefined methods:
_findMessageIndex(),_findSplitIndexByTokens(),_generateSummary()need detailed specs- Anthropic SDK version: Verify @anthropic-ai/sdk ^0.30.0 supports token counting API
- Post-summarization validation: No re-count after summary insertion to ensure context still fits
- Env var naming: Inconsistency between old and new variable conventions
- Threshold percentage rationale: Document why 80%, 30%, and 20% splits are used for different purposes
- Async summarization timeout: No max duration specified for background summarization tasks
Recommendations:
- Create a follow-up issue to detail the three helper methods before implementation begins
- Add a security/verification step to confirm Anthropic SDK version capability
- Add a post-summarization re-validation step to the workflow
- Document the threshold percentage choices with constants in code
This is solid architectural work; these gaps are normal for a design phase and should be resolved during implementation planning.
| ## Token Counting Infrastructure | ||
|
|
||
| ### Provider-Specific Implementations | ||
|
|
||
| | Provider | Method | Accuracy | Latency | Status | | ||
| |----------|--------|----------|---------|--------| | ||
| | OpenAI | js-tiktoken (client) | 100% | ~1-5ms | ✅ Available | | ||
| | Google Gemini | SDK countTokens() | 100% | ~100-150ms | ✅ Available | | ||
| | Anthropic | Count Tokens API | 100% | ~200ms | ❌ Need SDK | | ||
| | AWS Bedrock | CountTokensCommand | 100% | ~150-200ms | ✅ Available | | ||
| | Azure OpenAI | js-tiktoken (client) | 100% | ~1-5ms | ✅ Available | | ||
| | Mistral | tiktoken approx | ~90% | ~1-5ms | ✅ Available | | ||
| | Others | Character estimation | ~60% | <1ms | ✅ Fallback | | ||
|
|
||
| ### Performance Characteristics | ||
|
|
||
| **Token Counting Latency** (per conversation turn): | ||
|
|
||
| ```typescript | ||
| // Best case: Client-side counting (OpenAI, Azure) | ||
| await countTokens(messages) // ~1-5ms ✅ Acceptable | ||
|
|
||
| // Average case: API call with caching (Google, Anthropic) | ||
| await countTokens(messages) // ~100-200ms ⚠️ Noticeable | ||
| // With 5-min cache: ~0.1ms ✅ Negligible | ||
|
|
||
| // Worst case: Cache miss + API failure | ||
| await countTokens(messages) // Fallback to estimation ~1ms ✅ Safe | ||
| ``` | ||
|
|
||
| **Summarization Latency**: | ||
| ```typescript | ||
| // When threshold exceeded | ||
| await summarize(session) // ~2-5 seconds ⚠️ User waits | ||
|
|
||
| // Mitigation: Async summarization (don't block response) | ||
| ``` | ||
|
|
||
| ### Architecture | ||
|
|
||
| ``` | ||
| src/lib/services/tokenCounting/ | ||
| ├── index.ts # Main export | ||
| ├── types.ts # TokenCounter interface | ||
| ├── tokenCounterFactory.ts # Provider selection | ||
| ├── counters/ | ||
| │ ├── openai.ts # js-tiktoken (already available) | ||
| │ ├── google.ts # Google SDK (already available) | ||
| │ ├── anthropic.ts # Anthropic SDK (need to add) | ||
| │ ├── bedrock.ts # AWS SDK (already available) | ||
| │ ├── mistral.ts # tiktoken approximation | ||
| │ ├── estimation.ts # Fallback (use existing TokenUtils) | ||
| │ └── index.ts | ||
| └── utils/ | ||
| ├── messageNormalizer.ts # Convert to provider format | ||
| ├── cache.ts # 5-min TTL caching | ||
| └── index.ts | ||
| ``` | ||
|
|
||
| ### Dependencies | ||
|
|
||
| ```json | ||
| { | ||
| "dependencies": { | ||
| "@anthropic-ai/sdk": "^0.30.0", // NEW - for token counting | ||
| "nanoid": "^5.0.0" // NEW - for message IDs | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| **Already Available:** | ||
| - ✅ `js-tiktoken@1.0.21` (via langsmith) | ||
| - ✅ `@google/generative-ai@0.24.1` | ||
| - ✅ `@aws-sdk/client-bedrock-runtime` | ||
|
|
||
| ### Token Counter Interface | ||
|
|
||
| ```typescript | ||
| // src/lib/services/tokenCounting/types.ts | ||
| export interface TokenCounter { | ||
| countTokens(input: TokenCountInput): Promise<TokenCountResult>; | ||
| supportsAsync(): boolean; | ||
| supportedProviders(): AIProviderName[]; | ||
| } | ||
|
|
||
| export interface TokenCountInput { | ||
| messages: CoreMessage[]; | ||
| model: string; | ||
| systemPrompt?: string; | ||
| tools?: Tool[]; | ||
| } | ||
|
|
||
| export interface TokenCountResult { | ||
| inputTokens: number; | ||
| estimatedOutputTokens?: number; | ||
| method: 'api' | 'client' | 'estimation'; | ||
| accuracy: 'high' | 'medium' | 'low'; | ||
| cached: boolean; | ||
| latency: number; // NEW: Track performance | ||
| } | ||
| ``` | ||
|
|
||
| ### Caching Strategy | ||
|
|
||
| ```typescript | ||
| // Cache key: provider + model + messages hash | ||
| const cacheKey = `${provider}:${model}:${hashMessages(messages)}`; | ||
|
|
||
| // TTL: 5 minutes (messages unlikely to change) | ||
| const cachedCount = await cache.get(cacheKey); | ||
| if (cachedCount) return cachedCount; | ||
|
|
||
| const startTime = Date.now(); | ||
| const count = await countTokens(messages); | ||
| count.latency = Date.now() - startTime; | ||
|
|
||
| await cache.set(cacheKey, count, { ttl: 300 }); | ||
| ``` |
There was a problem hiding this comment.
🧩 Analysis chain
Token counting infrastructure is comprehensive, but requires version and implementation verification.
The provider matrix and architecture are well-thought-out. However, several details need validation:
- Line 168:
@anthropic-ai/sdk ^0.30.0— verify this version exists and supports token counting API. The current Anthropic SDK may not yet expose this capability. - Line 210:
hashMessages()function is referenced but not defined. Specify the hashing algorithm (e.g., SHA-256 on message serialization) to ensure cache key collisions don't occur. - Line 159:
messageNormalizer.tsis listed but lacks detail on how provider-specific message format differences are handled (e.g., Anthropic vs. OpenAI message structures). - Lines 212–214: Cache invalidation strategy doesn't address what happens when session messages are modified after a cache entry is created. Is the cache invalidated on any message change?
🌐 Web query:
Does Anthropic SDK @anthropic-ai/sdk ^0.30.0 include a token counting API?
💡 Result:
Short answer: No separate "token counting API" in @anthropic-ai/sdk ^0.30.0 — but the SDK returns token usage per request via the response. See the usage property (input_tokens/output_tokens) on message responses. [1][2]
Sources:
- @anthropic-ai/sdk docs / npm package (shows usage output_tokens/input_tokens). [1]
- GitHub / changelog for v0.30.0 (SDK features; no dedicated token-counting endpoint). [2]
Anthropic SDK architecture requires significant revision — token counting unavailable as pre-request API.
The web search confirms a critical architectural flaw: @anthropic-ai/sdk ^0.30.0 returns token usage via the response property (input_tokens/output_tokens), not via a standalone pre-request token counting endpoint. This fundamentally differs from your design assumption.
Key issues:
-
Line 112–113: "Anthropic | Count Tokens API" cannot be implemented as designed. Anthropic only reports token counts after making a request, unlike OpenAI (client-side js-tiktoken) or Google (countTokens method). The architecture must be revised to defer Anthropic token counting or use post-request counts.
-
Line 210:
hashMessages()function is referenced but not defined. Specify the hashing algorithm (e.g., SHA-256) to ensure cache keys are collision-free. -
Line 159:
messageNormalizer.tslacks implementation detail on provider-specific message format handling. -
Lines 212–214: Cache invalidation strategy is unclear. Define behavior when session messages are modified after cache entry creation.
🤖 Prompt for AI Agents
memory-bank/TOKEN-BASED-MEMORY-IMPLEMENTATION-PLAN.md lines 112–113, 159, 210,
212–214: Anthropic cannot provide a pre-request token-count API so replace the
"Count Tokens API" entry with "post-request token reporting" and mark the
provider as supported only via response parsing or by estimating tokens
pre-request; implement the anthropic counter to read input_tokens/output_tokens
from responses and fall back to estimation when no recent response exists.
Implement hashMessages() using a deterministic serialization (stable JSON) +
SHA-256 and state that the serializer must normalize message order, IDs, and
whitespace before hashing. Flesh out messageNormalizer.ts to convert CoreMessage
to each provider’s required shape (field mapping, role/name normalization,
trimming, and special token handling) with unit tests for
OpenAI/Google/Anthropic. Clarify cache invalidation: include messages length and
lastEditedTimestamp (or a monotonic session version) in the cache key and state
that any edit/new message must bump the session version or invalidate the cache
immediately; keep the 5-minute TTL as a safety fallback.
|
|
||
| ### Provider-Specific Implementations | ||
|
|
||
| | Provider | Method | Accuracy | Latency | Status | |
There was a problem hiding this comment.
@sahiltyagiii add for all provders we support. vertex litellm and others
|
|
||
| ```typescript | ||
| // src/lib/services/tokenCounting/types.ts | ||
| export interface TokenCounter { |
There was a problem hiding this comment.
No interface, only types and that too in the central locations where all other types are defined.
There was a problem hiding this comment.
okay bhaiya, updated the doc to use types, and add them to centeralized location
| @@ -0,0 +1,828 @@ | |||
| # Token-Based Conversation Memory Implementation Plan | |||
There was a problem hiding this comment.
Approach needs to mention how it will be enabled by default and what will be the configuration for default. it should be 80% context memory per model being used.
There was a problem hiding this comment.
mentioned default config, also will use getProviderTokenLimit to get the model's token limit and will use a 0.8x multiplier for it
updated the doc accordingly
| @@ -0,0 +1,828 @@ | |||
| # Token-Based Conversation Memory Implementation Plan | |||
|
|
|||
There was a problem hiding this comment.
We need to add future plans of using multi-modality for summarization of the messages.
There was a problem hiding this comment.
let's discuss this bhaiya
| @@ -0,0 +1,828 @@ | |||
| # Token-Based Conversation Memory Implementation Plan | |||
|
|
|||
| **Date:** 2025-11-17 | |||
There was a problem hiding this comment.
Another requirement is how will the implementation work? How do you make sure LLMs don't overflow the token coin text window?
Another requirement would be that for us to be able to identify 80 percent token window fill and keep summarizing till we are able to answer the query, it can be slow but this can be something that we will have to do in the future.
There was a problem hiding this comment.
for this we will check the token count of message, and if it exceeds we can iteratively summarize it or set summarization prompt to strictly keep response in required tokens. If the message is too big for the summarizer LLM we can break it in chunks to summarize to prevent failing in worst cases.
updated the doc for it
92b28ee to
24eb611
Compare
24eb611 to
b926b79
Compare
There was a problem hiding this comment.
Pull Request Overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Pull Request
Description
Type of Change
Changes Made
AI Provider Impact
Component Impact
Performance Impact
Checklist
Summary by CodeRabbit