Skip to content

doc(memory): add documentation for token based memory and summarization#236

Open
sahiltyagiii wants to merge 1 commit into
juspay:releasefrom
sahiltyagiii:doc-for-token-based-memory
Open

doc(memory): add documentation for token based memory and summarization#236
sahiltyagiii wants to merge 1 commit into
juspay:releasefrom
sahiltyagiii:doc-for-token-based-memory

Conversation

@sahiltyagiii

@sahiltyagiii sahiltyagiii commented Nov 17, 2025

Copy link
Copy Markdown
Contributor

Pull Request

Description

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📚 Documentation update
  • 🧹 Code refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test coverage improvement
  • 🔧 Build/CI configuration change

Changes Made

  • Documentation added

AI Provider Impact

  • OpenAI
  • Anthropic
  • Google AI/Vertex
  • AWS Bedrock
  • Azure OpenAI
  • Hugging Face
  • Ollama
  • Mistral
  • All providers
  • No provider-specific changes

Component Impact

  • CLI
  • SDK
  • MCP Integration
  • Streaming
  • Tool Calling
  • Configuration
  • Documentation
  • Tests

Performance Impact

  • No performance impact
  • Performance improvement
  • Minor performance impact (acceptable)
  • Significant performance impact (needs discussion)

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published

Summary by CodeRabbit

  • New Features
    • Introduced token-based conversation memory with per-session thresholds (default 50k tokens)
    • Automatic background summarization of conversations when memory limits are reached
    • Per-session memory configuration customization for different use cases
    • Complete conversation history preservation with improved context tracking

Copilot AI review requested due to automatic review settings November 17, 2025 11:20
@coderabbitai

coderabbitai Bot commented Nov 17, 2025

Copy link
Copy Markdown

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This pull request introduces a comprehensive design plan for token-based conversation memory management. It replaces turn-count logic with token thresholds, implements non-destructive pointer-based context tracking, defines token counting infrastructure with multiple providers, and specifies background summarization workflows with edge-case mitigations.

Changes

Cohort / File(s) Summary
Design & Planning Documentation
memory-bank/TOKEN-BASED-MEMORY-IMPLEMENTATION-PLAN.md
Comprehensive design specification outlining token-based memory architecture, including token thresholds, pointer-based context model, summarization workflows, edge-case handling, performance considerations, and directory layout proposals.
Core Conversation Types
src/lib/types/conversation.ts
Extends ChatMessage and SessionMemory types with fields for IDs (id), summarization metadata (isSummary, summarizesFrom, summarizesTo), message tracking (truncated, originalTokens), and session-level token management (summarizedUpToMessageId, tokenThreshold, lastTokenCount, lastCountedAt).
Message ID Utilities
src/lib/utils/messageId.ts
Adds generateMessageId() and generateSummaryId() functions for nanoid-based URL-safe ID generation with prefix convention (summary-).
Token Counting Service Infrastructure
src/lib/services/tokenCounting/types.ts, src/lib/services/tokenCounting/*
Defines TokenCounter interface with countTokens(), supportsAsync(), and supportedProviders() methods; specifies TokenCountInput and TokenCountResult types; establishes provider-based architecture with caching layer (5-min TTL) for latency optimization.
Configuration & Defaults
src/lib/types/configuration.ts, src/lib/config/conversationMemory.ts
Introduces ConversationMemoryConfig with tokenThreshold, summarizationProvider, summarizationModel fields; establishes default values (DEFAULT_TOKEN_THRESHOLD, MAX_SINGLE_MESSAGE_TOKENS, DEFAULT_SESSION_INACTIVITY_DAYS); supports per-session overrides.
Supporting Utilities
src/lib/utils/*, src/lib/types/*
Adds caching utilities, message normalization helpers, pointer/context utilities, and validation patterns for non-destructive memory operations.

Sequence Diagrams

sequenceDiagram
    participant App as Application
    participant Memory as Session Memory
    participant TokenSvc as Token Counter
    participant Summarizer as Summarization Service
    participant Cache as Cache (5min TTL)

    rect rgb(200, 220, 240)
    note over App,Cache: Background Summarization Flow
    App->>Memory: Check token count
    Memory->>TokenSvc: countTokens(messages, model)
    TokenSvc->>Cache: Check cached count
    alt Cache hit
        Cache-->>TokenSvc: Return cached result
    else Cache miss
        TokenSvc->>TokenSvc: Call provider (API or client)
        TokenSvc->>Cache: Store result with TTL
    end
    TokenSvc-->>Memory: TokenCountResult
    Memory->>Memory: threshold exceeded?
    alt Yes
        Memory->>Summarizer: Split messages & generate summary
        Summarizer-->>Memory: Summary message (isSummary=true)
        Memory->>Memory: Update summarizedUpToMessageId pointer
        Memory->>Memory: Insert summary in message list
    end
    end
Loading
sequenceDiagram
    participant App as Application
    participant Memory as Session Memory
    participant LLM as LLM Provider

    rect rgb(220, 240, 200)
    note over App,LLM: Context Building with Pointers
    App->>Memory: Build context for inference
    Memory->>Memory: Get summarizedUpToMessageId pointer
    alt Pointer exists
        Memory->>Memory: Fetch summary message (isSummary=true)
        Memory->>Memory: Collect messages after pointer
    else No pointer
        Memory->>Memory: Use all messages
    end
    Memory->>Memory: Validate context within tokenThreshold
    Memory-->>App: Complete context list
    App->>LLM: Send (system prompt + context + new message)
    LLM-->>App: Response
    App->>Memory: Append assistant message + user message
    Memory->>Memory: Update lastTokenCount, lastCountedAt
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Design complexity: Document proposes significant architectural shift from turn-count to token-based tracking; requires understanding of token counting providers, summarization triggers, and pointer-based memory model.
  • Scope & integration: Changes span multiple subsystems (types, utilities, services, configuration) with interdependencies; backward compatibility concerns with existing sessions noted.
  • Areas requiring extra attention:
    • Pointer corruption edge-case and recovery patterns (potential data inconsistency risks)
    • API failure handling in token counting and summarization workflows
    • Cache invalidation logic with 5-minute TTL implications
    • Per-session override mechanism interaction with global thresholds
    • Single large message truncation strategy and its impact on context quality

Possibly related PRs

Suggested reviewers

  • murdore

Poem

🐰 Tokens flow like river streams,
Pointers mark where memory gleams,
Summaries bloom, no loss, no waste,
Context preserved, forever traced! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the primary change: adding documentation for token-based memory and summarization. It is concise, clear, and follows conventional commit format.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds comprehensive documentation for implementing a token-based memory system as an improvement over the current turn-based conversation memory approach. The documentation outlines critical issues with the existing implementation and proposes a detailed solution using token counting, pointer-based context management, and non-destructive message storage.

Key Changes

  • Detailed problem statement identifying critical failures in the current turn-based memory system
  • Complete architecture design for token-based memory with provider-specific token counting implementations
  • Code examples demonstrating message validation, pointer-based context building, and summarization processes

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

enableSummarization: process.env.NEUROLINK_SUMMARIZATION_ENABLED === "true",
tokenThreshold: Number(process.env.NEUROLINK_TOKEN_THRESHOLD) || DEFAULT_TOKEN_THRESHOLD,
summarizationProvider: (process.env.NEUROLINK_SUMMARIZATION_PROVIDER as AIProviderName) || "vertex",
summarizationModel: process.env.NEUROLINK_SUMMARIZATION_MODEL || "gemini-2.5-flash",

Copilot AI Nov 17, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default model 'gemini-2.5-flash' is inconsistent with the codebase. The actual default in src/lib/config/conversationMemory.ts uses 'gemini-2.5-flash', but many documentation files reference 'gemini-2.0-flash' models, and the model 'gemini-2.0-flash' (without version suffix) doesn't appear to exist in the codebase. Consider using 'gemini-2.0-flash-001' or clarifying if 'gemini-2.5-flash' is the intended default.

Suggested change
summarizationModel: process.env.NEUROLINK_SUMMARIZATION_MODEL || "gemini-2.5-flash",
summarizationModel: process.env.NEUROLINK_SUMMARIZATION_MODEL || "gemini-2.0-flash-001",

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, 2.5 flash will be used

NEUROLINK_SUMMARIZATION_PROVIDER=vertex
NEUROLINK_SUMMARIZATION_MODEL=gemini-2.5-flash
```

Copilot AI Nov 17, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This environment variable example uses 'gemini-2.5-flash' as the default, which matches the code implementation. However, ensure this is the intended model since much of the documentation uses 'gemini-2.0-flash' variants. If 'gemini-2.5-flash' is correct, consider adding a note explaining why this newer model is preferred over the 2.0 series.

Suggested change
> **Note:** The default summarization model is set to `gemini-2.5-flash` (not `gemini-2.0-flash` as referenced in some older documentation). This newer model is preferred due to improved performance, accuracy, and support for larger context windows. Ensure your environment supports `gemini-2.5-flash` before deploying.

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, 2.5 flash will be used

@@ -0,0 +1,828 @@
# Token-Based Conversation Memory Implementation Plan

**Date:** 2025-11-17

Copilot AI Nov 17, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The date '2025-11-17' appears to be in the future (the current date is November 2025, so this would be November 17, 2025). If this document was created earlier, this date might be incorrect. Verify the intended date format and value.

Suggested change
**Date:** 2025-11-17
**Date:** 2025-11-01

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

date is correct

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
memory-bank/TOKEN-BASED-MEMORY-IMPLEMENTATION-PLAN.md (5)

398-422: Token counting and summarization trigger logic is sound, but helper methods are not shown.

The flow correctly counts tokens from the pointer onwards and triggers summarization when threshold is exceeded. However, three critical methods are referenced but not defined: _findMessageIndex(), _findSplitIndexByTokens(), and _generateSummary().

Include method signatures and pseudocode/logic for these in the implementation plan, or defer their definition to follow-up issues.

Would you like me to draft method signatures and pseudocode for these three helpers?


544-584: Configuration schema is comprehensive, but environment variable naming lacks consistency.

The config schema mixes two naming conventions:

  • NEUROLINK_TOKEN_THRESHOLD (new style: explicit noun)
  • NEUROLINK_MEMORY_MAX_TURNS_PER_SESSION (old style: prefixed with subsystem)

For consistency, recommend:

  • New vars: NEUROLINK_MEMORY_TOKEN_THRESHOLD and NEUROLINK_MEMORY_SESSION_INACTIVITY_DAYS
  • Or document the naming rationale for reviewers.

Also, line 575 reads DEFAULT_TOKEN_THRESHOLD from env but the constant is defined on line 564. Clarify: is the env var meant to override the constant, or does the constant serve as a fallback?


697-771: Performance projections are realistic but presented as estimates; memory overhead may be underestimated.

The latency breakdown (lines 702–717) provides useful context, but the 1–200ms validation range and 2–400ms total blocking time are highly dependent on token counter provider choice and network conditions. Clearly label these as best/average/worst-case scenarios rather than guarantees.

The 80% LLM payload reduction (lines 765–768) is a helpful hypothetical, but note that this assumes a specific message distribution (100 messages → 20 recent). Real-world effectiveness depends on conversation patterns.

Memory overhead underestimated: Line 770 claims "~100 bytes per session," but the new fields include:

  • summarizedUpToMessageId: 21 bytes (nanoid)
  • tokenThreshold: 4–8 bytes
  • lastTokenCount: 4–8 bytes
  • lastCountedAt: 8 bytes
  • Per message: id (21 bytes) + metadata object (40–100 bytes minimum)

A session with 100 messages could see 4–6 KB additional overhead, not 100 bytes. Revise the estimate.


774-802: Environment variables are well-defined, but naming convention needs consistency check.

New variables are clearly documented (lines 778–782), and deprecated vars are marked for backward compatibility (lines 786–791). However, review the naming convention consistency across all config variables:

  • New: NEUROLINK_TOKEN_THRESHOLD (no subsystem prefix)
  • Old: NEUROLINK_MEMORY_MAX_TURNS_PER_SESSION (subsystem-prefixed)

Recommend either (a) prefixing all new vars with _MEMORY_, or (b) documenting why the new vars omit the subsystem prefix. This prevents future confusion when developers add more environment variables.


806-827: Success criteria are well-defined and testable, but one criterion may be unachievable with certain token counters.

Criteria 1–3, 5–6 are clearly measurable and map to the original problem statements. However:

Criterion 4 (Line 811): "<200ms overhead for token counting (with caching)" — This assumes cache hits or client-side token counters. API-based providers (Google, Anthropic, Bedrock) without cache hits can take 100–200ms per call. On first interaction or after cache expiration, this criterion may not be met.

Recommend revising to: "<200ms overhead for token counting on cache hits; ≤2s for cache misses, with automatic fallback to estimation to prevent exceeding context window."

Also, the related documentation section (lines 817–823) references external docs (/docs/CONVERSATION-MEMORY.md, etc.) that may need updates or deprecation notices when this implementation is merged. Document this dependency for the reviewer/implementer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60c35e0 and 92b28ee.

📒 Files selected for processing (1)
  • memory-bank/TOKEN-BASED-MEMORY-IMPLEMENTATION-PLAN.md (1 hunks)
🔇 Additional comments (8)
memory-bank/TOKEN-BASED-MEMORY-IMPLEMENTATION-PLAN.md (8)

8-67: Problem statement is clear and well-motivated; code references need verification.

The four issues (single long messages, turn-based ignoring tokens, destructive deletion, inflexible per-session config) are well-articulated with concrete scenarios. However, Line 52's reference to "line 184-187" in the existing implementation cannot be verified from the provided context. Once this document is applied, ensure that these references point to actual implementation locations.


72-101: Architecture design is sound and principles are well-aligned.

The pointer-based model with non-destructive storage elegantly addresses the problems stated. New fields on SessionMemory are justified and the principles map cleanly to the identified issues.


225-285: Type definitions are well-structured, but ID generation rationale needs documentation for future reference.

The ChatMessage and SessionMemory type updates cleanly capture the new pointer-based design. The nanoid choice over UUID is justified (57% smaller, URL-safe), which is good for storage efficiency. This rationale should be preserved in code comments when implemented.


294-342: Message validation truncation logic is pragmatic, but doesn't guarantee total context fit.

Truncating messages at 80% of threshold prevents individual message overflow, but doesn't guarantee that after adding the system prompt, tools, and other messages, the total stays within the window. The validation only checks the individual message; later checks (lines 403–414) catch overall overflow.

Document this as an intentional two-stage check: (1) per-message safety, (2) total-context overflow detection. Ensure downstream context-building also includes system prompts in token counts.


430-465: Pointer-based context building is clear and includes good fallback logic.

The function correctly handles missing pointers (returns all messages) and logs warnings for corrupted pointers. The slice-based extraction is efficient.

One note: Ensure system prompts are included when counting tokens for this context, as the TokenCountInput interface (line 190–194) accepts an optional systemPrompt parameter.


468-537: Summarization workflow is well-designed, but lacks post-summarization validation.

The split-point logic (keep 30% of threshold as recent context) is reasonable. However, after inserting the summary message (line 525), there is no re-validation that the new total token count (summary + recent messages) stays within threshold. If the summary itself is large, the context could exceed the threshold again.

Recommend adding a token recount after summary insertion and a fallback to truncate the summary if needed.


601-693: Edge case mitigations are comprehensive, but several implementation details are unclear.

The six edge cases (single message, API failure, summary failure, summary too long, pointer corruption, cache inconsistency) are realistic and well-motivated. However:

  1. Line 625: estimateTokens(messages) is called as a fallback, but it's unclear if this function exists in the codebase (mentioned as existing TokenUtils on line 155, but not shown).

  2. Line 637: When summarization fails, you log an error but don't clarify whether session.summarizedUpToMessageId is updated. If not updated, re-attempts will retry the same work. If updated anyway, the pointer advances without a summary. Specify the recovery behavior.

  3. Lines 651–658: Summary truncation uses 20% threshold (line 652), but the summarization split logic uses 30% (line 492). Document whether these percentages are intentionally different.

  4. Lines 685–692: Cache invalidation by clearing lastCountedAt could race with in-flight token counting requests. If a request completes after invalidation but before the flag is read, stale data may be used. Document whether serialization/locking is needed.


1-827: Overall: Comprehensive design with strong fundamentals; implementation details need clarification before coding.

This is a well-structured plan that addresses all four critical problems with the current turn-based system. The pointer-based, non-destructive approach is sound, and the token-counting infrastructure is thoroughly thought through.

Key strengths:

  • Clear problem statements with real-world scenarios
  • Non-destructive storage design eliminates audit concerns
  • Per-session configurability addresses flexibility gaps
  • Comprehensive edge-case mitigations

Key gaps before implementation:

  1. Undefined methods: _findMessageIndex(), _findSplitIndexByTokens(), _generateSummary() need detailed specs
  2. Anthropic SDK version: Verify @anthropic-ai/sdk ^0.30.0 supports token counting API
  3. Post-summarization validation: No re-count after summary insertion to ensure context still fits
  4. Env var naming: Inconsistency between old and new variable conventions
  5. Threshold percentage rationale: Document why 80%, 30%, and 20% splits are used for different purposes
  6. Async summarization timeout: No max duration specified for background summarization tasks

Recommendations:

  • Create a follow-up issue to detail the three helper methods before implementation begins
  • Add a security/verification step to confirm Anthropic SDK version capability
  • Add a post-summarization re-validation step to the workflow
  • Document the threshold percentage choices with constants in code

This is solid architectural work; these gaps are normal for a design phase and should be resolved during implementation planning.

Comment on lines +104 to +221
## Token Counting Infrastructure

### Provider-Specific Implementations

| Provider | Method | Accuracy | Latency | Status |
|----------|--------|----------|---------|--------|
| OpenAI | js-tiktoken (client) | 100% | ~1-5ms | ✅ Available |
| Google Gemini | SDK countTokens() | 100% | ~100-150ms | ✅ Available |
| Anthropic | Count Tokens API | 100% | ~200ms | ❌ Need SDK |
| AWS Bedrock | CountTokensCommand | 100% | ~150-200ms | ✅ Available |
| Azure OpenAI | js-tiktoken (client) | 100% | ~1-5ms | ✅ Available |
| Mistral | tiktoken approx | ~90% | ~1-5ms | ✅ Available |
| Others | Character estimation | ~60% | <1ms | ✅ Fallback |

### Performance Characteristics

**Token Counting Latency** (per conversation turn):

```typescript
// Best case: Client-side counting (OpenAI, Azure)
await countTokens(messages) // ~1-5ms ✅ Acceptable

// Average case: API call with caching (Google, Anthropic)
await countTokens(messages) // ~100-200ms ⚠️ Noticeable
// With 5-min cache: ~0.1ms ✅ Negligible

// Worst case: Cache miss + API failure
await countTokens(messages) // Fallback to estimation ~1ms ✅ Safe
```

**Summarization Latency**:
```typescript
// When threshold exceeded
await summarize(session) // ~2-5 seconds ⚠️ User waits

// Mitigation: Async summarization (don't block response)
```

### Architecture

```
src/lib/services/tokenCounting/
├── index.ts # Main export
├── types.ts # TokenCounter interface
├── tokenCounterFactory.ts # Provider selection
├── counters/
│ ├── openai.ts # js-tiktoken (already available)
│ ├── google.ts # Google SDK (already available)
│ ├── anthropic.ts # Anthropic SDK (need to add)
│ ├── bedrock.ts # AWS SDK (already available)
│ ├── mistral.ts # tiktoken approximation
│ ├── estimation.ts # Fallback (use existing TokenUtils)
│ └── index.ts
└── utils/
├── messageNormalizer.ts # Convert to provider format
├── cache.ts # 5-min TTL caching
└── index.ts
```

### Dependencies

```json
{
"dependencies": {
"@anthropic-ai/sdk": "^0.30.0", // NEW - for token counting
"nanoid": "^5.0.0" // NEW - for message IDs
}
}
```

**Already Available:**
- ✅ `js-tiktoken@1.0.21` (via langsmith)
- ✅ `@google/generative-ai@0.24.1`
- ✅ `@aws-sdk/client-bedrock-runtime`

### Token Counter Interface

```typescript
// src/lib/services/tokenCounting/types.ts
export interface TokenCounter {
countTokens(input: TokenCountInput): Promise<TokenCountResult>;
supportsAsync(): boolean;
supportedProviders(): AIProviderName[];
}

export interface TokenCountInput {
messages: CoreMessage[];
model: string;
systemPrompt?: string;
tools?: Tool[];
}

export interface TokenCountResult {
inputTokens: number;
estimatedOutputTokens?: number;
method: 'api' | 'client' | 'estimation';
accuracy: 'high' | 'medium' | 'low';
cached: boolean;
latency: number; // NEW: Track performance
}
```

### Caching Strategy

```typescript
// Cache key: provider + model + messages hash
const cacheKey = `${provider}:${model}:${hashMessages(messages)}`;

// TTL: 5 minutes (messages unlikely to change)
const cachedCount = await cache.get(cacheKey);
if (cachedCount) return cachedCount;

const startTime = Date.now();
const count = await countTokens(messages);
count.latency = Date.now() - startTime;

await cache.set(cacheKey, count, { ttl: 300 });
```

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Token counting infrastructure is comprehensive, but requires version and implementation verification.

The provider matrix and architecture are well-thought-out. However, several details need validation:

  1. Line 168: @anthropic-ai/sdk ^0.30.0 — verify this version exists and supports token counting API. The current Anthropic SDK may not yet expose this capability.
  2. Line 210: hashMessages() function is referenced but not defined. Specify the hashing algorithm (e.g., SHA-256 on message serialization) to ensure cache key collisions don't occur.
  3. Line 159: messageNormalizer.ts is listed but lacks detail on how provider-specific message format differences are handled (e.g., Anthropic vs. OpenAI message structures).
  4. Lines 212–214: Cache invalidation strategy doesn't address what happens when session messages are modified after a cache entry is created. Is the cache invalidated on any message change?

🌐 Web query:

Does Anthropic SDK @anthropic-ai/sdk ^0.30.0 include a token counting API?

💡 Result:

Short answer: No separate "token counting API" in @anthropic-ai/sdk ^0.30.0 — but the SDK returns token usage per request via the response. See the usage property (input_tokens/output_tokens) on message responses. [1][2]

Sources:

  • @anthropic-ai/sdk docs / npm package (shows usage output_tokens/input_tokens). [1]
  • GitHub / changelog for v0.30.0 (SDK features; no dedicated token-counting endpoint). [2]

Anthropic SDK architecture requires significant revision — token counting unavailable as pre-request API.

The web search confirms a critical architectural flaw: @anthropic-ai/sdk ^0.30.0 returns token usage via the response property (input_tokens/output_tokens), not via a standalone pre-request token counting endpoint. This fundamentally differs from your design assumption.

Key issues:

  1. Line 112–113: "Anthropic | Count Tokens API" cannot be implemented as designed. Anthropic only reports token counts after making a request, unlike OpenAI (client-side js-tiktoken) or Google (countTokens method). The architecture must be revised to defer Anthropic token counting or use post-request counts.

  2. Line 210: hashMessages() function is referenced but not defined. Specify the hashing algorithm (e.g., SHA-256) to ensure cache keys are collision-free.

  3. Line 159: messageNormalizer.ts lacks implementation detail on provider-specific message format handling.

  4. Lines 212–214: Cache invalidation strategy is unclear. Define behavior when session messages are modified after cache entry creation.

🤖 Prompt for AI Agents
memory-bank/TOKEN-BASED-MEMORY-IMPLEMENTATION-PLAN.md lines 112–113, 159, 210,
212–214: Anthropic cannot provide a pre-request token-count API so replace the
"Count Tokens API" entry with "post-request token reporting" and mark the
provider as supported only via response parsing or by estimating tokens
pre-request; implement the anthropic counter to read input_tokens/output_tokens
from responses and fall back to estimation when no recent response exists.
Implement hashMessages() using a deterministic serialization (stable JSON) +
SHA-256 and state that the serializer must normalize message order, IDs, and
whitespace before hashing. Flesh out messageNormalizer.ts to convert CoreMessage
to each provider’s required shape (field mapping, role/name normalization,
trimming, and special token handling) with unit tests for
OpenAI/Google/Anthropic. Clarify cache invalidation: include messages length and
lastEditedTimestamp (or a monotonic session version) in the cache key and state
that any edit/new message must bump the session version or invalidate the cache
immediately; keep the 5-minute TTL as a safety fallback.


### Provider-Specific Implementations

| Provider | Method | Accuracy | Latency | Status |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sahiltyagiii add for all provders we support. vertex litellm and others

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@murdore added bhaiya


```typescript
// src/lib/services/tokenCounting/types.ts
export interface TokenCounter {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No interface, only types and that too in the central locations where all other types are defined.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay bhaiya, updated the doc to use types, and add them to centeralized location

@@ -0,0 +1,828 @@
# Token-Based Conversation Memory Implementation Plan

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach needs to mention how it will be enabled by default and what will be the configuration for default. it should be 80% context memory per model being used.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mentioned default config, also will use getProviderTokenLimit to get the model's token limit and will use a 0.8x multiplier for it
updated the doc accordingly

@@ -0,0 +1,828 @@
# Token-Based Conversation Memory Implementation Plan

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add future plans of using multi-modality for summarization of the messages.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's discuss this bhaiya

@@ -0,0 +1,828 @@
# Token-Based Conversation Memory Implementation Plan

**Date:** 2025-11-17

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another requirement is how will the implementation work? How do you make sure LLMs don't overflow the token coin text window?

Another requirement would be that for us to be able to identify 80 percent token window fill and keep summarizing till we are able to answer the query, it can be slow but this can be something that we will have to do in the future.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this we will check the token count of message, and if it exceeds we can iteratively summarize it or set summarization prompt to strictly keep response in required tokens. If the message is too big for the summarizer LLM we can break it in chunks to summarize to prevent failing in worst cases.

updated the doc for it

@sahiltyagiii sahiltyagiii force-pushed the doc-for-token-based-memory branch from 92b28ee to 24eb611 Compare November 18, 2025 08:15
@sahiltyagiii sahiltyagiii requested a review from murdore November 18, 2025 08:25
Copilot AI review requested due to automatic review settings November 19, 2025 07:34
@sahiltyagiii sahiltyagiii force-pushed the doc-for-token-based-memory branch from 24eb611 to b926b79 Compare November 19, 2025 07:34

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants