Summary
When the application uses streaming mode, TokenFirewall records outputTokens: 0 because it never reads the streamed response chunks, so no output tokens are counted. Every streamed request is therefore logged with only its input cost, which significantly undercounts actual cost.
Current Behavior
For SSE streaming responses, inputTokens is recorded correctly from the request body, but outputTokens is always 0 because the middleware passes the streamed response body through without reading it.
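Reading the streamed body is only safe if the client still gets every byte. A web ReadableStream can be consumed by one reader at a time, so a fix has to split it first. The sketch below uses the standard ReadableStream.prototype.tee() to give one branch to the client and drain the other for metering. The names splitForMetering and drainToText are hypothetical and are not TokenFirewall APIs.

```typescript
// Hypothetical helper (not a TokenFirewall API): split a streamed response
// body so metering can read it without starving the client.
function splitForMetering(body: ReadableStream<Uint8Array>): {
  clientBranch: ReadableStream<Uint8Array>;
  meterBranch: ReadableStream<Uint8Array>;
} {
  // tee() returns two branches that each receive every chunk of the source.
  const [clientBranch, meterBranch] = body.tee();
  return { clientBranch, meterBranch };
}

// Hypothetical helper: read a branch to completion and decode it as UTF-8.
async function drainToText(stream: ReadableStream<Uint8Array>): Promise<string> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  let text = '';
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters split across chunks intact.
    text += decoder.decode(value, { stream: true });
  }
  return text + decoder.decode(); // flush any buffered partial character
}
```

tee() buffers chunks for whichever branch is read more slowly, so the metering branch must be read promptly, or cancelled once it is no longer needed, to keep memory bounded.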
Expected Behavior
TokenFirewall should track output tokens for streaming responses either by reading the final usage chunk (when the provider includes it) or by counting the tokens in the accumulated response text.
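Both strategies can be driven from the same pass over the stream. The sketch below folds SSE data payloads in the OpenAI Chat Completions chunk format into accumulated text plus any provider-reported completion_tokens. It assumes payloads have already had their "data: " prefix removed, and meterPayloads and StreamMeter are hypothetical names.

```typescript
// Minimal shape of an OpenAI Chat Completions stream chunk; only the
// fields used for metering are declared.
interface ChatCompletionChunk {
  choices?: Array<{ delta?: { content?: string | null } }>;
  usage?: { prompt_tokens: number; completion_tokens: number } | null;
}

// Hypothetical result type for one metered stream.
interface StreamMeter {
  text: string;                        // concatenated delta content
  reportedOutputTokens: number | null; // from a usage chunk, if one arrived
}

// Hypothetical helper: fold data payloads into text plus reported usage.
function meterPayloads(payloads: string[]): StreamMeter {
  let text = '';
  let reportedOutputTokens: number | null = null;
  for (const payload of payloads) {
    if (payload === '[DONE]') break; // OpenAI's end-of-stream sentinel
    const chunk = JSON.parse(payload) as ChatCompletionChunk;
    // The usage chunk has an empty choices array, so this adds ''.
    text += chunk.choices?.[0]?.delta?.content ?? '';
    // Non-final chunks carry usage: null when include_usage is set.
    if (chunk.usage) reportedOutputTokens = chunk.usage.completion_tokens;
  }
  return { text, reportedOutputTokens };
}
```

When reportedOutputTokens is null, the caller falls back to counting text with the provider's tokenizer; when it is set, the reported figure is used as is.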
Proposed Fix
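```typescript
// `response` must be a tee()'d copy of the provider stream: reading the
// original here would starve the client. parseSSEStream, recordUsage,
// RequestMetadata and ProviderAdapter are the helpers referenced in this issue.
async function handleStreamingResponse(
  response: ReadableStream<Uint8Array>,
  metadata: RequestMetadata,
  adapter: ProviderAdapter
): Promise<void> {
  let outputText = '';
  for await (const chunk of parseSSEStream(response)) {
    // Assuming parseSSEStream yields raw OpenAI chunk JSON, text lives under
    // choices[0].delta; a top-level chunk.delta would always be undefined.
    outputText += chunk.choices?.[0]?.delta?.content ?? '';
    // Prefer provider-reported usage (OpenAI's final chunk when
    // stream_options.include_usage is set) over local counting.
    if (chunk.usage) {
      await recordUsage(metadata, { outputTokens: chunk.usage.completion_tokens });
      return;
    }
  }
  // No usage chunk arrived: fall back to tokenizing the accumulated text.
  const outputTokens = await adapter.countTokens(outputText, metadata.model);
  await recordUsage(metadata, { outputTokens });
}
```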
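The fix depends on parseSSEStream, whose implementation is not shown. One possible sketch follows: an incremental parser that yields the data payload of each complete event, so events split across network chunks are reassembled correctly. SSEDataParser and parseSSEPayloads are hypothetical names; the sketch handles LF and CRLF line endings but not lone CR, and it ignores event, id, retry, and comment lines.

```typescript
// Hypothetical incremental SSE parser: returns each complete event's data.
class SSEDataParser {
  private buffer = '';
  private dataLines: string[] = [];

  // Feed decoded text; returns payloads of events completed by this text.
  push(text: string): string[] {
    this.buffer += text;
    const events: string[] = [];
    let newline: number;
    while ((newline = this.buffer.indexOf('\n')) !== -1) {
      let line = this.buffer.slice(0, newline);
      this.buffer = this.buffer.slice(newline + 1);
      if (line.endsWith('\r')) line = line.slice(0, -1); // CRLF endings
      if (line === '') {
        // A blank line dispatches the event if it carried any data.
        if (this.dataLines.length > 0) {
          events.push(this.dataLines.join('\n'));
          this.dataLines = [];
        }
      } else if (line.startsWith('data:')) {
        const value = line.slice(5);
        // SSE strips a single leading space after the colon.
        this.dataLines.push(value.startsWith(' ') ? value.slice(1) : value);
      }
      // Comments (":...") and other fields are ignored in this sketch.
    }
    return events;
  }
}

// Hypothetical stand-in for parseSSEStream: yields data payloads from bytes.
async function* parseSSEPayloads(
  stream: ReadableStream<Uint8Array>
): AsyncGenerator<string> {
  const reader = stream.getReader();
  const decoder = new TextDecoder();
  const parser = new SSEDataParser();
  try {
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      yield* parser.push(decoder.decode(value, { stream: true }));
    }
    yield* parser.push(decoder.decode());
  } finally {
    // On early exit (e.g. after the usage chunk), cancelling this tee branch
    // stops it buffering; the client branch is unaffected.
    await reader.cancel().catch(() => undefined);
  }
}
```

Each payload can then be passed to JSON.parse (after skipping the [DONE] sentinel) to obtain the chunk objects the fix iterates over.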
For OpenAI, pass stream_options: { include_usage: true } in the request body so that the final SSE chunk before data: [DONE] contains usage data, avoiding the need for manual token counting.
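If TokenFirewall controls the outgoing request, it could set this option itself. A minimal sketch, assuming the middleware can rewrite a JSON request body before forwarding it; withStreamUsage and ChatRequestBody are hypothetical names.

```typescript
// Minimal request-body shape; only the fields this helper touches.
interface ChatRequestBody {
  stream?: boolean;
  stream_options?: { include_usage?: boolean; [key: string]: unknown };
  [key: string]: unknown;
}

// Hypothetical helper: for streaming requests, ensure the final SSE chunk
// carries usage. Returns a new object and leaves the caller's body untouched.
function withStreamUsage(body: ChatRequestBody): ChatRequestBody {
  // Non-streaming responses already include usage in the response body.
  if (body.stream !== true) return body;
  return {
    ...body,
    // Preserve any other stream_options the client set.
    stream_options: { ...body.stream_options, include_usage: true },
  };
}
```

Forcing this option changes what the client receives: OpenAI then sends an extra chunk whose choices array is empty, and every earlier chunk carries usage: null. Clients that index choices[0] unconditionally could break, so the middleware may prefer to enable it only when the client has not opted out, or to drop that chunk from the client branch.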
Acceptance Criteria
outputTokens in the usage log.