[chatbot-demo]: Cerebras llama3.1-8b + split-screen dual streaming UI #17
jonah-berman wants to merge 53 commits into main from
Conversation
- baseURL: openrouter.ai -> api.cerebras.ai/v1
- model: google/gemini-2.5-flash -> gpt-oss-120b
- updated server.js, api/chat.js, api/chat/stream.js, frontend App.jsx

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…value props

- Replace OpenAI icon with actual Cerebras logo (orange arcs + C mark)
- Add OpenAI logo alongside Cerebras logo
- Remove How It Works link
- Add model info subtext (gpt-oss-120b)
- Add 4 Exa value prop bullets in header

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
```diff
- baseURL: "https://openrouter.ai/api/v1",
- apiKey: process.env.OPEN_ROUTER_KEY,
+ baseURL: "https://api.cerebras.ai/v1",
+ apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",
```
🔴 Hardcoded API key exposed in source code (server.js)
The Cerebras API key csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x is hardcoded as a fallback value in the source code. This key will be committed to version control and publicly exposed in the repository. The .gitignore explicitly excludes .env to keep secrets private, and CLAUDE.md instructs developers to "Add API keys to .env", making it clear that secrets should not be in source code. Anyone with access to the repo can use this key to make API calls at the owner's expense.
```diff
- apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",
+ apiKey: process.env.CEREBRAS_API_KEY,
```
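Beyond dropping the fallback, a fail-fast guard makes the misconfiguration obvious at startup instead of surfacing as a 401 later. A minimal sketch (the `requireEnv` helper is hypothetical, not code from this PR; the client construction mirrors the OpenAI-compatible setup in `server.js`):

```javascript
// Sketch: resolve the API key strictly from the environment and fail
// fast, instead of silently falling back to a hardcoded key.
function requireEnv(name) {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage (as in server.js, assuming the OpenAI-compatible client):
//   const client = new OpenAI({
//     baseURL: "https://api.cerebras.ai/v1",
//     apiKey: requireEnv("CEREBRAS_API_KEY"),
//   });
```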
```diff
- baseURL: "https://openrouter.ai/api/v1",
- apiKey: process.env.OPEN_ROUTER_KEY,
+ baseURL: "https://api.cerebras.ai/v1",
+ apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",
```
🔴 Hardcoded API key exposed in source code (api/chat.js)
Same hardcoded Cerebras API key csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x is exposed in api/chat.js:6. This is the Vercel serverless function for the non-streaming chat endpoint. The key should be loaded exclusively from environment variables.
```diff
- apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",
+ apiKey: process.env.CEREBRAS_API_KEY,
```
```diff
- baseURL: "https://openrouter.ai/api/v1",
- apiKey: process.env.OPEN_ROUTER_KEY,
+ baseURL: "https://api.cerebras.ai/v1",
+ apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",
```
🔴 Hardcoded API key exposed in source code (api/chat/stream.js)
Same hardcoded Cerebras API key csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x is exposed in api/chat/stream.js:6. This is the Vercel serverless function for the streaming chat endpoint. The key should be loaded exclusively from environment variables.
```diff
- apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",
+ apiKey: process.env.CEREBRAS_API_KEY,
```
gpt-oss-120b is a reasoning model that by default prepends thinking tokens to content. Setting reasoning_format to 'hidden' drops them from the response, fixing the raw JSON/reasoning text leaking issue. Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
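The change described in this commit can be sketched as a request-body tweak. This is an illustrative helper, not the actual diff; `reasoning_format` is the Cerebras parameter named in the commit, and the surrounding client call is omitted:

```javascript
// Sketch: build the chat-completion request body with the model's
// thinking tokens suppressed, per the commit above.
function buildChatRequest(messages) {
  return {
    model: "gpt-oss-120b",
    messages,
    stream: true,
    // Drop the reasoning/thinking tokens from the response content.
    reasoning_format: "hidden",
  };
}
```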
…artifacts

reasoning_format: hidden does not fully suppress reasoning when the model processes tool results. The model embeds JSON search/cursor objects and internal monologue in the content field. This adds cleanReasoningArtifacts() to find the last reasoning artifact and extract only the clean answer.

Also buffers the final response (after tool calls) instead of streaming it directly, since we need the full content to strip artifacts.

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…g and citation markers
The previous version only stripped JSON artifacts. gpt-oss-120b also outputs:
- Citation markers like {14†L0-L3} and 【1†L1-L4】
- Plain text reasoning lines ("The page could not be opened", "Now compile answer...")
New cleaning approach:
1. Strip citation markers (both curly brace and bracket styles)
2. Strip JSON reasoning artifacts (search_query, cursor, etc.)
3. Strip text reasoning markers ([Results], Search results:)
4. Find where the actual formatted answer starts (bold headings, tables, numbered lists)
5. Return only the clean answer
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
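Step 1 of the cleaning approach above can be sketched as a pair of regexes. These patterns are reconstructed from the examples in the commit message and may not match the PR's actual implementation exactly:

```javascript
// Sketch of the citation-marker stripping step (step 1 above).
// The full cleanReasoningArtifacts() also strips JSON artifacts and
// locates where the formatted answer starts; this shows only the
// marker removal.
function stripCitationMarkers(text) {
  return text
    // Curly-brace style: {14†L0-L3}
    .replace(/\{\d+†L\d+-L\d+\}/g, "")
    // CJK-bracket style: 【1†L1-L4】
    .replace(/【\d+†L\d+-L\d+】/g, "")
    // Collapse double spaces left behind after removal (later commit).
    .replace(/ {2,}/g, " ");
}
```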
- Strip additional citation formats: {6}[7] and bare [1]
- Insert newline before mid-line **Bold to catch concatenated reasoning+answer
- Use regex-based markdown detection instead of line-by-line scanning
- Clean up double spaces after citation removal
- All 3 server files updated with identical logic
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
- Set maxDuration: 120s for both API functions in vercel.json
- Send SSE heartbeat comments every 5s while buffering final response
- Prevents Vercel function timeout during long Cerebras reasoning calls

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
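The heartbeat in this commit can be sketched as follows. The function names are illustrative; `res` is assumed to be an Express-style response with SSE headers already set, and the 5s interval follows the commit message:

```javascript
// SSE comment lines (leading ":") are ignored by EventSource clients
// but keep proxies and the Vercel function from idling out while the
// final response is being buffered.
function sendHeartbeat(res) {
  res.write(": heartbeat\n\n");
}

function startHeartbeat(res, intervalMs = 5000) {
  const timer = setInterval(() => sendHeartbeat(res), intervalMs);
  // Call the returned function once the buffered response is sent.
  return () => clearInterval(timer);
}
```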
…ifact cleaning

- Switch model from gpt-oss-120b (reasoning) to llama3.3-70b (instruction)
- Remove cleanReasoningArtifacts() from all 3 server files
- Remove reasoning_format: hidden from all API calls
- Restore direct streaming for final response (no more buffering)
- Remove maxDuration config (no longer needed without buffering)
- Update frontend model info text

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…non-reasoning model) Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
llama3.1-8b on Cerebras outputs tool calls as content text when streaming instead of structured tool_calls deltas. Fix: use non-streaming for initial call (tool detection), stream only the final response. Also restore maxDuration=60 for Vercel functions. Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…alls field Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
… user query

- Cuts Cerebras API calls from 3-4 to 2 per query (1 per pane)
- No more tool call detection/parsing needed on Exa side
- Search results injected as user message context instead of tool messages
- Switch back to llama3.1-8b (gpt-oss-120b had stricter free-tier rate limits)
- Keeps 429 retry with exponential backoff

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
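The 429 retry kept by this commit can be sketched like this. The attempt count and 500ms base delay are assumptions for illustration; the PR's actual values are not shown in this conversation:

```javascript
// Sketch: retry rate-limited calls with exponential backoff
// (delays of 500ms, 1000ms, ... between attempts).
async function withRetry(fn, maxAttempts = 3, baseDelayMs = 500) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Only retry 429s, and never after the final attempt.
      if (!(err && err.status === 429) || attempt === maxAttempts - 1) {
        throw err;
      }
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
}
```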
…r final response

- OpenRouter generates optimized search queries using full system prompt + tool definition
- Cerebras llama3.1-8b handles final response summarization (fast, no rate limit pressure)
- Fallback to user query if OpenRouter doesn't generate tool calls
- Keeps 429 retry with exponential backoff on Cerebras calls

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…cused summarize prompt, prioritize recent sources Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…sufficient Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…ode on refresh Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…reakdown (tool call/exa/synthesis/total) Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…on-people search Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Closing — the Cerebras demo now lives in
[chatbot-demo]: Cerebras llama3.1-8b + split-screen dual streaming UI
Summary
Replaces OpenRouter (Gemini 2.5 Flash) with the Cerebras inference API using llama3.1-8b across all server-side files and the frontend. Implements a split-screen dual streaming architecture that sends each query to two parallel systems: Cerebras-only (left pane) vs. Cerebras+Exa search (right pane), with an Exa mode dropdown and latency tracking.

Updates since last revision (commit 2be76a7)
UI refinements (App.jsx):

- Both pane headers are now h-10 — previously the right pane header was taller due to the inline mode toggle buttons
- ModeDropdown replaces ModeToggle: the Exa mode selector is now a compact dropdown button (Instant ▾) that opens on click, instead of showing all three buttons inline. Closes on outside click.
- New SourcesBanner component: when Exa search results return (typically in a few hundred ms), sources are shown immediately at the top of the message — with stacked favicons and an expandable source list — rather than lingering on the "Searching..." state until streaming completes. Mirrors the UX pattern from the Exa highlight extension.

Latency tracking improvements (stream.js):

- searchExa() now captures response.requestTime from the Exa SDK (the API's own reported processing time, in seconds → converted to ms). This is sent as exaServerTimeMs in both search_complete and done SSE events. The frontend prefers this over the client-side round-trip measurement, matching how the Exa highlight extension reports latency.
- If the SDK doesn't expose requestTime on the response object, exaServerTimeMs will be null and the frontend falls back to the client-side searchTimeMs. This needs verification with a live query.

Split-screen dual streaming (commit 107fa5a)
Frontend rewrite (App.jsx):

- LatencyBar component: styled after the Exa highlight extension with blue (#0040f0) millisecond values and Exa/Cerebras logos
- Promise.allSettled() — both panes fire simultaneously on query submit

Backend changes (api/chat/stream.js):

- searchExa() and searchMultiple() now accept a searchType parameter; default changed from "auto" to "instant" with highlights maxCharacters 4000
- exaMode request body parameter: "instant" → type instant, "fast" → type keyword, "auto" → type auto
- assistant prefix cleaning; returns totalMs in done event
- initialCallMs (first Cerebras call for tool detection), finalCallMs (final Cerebras call after search), searchTimeMs (client-side round-trip), and exaServerTimeMs (Exa API's own timing) included in SSE done event

Earlier changes (Cerebras migration + robustness)
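The dual-pane fan-out can be sketched as below. The `runBothPanes`/`runPane` names and payload shape are illustrative, not the actual App.jsx code; the key point is that Promise.allSettled lets one pane fail without blocking the other:

```javascript
// Sketch: fire both panes simultaneously on query submit.
// Each settled result carries { status, value } or { status, reason },
// so one pane erroring never rejects the other.
async function runBothPanes(query, runPane) {
  const [plain, withExa] = await Promise.allSettled([
    runPane({ query, useExa: false }), // left pane: Cerebras only
    runPane({ query, useExa: true }),  // right pane: Cerebras + Exa
  ]);
  return { plain, withExa };
}
```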
Backend changes:

- baseURL: openrouter.ai/api/v1 → api.cerebras.ai/v1 (3 files: server.js, api/chat.js, api/chat/stream.js)
- apiKey: OPEN_ROUTER_KEY env var → CEREBRAS_API_KEY env var with hardcoded fallback (per requester's instruction — free-tier key)
- DEFAULT_MODEL: google/gemini-2.5-flash → llama3.1-8b
- Tool call extraction (tryExtractToolCallFromContent): two-layer parsing for when the model outputs tool calls as raw JSON in the content field:
  - JSON.parse — handles well-formed JSON in multiple envelope formats
  - Regex fallback: llama3.1-8b outputs unescaped inner quotes; extracts "query" fields via regex
- Cleaning of assistant role text that leaks into the final response after search results
- SSE: ok initial comment and 3s heartbeat interval to prevent connection stalls
- maxDuration: 120 for both API functions in vercel.json

Vercel environment:
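The two-layer parsing can be sketched as follows. The function name follows the PR's tryExtractToolCallFromContent, but the envelope shapes and regex are reconstructions, not the actual diff:

```javascript
// Sketch: extract a tool-call query from model content text.
function tryExtractToolCallFromContent(content) {
  // Layer 1: well-formed JSON, in a couple of envelope formats.
  try {
    const parsed = JSON.parse(content);
    if (parsed && typeof parsed.query === "string") {
      return { query: parsed.query };
    }
    if (parsed && parsed.arguments && typeof parsed.arguments.query === "string") {
      return { query: parsed.arguments.query };
    }
  } catch (err) {
    // Malformed JSON (e.g. unescaped inner quotes): fall through.
  }
  // Layer 2: regex extraction of the "query" field.
  const m = content.match(/"query"\s*:\s*"([^"]*)"/);
  return m ? { query: m[1] } : null;
}
```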
- EXA_API_KEY added to Vercel preview environment variables
- CEREBRAS_API_KEY added to Vercel preview environment variables

Review & Testing Checklist for Human
- If exaServerTimeMs is properly captured from the Exa SDK, the "Exa: XXXms" value should be noticeably lower than the old client-side round-trip timing (typically ~100-300ms faster). If it falls back to client-side timing, this means response.requestTime isn't exposed by the SDK and needs investigation.
Notes

- exaMode mapping: instant → Exa type instant, fast → Exa type keyword, auto → Exa type auto. The "fast" label maps to keyword search, which is Exa's legacy search type (not neural/semantic).
- The response.requestTime field from the Exa SDK is not documented in our codebase — if it's not exposed, the latency will fall back to client-side searchTimeMs (Date.now() round-trip). This needs verification.
- App.jsx — completely new component structure. Vercel build passed but runtime testing in preview is critical.
- cerebrasLogo import added in earlier commits (frontend/src/assets/cerebras-logo.svg) — confirmed present in diff.
- flex-1 on both panes (50/50 split). On narrow screens (<768px), the two-column layout may be unusable — consider stacking panes vertically for mobile, but this was not implemented.
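The exaMode mapping noted above can be sketched as a small lookup. The table and default follow this PR description (default "instant"); the helper name is illustrative:

```javascript
// Sketch: map the request-body exaMode to an Exa search type.
// Note "fast" maps to Exa's keyword (legacy) search, not a type
// literally named "fast".
const EXA_MODE_TO_TYPE = {
  instant: "instant",
  fast: "keyword",
  auto: "auto",
};

function resolveExaType(exaMode) {
  // Default per the PR: unknown or missing modes fall back to instant.
  return EXA_MODE_TO_TYPE[exaMode] ?? "instant";
}
```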