
[chatbot-demo]: Cerebras llama3.1-8b + split-screen dual streaming UI#17

Open
jonah-berman wants to merge 53 commits into main from devin/1773098223-cerebras-inference

Conversation


@jonah-berman jonah-berman commented Mar 9, 2026

[chatbot-demo]: Cerebras llama3.1-8b + split-screen dual streaming UI

Summary

Replaces OpenRouter (Gemini 2.5 Flash) with Cerebras inference API using llama3.1-8b across all server-side files and the frontend. Implements a split-screen dual streaming architecture that sends each query to two parallel systems: Cerebras-only (left pane) vs. Cerebras+Exa search (right pane), with an Exa mode dropdown and latency tracking.

Updates since last revision (commit 2be76a7)

UI refinements (App.jsx):

  • Identical header heights: Both pane headers now use fixed h-10 — previously the right pane header was taller due to the inline mode toggle buttons
  • ModeDropdown replaces ModeToggle: Exa mode selector is now a compact dropdown button (Instant ▾) that opens on click, instead of showing all three buttons inline. Closes on outside click.
  • SourcesBanner component: When Exa search results return (typically in a few hundred ms), sources are shown immediately at the top of the message — with stacked favicons and expandable source list — rather than lingering on the "Searching..." state until streaming completes. Mirrors the UX pattern from the Exa highlight extension.
  • Reduced chart generation: System prompt updated to only produce charts when the user explicitly asks, instead of proactively generating them for any numeric data.
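The outside-click close behaviour described for ModeDropdown can be sketched framework-free (the actual component is React; the function and parameter names here are hypothetical, for illustration only):

```javascript
// Hedged sketch of "closes on outside click": a document-level click
// listener that ignores clicks inside the dropdown and closes otherwise.
// `doc` is injectable so the behaviour can be exercised without a DOM.
function attachOutsideClickClose(dropdownEl, close, doc = document) {
  const onClick = (event) => {
    // Clicks inside the dropdown (including the trigger button) are ignored
    if (!dropdownEl.contains(event.target)) {
      close();
      doc.removeEventListener("click", onClick);
    }
  };
  doc.addEventListener("click", onClick);
  // Return a cleanup function (in React this would run on unmount)
  return () => doc.removeEventListener("click", onClick);
}
```

In the React component this logic would typically live in a `useEffect` keyed on the dropdown's open state.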

Latency tracking improvements (stream.js):

  • Server-side Exa timing: searchExa() now captures response.requestTime from the Exa SDK (the API's own reported processing time, in seconds → converted to ms). This is sent as exaServerTimeMs in both search_complete and done SSE events. Frontend prefers this over the client-side round-trip measurement, matching how the Exa highlight extension reports latency.
  • ⚠️ Note: If the Exa SDK doesn't expose requestTime on the response object, exaServerTimeMs will be null and the frontend falls back to client-side searchTimeMs. This needs verification with a live query.
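The timing capture and fallback above can be sketched as follows. Note this assumes the Exa SDK exposes `requestTime` in seconds on the response object, which (per the warning above) is unverified; the helper name is hypothetical.

```javascript
// Hedged sketch: prefer the Exa API's own reported processing time
// (assumed field `requestTime`, in seconds), falling back to the
// client-side round-trip measurement when it is absent.
function extractExaTimings(response, clientStartMs) {
  const serverSecs =
    response && typeof response.requestTime === "number"
      ? response.requestTime
      : null;
  return {
    // null signals the frontend to fall back to searchTimeMs
    exaServerTimeMs: serverSecs !== null ? Math.round(serverSecs * 1000) : null,
    // Client-side round-trip, always measured
    searchTimeMs: Date.now() - clientStartMs,
  };
}
```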

Split-screen dual streaming (commit 107fa5a)

Frontend rewrite (App.jsx):

  • Complete rewrite from single-chat to split-screen layout with two independently scrolling panes
  • Left pane: "Without Exa" — streams Cerebras responses with no web search (displays total latency)
  • Right pane: "With Exa" — streams Cerebras+Exa search responses (displays Exa ms, Cerebras ms, and total latency breakdown)
  • LatencyBar component: Styled after the Exa highlight extension with blue (#0040f0) millisecond values and Exa/Cerebras logos
  • Dual parallel streaming via Promise.allSettled() — both panes fire simultaneously on query submit
  • Shared query input pinned at bottom center
  • Removed OpenAI logo and value proposition bullets from earlier iterations
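The dual-dispatch on submit can be sketched roughly as below. The endpoint path and request-body shape are assumptions for illustration, and `streamPane` is a hypothetical callback that consumes one pane's response:

```javascript
// Minimal sketch of firing both panes' requests simultaneously.
// Promise.allSettled (rather than Promise.all) ensures one pane failing
// cannot abort the other pane's stream.
async function submitQuery(query, exaMode, streamPane, fetchFn = globalThis.fetch) {
  const panes = [
    { id: "left", body: { query } },            // Cerebras only, no search
    { id: "right", body: { query, exaMode } },  // Cerebras + Exa search
  ];
  const results = await Promise.allSettled(
    panes.map((p) =>
      fetchFn("/api/chat/stream", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(p.body),
      }).then((res) => streamPane(p.id, res))
    )
  );
  return results.map((r) => r.status); // "fulfilled" or "rejected" per pane
}
```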

Backend changes (api/chat/stream.js):

  • searchExa() and searchMultiple() now accept a searchType parameter; the default changed from "auto" to "instant", with highlights capped at maxCharacters: 4000
  • New exaMode request body parameter: "instant" → type instant, "fast" → type keyword, "auto" → type auto
  • Fast path for non-Exa requests: Skips tool calling entirely — streams directly from Cerebras with assistant prefix cleaning, returns totalMs in done event
  • Latency tracking: initialCallMs (first Cerebras call for tool detection), finalCallMs (final Cerebras call after search), searchTimeMs (client-side round-trip), and exaServerTimeMs (Exa API's own timing) included in SSE done event
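Two of the backend details above can be sketched together: the exaMode-to-search-type mapping and the latency fields carried on the final SSE event. Mode names and field names come from this PR's description; the framing helper itself is hypothetical.

```javascript
// "fast" intentionally maps to Exa's keyword search type, not neural
const EXA_MODE_TO_TYPE = { instant: "instant", fast: "keyword", auto: "auto" };

function resolveSearchType(exaMode) {
  // Unknown or missing modes fall back to the new default, "instant"
  return EXA_MODE_TO_TYPE[exaMode] || "instant";
}

function sseDoneEvent({ initialCallMs, finalCallMs, searchTimeMs, exaServerTimeMs, totalMs }) {
  // SSE frames are "data: <payload>" lines terminated by a blank line
  const payload = { type: "done", initialCallMs, finalCallMs, searchTimeMs, exaServerTimeMs, totalMs };
  return `data: ${JSON.stringify(payload)}\n\n`;
}
```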

Earlier changes (Cerebras migration + robustness)

Backend changes:

  • baseURL: openrouter.ai/api/v1 → api.cerebras.ai/v1 (3 files: server.js, api/chat.js, api/chat/stream.js)
  • apiKey: OPEN_ROUTER_KEY env var → CEREBRAS_API_KEY env var with hardcoded fallback (per requester's instruction — free-tier key)
  • DEFAULT_MODEL: google/gemini-2.5-flash → llama3.1-8b
  • Robust content-as-tool-call extraction (tryExtractToolCallFromContent): Two-layer parsing for when the model outputs tool calls as raw JSON in the content field:
    • Layer 1: Direct JSON.parse — handles well-formed JSON in multiple envelope formats
    • Layer 2: Regex fallback — handles malformed JSON where llama3.1-8b outputs unescaped inner quotes. Extracts "query" fields via regex.
  • Final response content filtering: Strips tool call JSON and stray assistant role text that leaks into the final response after search results
  • Empty response retry logic: If the model returns no content and no tool calls, retries the identical request once
  • Context truncation for follow-ups: Assistant messages truncated to 500 characters, history window reduced from 20 to 10 messages to prevent the 8B model from being overwhelmed
  • SSE heartbeat: Added :ok initial comment and 3s heartbeat interval to prevent connection stalls
  • maxDuration: 120 for both API functions in vercel.json
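The two-layer tryExtractToolCallFromContent parsing described above can be sketched as follows. The function name follows the PR; the exact envelope shapes handled are assumptions from the description.

```javascript
// Layer 1: strict JSON.parse for well-formed output in a few envelope
// shapes. Layer 2: tolerant regex extraction of the "query" field, which
// survives llama3.1-8b's unescaped inner quotes.
function tryExtractToolCallFromContent(content) {
  if (!content) return null;
  try {
    const parsed = JSON.parse(content.trim());
    const query =
      parsed.query ||
      (parsed.parameters && parsed.parameters.query) ||
      (parsed.arguments && parsed.arguments.query);
    if (typeof query === "string") return { query };
  } catch {
    // Malformed JSON: fall through to the regex layer
  }
  // Lazily match up to the quote that is actually followed by , or } so
  // unescaped quotes inside the value are kept
  const m = content.match(/"query"\s*:\s*"([\s\S]*?)"\s*[,}]/);
  return m ? { query: m[1] } : null;
}
```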

Vercel environment:

  • EXA_API_KEY added to Vercel preview environment variables
  • CEREBRAS_API_KEY added to Vercel preview environment variables

Review & Testing Checklist for Human

  • Verify server-side Exa latency is working — In the preview, send a query and check the right pane latency bar. If exaServerTimeMs is properly captured from the Exa SDK, the "Exa: XXXms" value should be noticeably lower than the old client-side round-trip timing (typically ~100-300ms faster). If it falls back to client-side timing, this means response.requestTime isn't exposed by the SDK and needs investigation.
  • Test mode dropdown functionality — Click the "Instant ▾" dropdown on the right pane. Verify it opens to show Instant/Fast/Auto options, closes on outside click, and correctly switches modes. Send identical queries in each mode to confirm Fast uses keyword search (faster, less accurate) vs Instant (neural, more accurate).
  • Verify instant source display — When you send a query, the right pane should transition from "Searching..." directly to a collapsible source banner (e.g., "Exa found 5 sources in 407ms") with stacked favicons, before the LLM response starts streaming. This should happen in the first few hundred milliseconds. Click to expand and verify sources are correct.
  • Check header alignment — Both pane headers should be exactly the same height (40px). Previously the right header was slightly taller.
  • Test split-screen dual streaming — Load the preview URL and send a query. Verify both panes stream responses in parallel — left pane should complete faster (no search), right pane should show the source banner then final response.
  • Verify no unsolicited charts — The system prompt now says "only include charts when the user EXPLICITLY asks". Send queries about numeric data (e.g., "what are the top 5 AI startups by funding?") and verify it returns prose instead of a chart block unless you explicitly say "show me a chart".

Notes

  • The mode dropdown mapping is: instant → Exa type instant, fast → Exa type keyword, auto → Exa type auto. The "fast" label maps to keyword search which is Exa's legacy search type (not neural/semantic).
  • The response.requestTime field from the Exa SDK is not documented in our codebase — if it's not exposed, the latency will fall back to client-side searchTimeMs (Date.now() round-trip). This needs verification.
  • The split-screen rewrite removed ~336 lines and added ~399 lines in App.jsx — completely new component structure. Vercel build passed but runtime testing in preview is critical.
  • The dual parallel requests may stress the free-tier Cerebras API key faster than the old single-request flow — monitor for rate limit errors.
  • cerebrasLogo import added in earlier commits (frontend/src/assets/cerebras-logo.svg) — confirmed present in diff.
  • The responsive split-screen layout uses flex-1 on both panes (50/50 split). On narrow screens (<768px), the two-column layout may be unusable — consider stacking panes vertically for mobile, but this was not implemented.
  • Requested by: @jonah-berman
  • Devin session


devin-ai-integration bot and others added 2 commits March 9, 2026 23:18
- baseURL: openrouter.ai -> api.cerebras.ai/v1
- model: google/gemini-2.5-flash -> gpt-oss-120b
- updated server.js, api/chat.js, api/chat/stream.js, frontend App.jsx

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@vercel

vercel bot commented Mar 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
chatbot-demo Ready Ready Preview, Comment Mar 11, 2026 9:42pm

…value props

- Replace OpenAI icon with actual Cerebras logo (orange arcs + C mark)
- Add OpenAI logo alongside Cerebras logo
- Remove How It Works link
- Add model info subtext (gpt-oss-120b)
- Add 4 Exa value prop bullets in header

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
@devin-ai-integration devin-ai-integration bot changed the title [chatbot-demo]: Switch inference to Cerebras API (gpt-oss-120b) [chatbot-demo]: Cerebras API + polished UI with logos and Exa value props Mar 9, 2026

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 3 potential issues.

View 4 additional findings in Devin Review.


Comment thread server.js
baseURL: "https://openrouter.ai/api/v1",
apiKey: process.env.OPEN_ROUTER_KEY,
baseURL: "https://api.cerebras.ai/v1",
apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",

🔴 Hardcoded API key exposed in source code (server.js)

The Cerebras API key csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x is hardcoded as a fallback value in the source code. This key will be committed to version control and publicly exposed in the repository. The .gitignore explicitly excludes .env to keep secrets private, and CLAUDE.md instructs developers to "Add API keys to .env", making it clear that secrets should not be in source code. Anyone with access to the repo can use this key to make API calls at the owner's expense.

Suggested change
apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",
apiKey: process.env.CEREBRAS_API_KEY,


Comment thread api/chat.js
baseURL: "https://openrouter.ai/api/v1",
apiKey: process.env.OPEN_ROUTER_KEY,
baseURL: "https://api.cerebras.ai/v1",
apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",

🔴 Hardcoded API key exposed in source code (api/chat.js)

Same hardcoded Cerebras API key csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x is exposed in api/chat.js:6. This is the Vercel serverless function for the non-streaming chat endpoint. The key should be loaded exclusively from environment variables.

Suggested change
apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",
apiKey: process.env.CEREBRAS_API_KEY,


Comment thread api/chat/stream.js
baseURL: "https://openrouter.ai/api/v1",
apiKey: process.env.OPEN_ROUTER_KEY,
baseURL: "https://api.cerebras.ai/v1",
apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",

🔴 Hardcoded API key exposed in source code (api/chat/stream.js)

Same hardcoded Cerebras API key csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x is exposed in api/chat/stream.js:6. This is the Vercel serverless function for the streaming chat endpoint. The key should be loaded exclusively from environment variables.

Suggested change
apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",
apiKey: process.env.CEREBRAS_API_KEY,


gpt-oss-120b is a reasoning model that by default prepends thinking
tokens to content. Setting reasoning_format to 'hidden' drops them
from the response, fixing the raw JSON/reasoning text leaking issue.

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…artifacts

reasoning_format: hidden does not fully suppress reasoning when the model
processes tool results. The model embeds JSON search/cursor objects and
internal monologue in the content field. This adds cleanReasoningArtifacts()
to find the last reasoning artifact and extract only the clean answer.

Also buffers final response (after tool calls) instead of streaming it
directly, since we need the full content to strip artifacts.

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…g and citation markers

The previous version only stripped JSON artifacts. gpt-oss-120b also outputs:
- Citation markers like {14†L0-L3} and 【1†L1-L4】
- Plain text reasoning lines ("The page could not be opened", "Now compile answer...")

New cleaning approach:
1. Strip citation markers (both curly brace and bracket styles)
2. Strip JSON reasoning artifacts (search_query, cursor, etc.)
3. Strip text reasoning markers ([Results], Search results:)
4. Find where the actual formatted answer starts (bold headings, tables, numbered lists)
5. Return only the clean answer
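Steps 1 and 2 of the cleaning approach can be sketched as below. The exact patterns used in the commit are not shown in this PR, so these regexes are assumptions matching only the marker examples given ({14†L0-L3}, 【1†L1-L4】); the function name is hypothetical.

```javascript
// Hedged sketch: strip both citation-marker styles gpt-oss-120b emits,
// then collapse the doubled spaces left behind by the removal.
function stripCitationMarkers(text) {
  return text
    .replace(/\{\d+†[^}]*\}/g, "")   // curly-brace style: {14†L0-L3}
    .replace(/【\d+†[^】]*】/g, "")  // CJK-bracket style: 【1†L1-L4】
    .replace(/ {2,}/g, " ")          // clean up double spaces
    .trim();
}
```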

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
- Strip additional citation formats: {6}[7] and bare [1]
- Insert newline before mid-line **Bold to catch concatenated reasoning+answer
- Use regex-based markdown detection instead of line-by-line scanning
- Clean up double spaces after citation removal
- All 3 server files updated with identical logic

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
- Set maxDuration: 120s for both API functions in vercel.json
- Send SSE heartbeat comments every 5s while buffering final response
- Prevents Vercel function timeout during long Cerebras reasoning calls

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…ifact cleaning

- Switch model from gpt-oss-120b (reasoning) to llama3.3-70b (instruction)
- Remove cleanReasoningArtifacts() from all 3 server files
- Remove reasoning_format: hidden from all API calls
- Restore direct streaming for final response (no more buffering)
- Remove maxDuration config (no longer needed without buffering)
- Update frontend model info text

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…non-reasoning model)

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
llama3.1-8b on Cerebras outputs tool calls as content text when streaming
instead of structured tool_calls deltas. Fix: use non-streaming for initial
call (tool detection), stream only the final response. Also restore
maxDuration=60 for Vercel functions.

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
@devin-ai-integration devin-ai-integration bot changed the title [chatbot-demo]: Cerebras API + polished UI with logos and Exa value props [chatbot-demo]: Switch to Cerebras llama3.1-8b + polished UI Mar 10, 2026
…alls field

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
… user query

- Cuts Cerebras API calls from 3-4 to 2 per query (1 per pane)
- No more tool call detection/parsing needed on Exa side
- Search results injected as user message context instead of tool messages
- Switch back to llama3.1-8b (gpt-oss-120b had stricter free-tier rate limits)
- Keeps 429 retry with exponential backoff

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…r final response

- OpenRouter generates optimized search queries using full system prompt + tool definition
- Cerebras llama3.1-8b handles final response summarization (fast, no rate limit pressure)
- Fallback to user query if OpenRouter doesn't generate tool calls
- Keeps 429 retry with exponential backoff on Cerebras calls

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…cused summarize prompt, prioritize recent sources

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…sufficient

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…ode on refresh

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…reakdown (tool call/exa/synthesis/total)

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…on-people search

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
@devin-ai-integration

Closing — the Cerebras demo now lives in exa-labs/public-demos (merged via PR #19) and is live at https://exa.ai/demos-cerebras. This branch on the original chatbot-demo repo is no longer needed.
