
[chatbot-demo]: Cerebras llama3.1-8b + split-screen dual streaming UI#17

Open
jonah-berman wants to merge 53 commits into main from devin/1773098223-cerebras-inference

Conversation


@jonah-berman jonah-berman commented Mar 9, 2026

[chatbot-demo]: Cerebras llama3.1-8b + split-screen dual streaming UI

Summary

Replaces OpenRouter (Gemini 2.5 Flash) with Cerebras inference API using llama3.1-8b across all server-side files and the frontend. Implements a split-screen dual streaming architecture that sends each query to two parallel systems: Cerebras-only (left pane) vs. Cerebras+Exa search (right pane), with an Exa mode dropdown and latency tracking.

Updates since last revision (commit 2be76a7)

UI refinements (App.jsx):

  • Identical header heights: Both pane headers now use fixed h-10 — previously the right pane header was taller due to the inline mode toggle buttons
  • ModeDropdown replaces ModeToggle: Exa mode selector is now a compact dropdown button (Instant ▾) that opens on click, instead of showing all three buttons inline. Closes on outside click.
  • SourcesBanner component: When Exa search results return (typically in a few hundred ms), sources are shown immediately at the top of the message — with stacked favicons and expandable source list — rather than lingering on the "Searching..." state until streaming completes. Mirrors the UX pattern from the Exa highlight extension.
  • Reduced chart generation: System prompt updated to only produce charts when the user explicitly asks, instead of proactively generating them for any numeric data.
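The outside-click close behaviour described for ModeDropdown can be sketched framework-free (the actual component is React; the function and parameter names here are hypothetical, for illustration only):

```javascript
// Hedged sketch of "closes on outside click": a document-level click
// listener that ignores clicks inside the dropdown and closes otherwise.
// `doc` is injectable so the behaviour can be exercised without a DOM.
function attachOutsideClickClose(dropdownEl, close, doc = document) {
  const onClick = (event) => {
    // Clicks inside the dropdown (including the trigger button) are ignored
    if (!dropdownEl.contains(event.target)) {
      close();
      doc.removeEventListener("click", onClick);
    }
  };
  doc.addEventListener("click", onClick);
  // Return a cleanup function (in React this would run on unmount)
  return () => doc.removeEventListener("click", onClick);
}
```

In the React component this logic would typically live in a `useEffect` keyed on the dropdown's open state.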

Latency tracking improvements (stream.js):

  • Server-side Exa timing: searchExa() now captures response.requestTime from the Exa SDK (the API's own reported processing time, in seconds → converted to ms). This is sent as exaServerTimeMs in both search_complete and done SSE events. Frontend prefers this over the client-side round-trip measurement, matching how the Exa highlight extension reports latency.
  • ⚠️ Note: If the Exa SDK doesn't expose requestTime on the response object, exaServerTimeMs will be null and the frontend falls back to client-side searchTimeMs. This needs verification with a live query.
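The timing capture and fallback above can be sketched as follows. Note this assumes the Exa SDK exposes `requestTime` in seconds on the response object, which (per the warning above) is unverified; the helper name is hypothetical.

```javascript
// Hedged sketch: prefer the Exa API's own reported processing time
// (assumed field `requestTime`, in seconds), falling back to the
// client-side round-trip measurement when it is absent.
function extractExaTimings(response, clientStartMs) {
  const serverSecs =
    response && typeof response.requestTime === "number"
      ? response.requestTime
      : null;
  return {
    // null signals the frontend to fall back to searchTimeMs
    exaServerTimeMs: serverSecs !== null ? Math.round(serverSecs * 1000) : null,
    // Client-side round-trip, always measured
    searchTimeMs: Date.now() - clientStartMs,
  };
}
```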

Split-screen dual streaming (commit 107fa5a)

Frontend rewrite (App.jsx):

  • Complete rewrite from single-chat to split-screen layout with two independently scrolling panes
  • Left pane: "Without Exa" — streams Cerebras responses with no web search (displays total latency)
  • Right pane: "With Exa" — streams Cerebras+Exa search responses (displays Exa ms, Cerebras ms, and total latency breakdown)
  • LatencyBar component: Styled after the Exa highlight extension with blue (#0040f0) millisecond values and Exa/Cerebras logos
  • Dual parallel streaming via Promise.allSettled() — both panes fire simultaneously on query submit
  • Shared query input pinned at bottom center
  • Removed OpenAI logo and value proposition bullets from earlier iterations
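The dual-dispatch on submit can be sketched roughly as below. The endpoint path and request-body shape are assumptions for illustration, and `streamPane` is a hypothetical callback that consumes one pane's response:

```javascript
// Minimal sketch of firing both panes' requests simultaneously.
// Promise.allSettled (rather than Promise.all) ensures one pane failing
// cannot abort the other pane's stream.
async function submitQuery(query, exaMode, streamPane, fetchFn = globalThis.fetch) {
  const panes = [
    { id: "left", body: { query } },            // Cerebras only, no search
    { id: "right", body: { query, exaMode } },  // Cerebras + Exa search
  ];
  const results = await Promise.allSettled(
    panes.map((p) =>
      fetchFn("/api/chat/stream", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(p.body),
      }).then((res) => streamPane(p.id, res))
    )
  );
  return results.map((r) => r.status); // "fulfilled" or "rejected" per pane
}
```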

Backend changes (api/chat/stream.js):

  • searchExa() and searchMultiple() now accept a searchType parameter; the default changed from "auto" to "instant", with highlights capped at maxCharacters: 4000
  • New exaMode request body parameter: "instant" → type instant, "fast" → type keyword, "auto" → type auto
  • Fast path for non-Exa requests: Skips tool calling entirely — streams directly from Cerebras with assistant prefix cleaning, returns totalMs in done event
  • Latency tracking: initialCallMs (first Cerebras call for tool detection), finalCallMs (final Cerebras call after search), searchTimeMs (client-side round-trip), and exaServerTimeMs (Exa API's own timing) included in SSE done event
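Two of the backend details above can be sketched together: the exaMode-to-search-type mapping and the latency fields carried on the final SSE event. Mode names and field names come from this PR's description; the framing helper itself is hypothetical.

```javascript
// "fast" intentionally maps to Exa's keyword search type, not neural
const EXA_MODE_TO_TYPE = { instant: "instant", fast: "keyword", auto: "auto" };

function resolveSearchType(exaMode) {
  // Unknown or missing modes fall back to the new default, "instant"
  return EXA_MODE_TO_TYPE[exaMode] || "instant";
}

function sseDoneEvent({ initialCallMs, finalCallMs, searchTimeMs, exaServerTimeMs, totalMs }) {
  // SSE frames are "data: <payload>" lines terminated by a blank line
  const payload = { type: "done", initialCallMs, finalCallMs, searchTimeMs, exaServerTimeMs, totalMs };
  return `data: ${JSON.stringify(payload)}\n\n`;
}
```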

Earlier changes (Cerebras migration + robustness)

Backend changes:

  • baseURL: openrouter.ai/api/v1 → api.cerebras.ai/v1 (3 files: server.js, api/chat.js, api/chat/stream.js)
  • apiKey: OPEN_ROUTER_KEY env var → CEREBRAS_API_KEY env var with hardcoded fallback (per requester's instruction — free-tier key)
  • DEFAULT_MODEL: google/gemini-2.5-flash → llama3.1-8b
  • Robust content-as-tool-call extraction (tryExtractToolCallFromContent): Two-layer parsing for when the model outputs tool calls as raw JSON in the content field:
    • Layer 1: Direct JSON.parse — handles well-formed JSON in multiple envelope formats
    • Layer 2: Regex fallback — handles malformed JSON where llama3.1-8b outputs unescaped inner quotes. Extracts "query" fields via regex.
  • Final response content filtering: Strips tool call JSON and stray assistant role text that leaks into the final response after search results
  • Empty response retry logic: If the model returns no content and no tool calls, retries the identical request once
  • Context truncation for follow-ups: Assistant messages truncated to 500 characters, history window reduced from 20 to 10 messages to prevent the 8B model from being overwhelmed
  • SSE heartbeat: Added :ok initial comment and 3s heartbeat interval to prevent connection stalls
  • maxDuration: 120 for both API functions in vercel.json
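The two-layer tryExtractToolCallFromContent parsing described above can be sketched as follows. The function name follows the PR; the exact envelope shapes handled are assumptions from the description.

```javascript
// Layer 1: strict JSON.parse for well-formed output in a few envelope
// shapes. Layer 2: tolerant regex extraction of the "query" field, which
// survives llama3.1-8b's unescaped inner quotes.
function tryExtractToolCallFromContent(content) {
  if (!content) return null;
  try {
    const parsed = JSON.parse(content.trim());
    const query =
      parsed.query ||
      (parsed.parameters && parsed.parameters.query) ||
      (parsed.arguments && parsed.arguments.query);
    if (typeof query === "string") return { query };
  } catch {
    // Malformed JSON: fall through to the regex layer
  }
  // Lazily match up to the quote that is actually followed by , or } so
  // unescaped quotes inside the value are kept
  const m = content.match(/"query"\s*:\s*"([\s\S]*?)"\s*[,}]/);
  return m ? { query: m[1] } : null;
}
```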

Vercel environment:

  • EXA_API_KEY added to Vercel preview environment variables
  • CEREBRAS_API_KEY added to Vercel preview environment variables

Review & Testing Checklist for Human

  • Verify server-side Exa latency is working — In the preview, send a query and check the right pane latency bar. If exaServerTimeMs is properly captured from the Exa SDK, the "Exa: XXXms" value should be noticeably lower than the old client-side round-trip timing (typically ~100-300ms faster). If it falls back to client-side timing, this means response.requestTime isn't exposed by the SDK and needs investigation.
  • Test mode dropdown functionality — Click the "Instant ▾" dropdown on the right pane. Verify it opens to show Instant/Fast/Auto options, closes on outside click, and correctly switches modes. Send identical queries in each mode to confirm Fast uses keyword search (faster, less accurate) vs Instant (neural, more accurate).
  • Verify instant source display — When you send a query, the right pane should transition from "Searching..." directly to a collapsible source banner (e.g., "Exa found 5 sources in 407ms") with stacked favicons, before the LLM response starts streaming. This should happen in the first few hundred milliseconds. Click to expand and verify sources are correct.
  • Check header alignment — Both pane headers should be exactly the same height (40px). Previously the right header was slightly taller.
  • Test split-screen dual streaming — Load the preview URL and send a query. Verify both panes stream responses in parallel — left pane should complete faster (no search), right pane should show the source banner then final response.
  • Verify no unsolicited charts — The system prompt now says "only include charts when the user EXPLICITLY asks". Send queries about numeric data (e.g., "what are the top 5 AI startups by funding?") and verify it returns prose instead of a chart block unless you explicitly say "show me a chart".

Notes

  • The mode dropdown mapping is: instant → Exa type instant, fast → Exa type keyword, auto → Exa type auto. The "fast" label maps to keyword search which is Exa's legacy search type (not neural/semantic).
  • The response.requestTime field from the Exa SDK is not documented in our codebase — if it's not exposed, the latency will fall back to client-side searchTimeMs (Date.now() round-trip). This needs verification.
  • The split-screen rewrite removed ~336 lines and added ~399 lines in App.jsx — completely new component structure. Vercel build passed but runtime testing in preview is critical.
  • The dual parallel requests may stress the free-tier Cerebras API key faster than the old single-request flow — monitor for rate limit errors.
  • cerebrasLogo import added in earlier commits (frontend/src/assets/cerebras-logo.svg) — confirmed present in diff.
  • The responsive split-screen layout uses flex-1 on both panes (50/50 split). On narrow screens (<768px), the two-column layout may be unusable — consider stacking panes vertically for mobile, but this was not implemented.
  • Requested by: @jonah-berman
  • Devin session


devin-ai-integration bot and others added 2 commits March 9, 2026 23:18
- baseURL: openrouter.ai -> api.cerebras.ai/v1
- model: google/gemini-2.5-flash -> gpt-oss-120b
- updated server.js, api/chat.js, api/chat/stream.js, frontend App.jsx

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@vercel

vercel bot commented Mar 9, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
chatbot-demo Ready Ready Preview, Comment Mar 11, 2026 9:42pm

…value props

- Replace OpenAI icon with actual Cerebras logo (orange arcs + C mark)
- Add OpenAI logo alongside Cerebras logo
- Remove How It Works link
- Add model info subtext (gpt-oss-120b)
- Add 4 Exa value prop bullets in header

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
@devin-ai-integration devin-ai-integration bot changed the title [chatbot-demo]: Switch inference to Cerebras API (gpt-oss-120b) [chatbot-demo]: Cerebras API + polished UI with logos and Exa value props Mar 9, 2026

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 3 potential issues.

View 4 additional findings in Devin Review.


Comment thread server.js
baseURL: "https://openrouter.ai/api/v1",
apiKey: process.env.OPEN_ROUTER_KEY,
baseURL: "https://api.cerebras.ai/v1",
apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",

🔴 Hardcoded API key exposed in source code (server.js)

The Cerebras API key csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x is hardcoded as a fallback value in the source code. This key will be committed to version control and publicly exposed in the repository. The .gitignore explicitly excludes .env to keep secrets private, and CLAUDE.md instructs developers to "Add API keys to .env", making it clear that secrets should not be in source code. Anyone with access to the repo can use this key to make API calls at the owner's expense.

Suggested change
apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",
apiKey: process.env.CEREBRAS_API_KEY,


Comment thread api/chat.js
baseURL: "https://openrouter.ai/api/v1",
apiKey: process.env.OPEN_ROUTER_KEY,
baseURL: "https://api.cerebras.ai/v1",
apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",

🔴 Hardcoded API key exposed in source code (api/chat.js)

Same hardcoded Cerebras API key csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x is exposed in api/chat.js:6. This is the Vercel serverless function for the non-streaming chat endpoint. The key should be loaded exclusively from environment variables.

Suggested change
apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",
apiKey: process.env.CEREBRAS_API_KEY,


Comment thread api/chat/stream.js
baseURL: "https://openrouter.ai/api/v1",
apiKey: process.env.OPEN_ROUTER_KEY,
baseURL: "https://api.cerebras.ai/v1",
apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",

🔴 Hardcoded API key exposed in source code (api/chat/stream.js)

Same hardcoded Cerebras API key csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x is exposed in api/chat/stream.js:6. This is the Vercel serverless function for the streaming chat endpoint. The key should be loaded exclusively from environment variables.

Suggested change
apiKey: process.env.CEREBRAS_API_KEY || "csk-ctnvpnrpxw5t244c83c84pdecwk9tpfdp3jkvece9kve248x",
apiKey: process.env.CEREBRAS_API_KEY,


gpt-oss-120b is a reasoning model that by default prepends thinking
tokens to content. Setting reasoning_format to 'hidden' drops them
from the response, fixing the raw JSON/reasoning text leaking issue.

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…artifacts

reasoning_format: hidden does not fully suppress reasoning when the model
processes tool results. The model embeds JSON search/cursor objects and
internal monologue in the content field. This adds cleanReasoningArtifacts()
to find the last reasoning artifact and extract only the clean answer.

Also buffers final response (after tool calls) instead of streaming it
directly, since we need the full content to strip artifacts.

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…g and citation markers

The previous version only stripped JSON artifacts. gpt-oss-120b also outputs:
- Citation markers like {14†L0-L3} and 【1†L1-L4】
- Plain text reasoning lines ("The page could not be opened", "Now compile answer...")

New cleaning approach:
1. Strip citation markers (both curly brace and bracket styles)
2. Strip JSON reasoning artifacts (search_query, cursor, etc.)
3. Strip text reasoning markers ([Results], Search results:)
4. Find where the actual formatted answer starts (bold headings, tables, numbered lists)
5. Return only the clean answer
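Steps 1 and 2 of the cleaning approach can be sketched as below. The exact patterns used in the commit are not shown in this PR, so these regexes are assumptions matching only the marker examples given ({14†L0-L3}, 【1†L1-L4】); the function name is hypothetical.

```javascript
// Hedged sketch: strip both citation-marker styles gpt-oss-120b emits,
// then collapse the doubled spaces left behind by the removal.
function stripCitationMarkers(text) {
  return text
    .replace(/\{\d+†[^}]*\}/g, "")   // curly-brace style: {14†L0-L3}
    .replace(/【\d+†[^】]*】/g, "")  // CJK-bracket style: 【1†L1-L4】
    .replace(/ {2,}/g, " ")          // clean up double spaces
    .trim();
}
```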

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
- Strip additional citation formats: {6}[7] and bare [1]
- Insert newline before mid-line **Bold to catch concatenated reasoning+answer
- Use regex-based markdown detection instead of line-by-line scanning
- Clean up double spaces after citation removal
- All 3 server files updated with identical logic

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
- Set maxDuration: 120s for both API functions in vercel.json
- Send SSE heartbeat comments every 5s while buffering final response
- Prevents Vercel function timeout during long Cerebras reasoning calls

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…ifact cleaning

- Switch model from gpt-oss-120b (reasoning) to llama3.3-70b (instruction)
- Remove cleanReasoningArtifacts() from all 3 server files
- Remove reasoning_format: hidden from all API calls
- Restore direct streaming for final response (no more buffering)
- Remove maxDuration config (no longer needed without buffering)
- Update frontend model info text

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…non-reasoning model)

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
llama3.1-8b on Cerebras outputs tool calls as content text when streaming
instead of structured tool_calls deltas. Fix: use non-streaming for initial
call (tool detection), stream only the final response. Also restore
maxDuration=60 for Vercel functions.

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
@devin-ai-integration devin-ai-integration bot changed the title [chatbot-demo]: Cerebras API + polished UI with logos and Exa value props [chatbot-demo]: Switch to Cerebras llama3.1-8b + polished UI Mar 10, 2026
…alls field

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
… user query

- Cuts Cerebras API calls from 3-4 to 2 per query (1 per pane)
- No more tool call detection/parsing needed on Exa side
- Search results injected as user message context instead of tool messages
- Switch back to llama3.1-8b (gpt-oss-120b had stricter free-tier rate limits)
- Keeps 429 retry with exponential backoff

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…r final response

- OpenRouter generates optimized search queries using full system prompt + tool definition
- Cerebras llama3.1-8b handles final response summarization (fast, no rate limit pressure)
- Fallback to user query if OpenRouter doesn't generate tool calls
- Keeps 429 retry with exponential backoff on Cerebras calls

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…cused summarize prompt, prioritize recent sources

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…sufficient

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…ode on refresh

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…reakdown (tool call/exa/synthesis/total)

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
…on-people search

Co-Authored-By: jonah@exa.ai <jonah@exa.ai>
@devin-ai-integration

Closing — the Cerebras demo now lives in exa-labs/public-demos (merged via PR #19) and is live at https://exa.ai/demos-cerebras. This branch on the original chatbot-demo repo is no longer needed.
