Jarvis is a real-time, multimodal AI assistant built with React and the Google GenAI SDK. It integrates live audio and video streaming with advanced AI capabilities, including real-time conversation, internet search, and image generation. The application mimics a futuristic "Jarvis-like" interface, providing immediate visual and auditory feedback.
- Frontend Framework: React 18 with TypeScript
- Build Tool: Vite
- Styling: Tailwind CSS with custom animations
- AI Integration: Google GenAI SDK (`@google/genai`)
- Icons: Lucide React
The application is architected around a central LiveService that manages the persistent connection to the Google Gemini API. The frontend components react to state changes driven by this service.
`LiveService` is the heart of the application. It handles:
- Connection Management: Establishes and maintains the WebSocket/WebRTC session with the Gemini model (`gemini-2.5-flash-native-audio-preview-09-2025`).
- Audio Processing:
  - Captures microphone input using `AudioContext` and `ScriptProcessorNode`.
  - Converts audio data to PCM16 format for the model.
  - Plays back response audio from the model.
- Video Processing: Receives camera frames and transmits them as real-time input to the model to provide vision capabilities.
- Tool Execution: Intercepts function calls from the model (e.g., search, image generation) and delegates them to the `ToolService`.
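The PCM16 conversion step can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `floatTo16BitPCM` and `pcm16ToBase64` are hypothetical, and the base64 step assumes that audio chunks are sent to the model as base64-encoded strings.

```typescript
// Hypothetical helper: converts Float32 samples in [-1, 1], as delivered by
// the Web Audio API, to 16-bit signed PCM.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const output = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping around.
    const s = Math.max(-1, Math.min(1, input[i]));
    // Negative samples scale by 32768 and positive ones by 32767,
    // which covers the full int16 range.
    output[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return output;
}

// Hypothetical helper: base64-encodes the raw bytes of a PCM buffer.
// Assumption: the byte order is the platform's, which is little-endian
// on virtually all current hardware.
function pcm16ToBase64(pcm: Int16Array): string {
  const bytes = new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}
```

Clamping before scaling matters because microphone buffers can briefly exceed the nominal range, and an unclamped value would overflow into the opposite sign.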
`ToolService` handles the execution of specific tools requested by the AI model:
- Search: Uses `gemini-2.5-flash` with the `googleSearch` tool to retrieve real-time information.
- Image Generation: Uses `gemini-3-pro-image-preview` (Nano Banana Pro) to generate images from text prompts.
- Image Reimagination: Uses `gemini-3-pro-image-preview` (Nano Banana Pro) to modify or "reimagine" the user's camera feed based on a prompt.
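One way to route the model's function calls to these tools is a small name-to-handler registry. The sketch below is illustrative only: `ToolDispatcher`, `ToolCall`, and `ToolResult` are hypothetical names and shapes, not the project's actual types, and the handlers stand in for the real search and image calls.

```typescript
// Assumed shape of a function call received from the model.
interface ToolCall {
  id: string;
  name: string;
  args: Record<string, unknown>;
}

// Assumed shape of the result sent back to the model.
interface ToolResult {
  id: string;
  name: string;
  response: Record<string, unknown>;
}

type ToolHandler = (args: Record<string, unknown>) => Promise<Record<string, unknown>>;

// Hypothetical dispatcher: maps tool names to async handlers.
class ToolDispatcher {
  private handlers = new Map<string, ToolHandler>();

  register(name: string, handler: ToolHandler): void {
    this.handlers.set(name, handler);
  }

  // Always resolves, so a failing tool produces an error payload the model
  // can react to instead of breaking the live session.
  async execute(call: ToolCall): Promise<ToolResult> {
    const handler = this.handlers.get(call.name);
    if (!handler) {
      return { id: call.id, name: call.name, response: { error: `Unknown tool: ${call.name}` } };
    }
    try {
      return { id: call.id, name: call.name, response: await handler(call.args) };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return { id: call.id, name: call.name, response: { error: message } };
    }
  }
}
```

Returning errors as data rather than throwing keeps the conversation going when a tool fails, since the model still receives a response for every call it issued.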
- `App.tsx`: The main controller component. It initializes the `LiveService`, manages global state (connection status, volume, message logs), and orchestrates the UI layout.
- `components/CameraFeed.tsx`: Manages the webcam video stream. It extracts frames at a regular interval to send to the AI model.
- `components/Visualizer.tsx`: Renders a real-time audio visualizer based on the volume levels provided by the `LiveService`.
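As an illustration of how a volume level could drive the visualizer, the sketch below computes a root-mean-square (RMS) level from a buffer of samples and maps it to bar heights. Both functions are hypothetical; the source does not say how `LiveService` derives its volume values or how `Visualizer.tsx` draws them, so RMS and the linear falloff are assumptions.

```typescript
// Hypothetical helper: RMS level of a sample buffer, in [0, 1] for
// samples in [-1, 1].
function rmsVolume(samples: Float32Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Hypothetical helper: maps a volume level to bar heights, tallest in the
// centre with a linear falloff towards the edges.
function barHeights(volume: number, barCount: number, maxHeight: number): number[] {
  const v = Math.max(0, Math.min(1, volume));
  const center = (barCount - 1) / 2;
  return Array.from({ length: barCount }, (_, i) => {
    const falloff = 1 - Math.abs(i - center) / (center + 1);
    return v * maxHeight * falloff;
  });
}
```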
- Initialization: The user provides an API key (via `aistudio` injection or selection).
- Connection: `LiveService` connects to the Gemini Multimodal Live API.
- Input Loop:
  - Audio: Microphone data is constantly buffered, converted, and streamed to the model.
  - Video: `CameraFeed` captures frames, which `LiveService` sends as image data chunks.
- Model Processing: The Gemini model processes the audio and visual inputs in context.
- Output Loop:
  - Audio Response: The model streams audio chunks back, which are queued and played by `LiveService`.
  - Tool Calls: If the model determines a tool is needed (e.g., "draw a cat"), it sends a function call. `LiveService` executes the function via `ToolService`, sends the result back to the model, and updates the UI (e.g., displaying the generated image).
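Queuing streamed audio chunks for playback usually comes down to tracking when the next chunk should start. The sketch below shows one common approach under stated assumptions: `PlaybackScheduler` is a hypothetical class, and the source does not describe how `LiveService` actually schedules its chunks.

```typescript
// Hypothetical scheduler: computes gapless start times for audio chunks
// that arrive one at a time. All times are in seconds on the audio clock.
class PlaybackScheduler {
  private nextStartTime = 0;

  // Returns the time at which the chunk should start playing.
  schedule(currentTime: number, chunkDuration: number): number {
    // If the queue has run dry, restart from "now" rather than
    // scheduling a chunk in the past.
    const startAt = Math.max(this.nextStartTime, currentTime);
    this.nextStartTime = startAt + chunkDuration;
    return startAt;
  }

  // Forgets queued audio, e.g. when the user interrupts the model.
  reset(): void {
    this.nextStartTime = 0;
  }
}
```

In a browser, `currentTime` would typically come from `AudioContext.currentTime`, and the returned value would be passed to `AudioBufferSourceNode.start()` so that consecutive chunks play back to back.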
- `components/`: UI components for visualization and camera management.
- `services/`: Core business logic and AI integration services.
- `docs/`: Project documentation.
- `types.ts`: TypeScript definitions for application interfaces and data structures.