🏗️ Architecture Overview

System Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Frontend      │    │   Backend        │    │   Memory        │
│   (Next.js)     │◄──►│   (FastAPI)      │◄──►│   (SQLite)      │
│   Port: 3000    │    │   Port: 8001     │    │   Database      │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         │              ┌──────────────────┐             │
         └─────────────►│   Playground     │◄────────────┘
                        │   (WebSocket)    │
                        │   Port: 8765     │
                        └──────────────────┘
                                 │
                    ┌─────────────────────────┐
                    │   Voice Processing      │
                    │   • STT (Whisper)       │
                    │   • TTS (Multi-Engine)  │
                    │   • LLM (LM Studio)     │
                    └─────────────────────────┘

Component Details

Frontend (Next.js 14)

  • Framework: React 18 with TypeScript
  • Styling: Tailwind CSS with custom cyberpunk theme
  • Animations: Framer Motion for 3D effects
  • Audio: Web Audio API for interactive sounds
  • State: React hooks with WebSocket integration

Key Pages:

  • / - Homepage with feature cards
  • /memory-playground - Live voice chat interface
  • /voice-clone - Voice cloning studio
  • /conversation - Multi-speaker conversations

Backend (FastAPI)

  • Framework: FastAPI with async/await
  • Database: SQLite with custom memory service
  • Voice Processing: Multi-engine TTS integration
  • GPU Acceleration: ONNX Runtime with CUDA
  • API Documentation: Auto-generated OpenAPI/Swagger

Core Services:

  • STTService - Whisper-based speech recognition
  • TTSService - Multi-engine text-to-speech
  • LLMService - LM Studio integration
  • VibeVoiceService - Voice cloning and synthesis
  • ConversationEngine - Multi-speaker dialogues

Memory Playground (WebSocket)

  • Protocol: WebSocket for real-time communication
  • Audio: PyAudio for recording/playback
  • Recognition: SpeechRecognition with Google API
  • Memory: Integration with SQLite memory service
  • Multi-Engine: Support for VibeVoice, KaniTTS, IndexTTS2

Memory System (SQLite)

  • Database: SQLite with conversation tables
  • Sessions: User session management
  • Context: Conversation history with search
  • Persistence: Cross-session memory retention
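
The memory layer described above could be backed by a schema along the following lines. This is a minimal sketch, not the project's actual schema: the table names `sessions` and `messages`, their columns, and the helper functions are assumptions made for illustration, and search is a plain substring match.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    id         TEXT PRIMARY KEY,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS messages (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL REFERENCES sessions(id),
    role       TEXT NOT NULL,
    content    TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_memory(path=":memory:"):
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def remember(conn, session_id, role, content):
    """Store one conversation turn, creating the session row if needed."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT OR IGNORE INTO sessions (id) VALUES (?)", (session_id,))
        conn.execute(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            (session_id, role, content),
        )


def recent_context(conn, session_id, limit=10):
    """Return the last `limit` turns of a session, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM messages "
        "WHERE session_id = ? ORDER BY id DESC LIMIT ?",
        (session_id, limit),
    ).fetchall()
    return rows[::-1]


def search(conn, term):
    """Find turns in any session whose text contains `term`."""
    return conn.execute(
        "SELECT session_id, role, content FROM messages "
        "WHERE content LIKE ? ORDER BY id",
        (f"%{term}%",),
    ).fetchall()
```

Passing a file path instead of `:memory:` gives the cross-session persistence listed above, since rows written in one run remain available to the next.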

Data Flow

Voice Chat Pipeline

User Voice Input
      ↓
WebSocket → Playground
      ↓
STT (Whisper) → Text
      ↓
Memory Service → Context
      ↓
LLM (LM Studio) → Response
      ↓
TTS (Multi-Engine) → Audio
      ↓
WebSocket → Frontend → User
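
The stages in this pipeline can be sketched as a chain of injected callables. The class and parameter names below are hypothetical, and the real Whisper, LM Studio, and TTS calls are not shown; the sketch only illustrates the order of operations and where conversation context enters.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


@dataclass
class VoiceChatPipeline:
    """Runs STT -> context lookup -> LLM -> TTS for one utterance.

    Every stage is a plain callable so a real engine can be swapped in
    without changing the control flow. All names here are assumptions.
    """

    stt: Callable[[bytes], str]            # audio -> transcript
    llm: Callable[[List[Turn], str], str]  # (context, transcript) -> reply
    tts: Callable[[str], bytes]            # reply -> audio
    history: List[Turn] = field(default_factory=list)
    context_turns: int = 6                 # how many past turns the LLM sees

    def handle_utterance(self, audio: bytes) -> bytes:
        text = self.stt(audio)
        # Context is taken before the new turn is recorded.
        context = self.history[-self.context_turns:]
        reply = self.llm(context, text)
        self.history.append(("user", text))
        self.history.append(("assistant", reply))
        return self.tts(reply)
```

In a deployment the in-memory `history` list would presumably be replaced by the SQLite memory service, and `handle_utterance` would be called from the WebSocket handler.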

Voice Cloning Workflow

Audio Recording
      ↓
WebSocket → Playground
      ↓
Backend API → VibeVoice
      ↓
Model Training → Voice Clone
      ↓
Database Storage → Voice Profile
      ↓
TTS Integration → Available Voice

Technology Stack

Core Technologies

  • Python 3.9+ - Backend services
  • Node.js 18+ - Frontend build system
  • TypeScript - Type-safe development
  • SQLite - Lightweight database
  • WebSocket - Real-time communication

AI/ML Stack

  • Whisper - Speech-to-text
  • VibeVoice - Voice cloning
  • LM Studio - Local LLM inference
  • ONNX Runtime - GPU acceleration
  • PyTorch - Deep learning framework

Voice Processing

  • PyAudio - Audio I/O
  • SpeechRecognition - STT wrapper
  • Pydub - Audio manipulation
  • SoundFile - Audio file handling
  • librosa - Audio analysis

Security Considerations

Data Protection

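  • Local Processing - Whisper STT, TTS, and LLM inference run locally; the Playground's SpeechRecognition path calls the Google API, so that path sends audio off the machine
  • Encrypted Storage - SQLite with optional encryption (stock SQLite does not encrypt; an extension such as SQLCipher is needed)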
  • Session Management - Secure WebSocket connections
  • File Validation - Audio file type checking
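
Audio file type checking is more reliable when it inspects the file's leading bytes rather than its extension. The sketch below is one way to do that, assuming uploads arrive as raw bytes; the function names and the allow-list are illustrative, not taken from the project.

```python
from typing import Optional


def sniff_audio_type(header: bytes) -> Optional[str]:
    """Guess the container format from the first bytes of a file.

    Returns a short label, or None if the signature is not recognised.
    This checks magic bytes only; it does not prove the file decodes.
    """
    if len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"fLaC":
        return "flac"
    if header[:4] == b"OggS":
        return "ogg"
    if header[:3] == b"ID3":
        return "mp3"  # MP3 with an ID3v2 tag; untagged MP3s are not detected
    return None


def is_allowed_upload(header: bytes, allowed=("wav", "flac")) -> bool:
    """Accept an upload only if its sniffed type is in the allow-list."""
    return sniff_audio_type(header) in allowed
```

Reading the first 12 bytes of an upload is enough for these checks, so validation can run before the whole file is buffered.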

API Security

  • CORS Configuration - Controlled cross-origin requests
  • Input Validation - Pydantic models for data validation
  • Error Handling - Secure error messages
  • Rate Limiting - Request throttling (configurable)
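
The document does not say how request throttling is implemented. A token bucket is one common approach and is shown here purely as an illustration; the class name, parameters, and the injectable clock are assumptions.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so the first burst passes
        self.clock = clock             # injectable for testing
        self.updated = clock()

    def allow(self) -> bool:
        """Return True and consume a token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

One bucket per client (keyed by session or IP address, for example) keeps a single noisy client from starving the others. The `rate` and `capacity` values are the knobs a configurable limiter would expose.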

Performance Optimization

GPU Acceleration

  • CUDA Support - RTX 5090 optimization
  • ONNX Models - Optimized inference
  • Model Caching - Prevent reloading
  • Batch Processing - Efficient GPU utilization
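
Model caching, as listed above, amounts to loading each model once and handing back the same object afterwards. A minimal sketch using the standard library follows; `get_model` and the dictionary it returns are placeholders for whatever loader and model handle the services actually use.

```python
from functools import lru_cache


@lru_cache(maxsize=4)
def get_model(name: str) -> dict:
    """Load a model once per name; later calls return the cached object.

    The dict stands in for a real model handle. `maxsize` bounds how many
    models stay resident, which matters when VRAM is limited.
    """
    return {"name": name}  # placeholder for an expensive load
```

When more than `maxsize` distinct models are requested, the least recently used entry is evicted, so the bound should match how many models fit in memory at once.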

Memory Management

  • Connection Pooling - Database connections
  • Audio Streaming - Chunked processing
  • Cache Strategy - Voice sample caching
  • Cleanup Routines - Temporary file management
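
Chunked processing keeps memory use flat because only one chunk is held at a time. The sketch below assumes raw PCM audio read from a binary stream; the function names and the 20 ms frame length in the usage note are illustrative choices, not values from the project.

```python
import io
from typing import Iterator


def pcm_frame_bytes(sample_rate: int, channels: int, sample_width: int, frame_ms: int) -> int:
    """Size in bytes of `frame_ms` milliseconds of raw PCM audio."""
    samples = sample_rate * frame_ms // 1000
    return samples * channels * sample_width


def iter_chunks(stream: io.BufferedIOBase, chunk_bytes: int) -> Iterator[bytes]:
    """Yield successive chunks from `stream`; the last one may be shorter."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:
            break
        yield chunk
```

For 16 kHz mono 16-bit audio, `pcm_frame_bytes(16000, 1, 2, 20)` gives 640 bytes per 20 ms frame, and each chunk from `iter_chunks` can be forwarded over the WebSocket as soon as it is read.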

Scalability

Horizontal Scaling

  • Microservices - Independent service scaling
  • Load Balancing - Multiple backend instances
  • Database Sharding - User-based partitioning
  • CDN Integration - Static asset delivery
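
User-based partitioning needs a mapping from user to shard that stays the same across processes and restarts. The sketch below uses a cryptographic hash for that reason (Python's built-in `hash()` is randomized per process for strings). The function names, the shard count, and the file naming scheme are assumptions.

```python
import hashlib


def shard_for_user(user_id: str, shard_count: int) -> int:
    """Map a user id to a shard index in a stable, evenly spread way."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a shard.
    return int.from_bytes(digest[:8], "big") % shard_count


def database_path(user_id: str, shard_count: int = 4, prefix: str = "memory_shard") -> str:
    """Pick the SQLite file that holds this user's data (hypothetical naming)."""
    return f"{prefix}_{shard_for_user(user_id, shard_count)}.db"
```

Plain modulo hashing moves most users to a different shard when `shard_count` changes. If shards are expected to be added over time, consistent hashing is the usual alternative.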

Vertical Scaling

  • GPU Scaling - Multi-GPU support
  • Memory Optimization - Efficient data structures
  • CPU Utilization - Async processing
  • Storage Optimization - Compressed audio formats
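
Async processing helps CPU utilization only if blocking work is kept off the event loop. One way to do that on Python 3.9+ is `asyncio.to_thread`, sketched below; `synthesize_blocking` is a stand-in for a blocking engine call and is not part of the project.

```python
import asyncio
import time
from typing import List


def synthesize_blocking(text: str) -> bytes:
    """Stand-in for a blocking TTS call (hypothetical)."""
    time.sleep(0.05)  # simulates time spent inside the engine
    return text.upper().encode("utf-8")


async def synthesize_many(texts: List[str]) -> List[bytes]:
    """Run the blocking calls in worker threads and await them together.

    The event loop stays free to serve other requests while the threads
    work; results come back in the same order as `texts`.
    """
    tasks = [asyncio.to_thread(synthesize_blocking, t) for t in texts]
    return await asyncio.gather(*tasks)
```

Threads suit calls that release the GIL, such as I/O waits and most native inference libraries. Pure-Python CPU-bound work would need a process pool instead.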

Monitoring & Logging

Application Monitoring

  • Health Checks - Service availability
  • Performance Metrics - Response times
  • Error Tracking - Exception monitoring
  • Usage Analytics - Feature utilization
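
A basic availability check can simply test whether each service accepts TCP connections. The ports below come from the architecture diagram; the host, the timeout, and the function names are assumptions, and an open port shows only that something is listening, not that the service is healthy.

```python
import socket

# Ports from the architecture diagram; running on localhost is an assumption.
SERVICES = {"frontend": 3000, "backend": 8001, "playground": 8765}


def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def check_services(host: str = "127.0.0.1", services=SERVICES) -> dict:
    """Report reachability for each named service."""
    return {name: port_is_open(host, port) for name, port in services.items()}
```

A deeper check would call an application-level endpoint on each service so that a hung process with an open socket is also caught.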

System Monitoring

  • GPU Utilization - CUDA metrics
  • Memory Usage - RAM and VRAM tracking
  • Disk I/O - Storage performance
  • Network Latency - WebSocket performance

This architecture is designed for enterprise-scale voice AI applications that need real-time performance and GPU acceleration.