Discussion on #2 - Multimodal AI and Agent API Eval Framework #1226
Replies: 19 comments 22 replies
-
|
Hi everyone! I've been thinking about the architectural constraints for this framework, specifically how to handle massive data rendering and long-running Python execution without freezing the UI or bloating the core Flutter application. Based on the repo's history (especially the GSoC 2025 SDUI limits and the flutter_rust_bridge iOS/Android compilation issues), I've put together a proposal for a decoupled Tauri + React (Vite) companion subsystem, backed by a FastAPI execution engine.
Some things I came across:
Why Tauri over Electron/Next.js: It keeps the RAM overhead low, in line with API Dash's lightweight desktop philosophy, while still giving us React's virtual DOM to render massive JSON evaluation traces safely.
SSE Streaming: Running 1,000-prompt benchmarks via lm-harness is likely to time out standard REST calls. Wrapping the Python scripts in FastAPI allows us to use Server-Sent Events (SSE) to stream the logs to the React UI in real time.
Tabbed UX: I've separated the UI into an "Execution Tab" (for live streaming logs) and an "Analysis Tab" (for the heavy charts/tables) so the app stays responsive.
I've drawn up the sequence diagrams and UI mockups and submitted them in my draft PR here: #1136
I would love to hear any critique of this strategy if anyone sees flaws or anything that could be done better. |
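To make the SSE idea above concrete, here is a minimal sketch of how log lines could be framed as Server-Sent Events using only the standard library. The event names `log` and `done`, the JSON payload shape, and the helper names `sse_event` and `stream_run` are assumptions for illustration and are not taken from the proposal or the draft PR.

```python
import json
from typing import Iterable, Iterator, Optional


def sse_event(data: str, event: Optional[str] = None) -> str:
    """Format one Server-Sent Events frame (hypothetical helper)."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A multi-line payload needs one "data:" field per line.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


def stream_run(log_lines: Iterable[str]) -> Iterator[str]:
    """Turn benchmark log lines into SSE frames, ending with a 'done' event."""
    for seq, line in enumerate(log_lines, start=1):
        payload = json.dumps({"seq": seq, "line": line.rstrip("\n")})
        yield sse_event(payload, event="log")
    yield sse_event(json.dumps({"status": "finished"}), event="done")
```

In a FastAPI app, a generator like `stream_run` could be handed to `StreamingResponse` with `media_type="text/event-stream"`; that wiring is left out here so the sketch stays dependency-free.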
-
|
My name is Mohamed Salah, and I am an AI Engineering student with a strong software engineering background. I am writing to express my strong interest in contributing to the “End-to-End AI and Agent API Evaluation Framework” as part of Google Summer of Code (GSoC). |
-
|
Hey everyone! 👋 I'm Armaan Saxena, a 3rd-year IT student with experience in React, TypeScript, Node.js, and Python. I'm interested in GSoC 2026 Idea #2, Multimodal AI & Agent API Eval Framework (350 hrs). I've gone through the codebase, contribution guide, and the idea discussion thread (#1226).
My thinking so far:
→ The eval framework can be split into a Python orchestration layer (benchmark runners like lm-harness/lighteval) + a React/TypeScript dashboard UI for configuring requests, uploading datasets, and visualizing results in real time.
→ For multimodal support, I'd start with text + image evaluation first, then extend to voice and agents.
A quick architecture question before I go deeper: Are you envisioning the Python benchmark runner as a separate microservice that the React UI polls via REST/WebSocket, or should it be more tightly coupled with the core app?
I'll be joining the Weekly Connect call to discuss further. Looking forward to contributing! 🚀 |
-
|
Hi @animator! I'm a 2nd-year IT student with experience in Python, REST APIs, LLM evaluation, and agent-based architectures. I'm very interested in GSoC 2026 Idea #2 - Multimodal AI and Agent API Eval Framework. After going through the thread and the project scope, here's where I think I can contribute most effectively:
Backend evaluation layer - integrating benchmark runners like lm-harness and lighteval into a Python orchestration backend, with REST/SSE endpoints to stream results to the frontend in real time.
Agent evaluation - from working with agent-based architectures, I understand the challenge of evaluating multi-step workflows where intermediate steps matter as much as final outputs. I'd like to make agent eval a first-class citizen in the framework, not an afterthought.
LLM API abstraction - building a unified request layer that handles text, image, and voice modalities consistently across different providers.
Looking forward to the weekly calls and contributing! |
-
|
Hi @animator, I'm a 2nd year Artificial Intelligence student and I'm very interested in contributing to GSoC 2026 Idea #2 – Multimodal AI and Agent API Evaluation Framework. My primary interest is in Agentic AI systems. I have been working on agent-based architectures and LLM-powered workflows, focusing on how autonomous agents plan tasks, use tools, and produce multi-step outputs. Because of this experience, I’m particularly interested in contributing to the agent evaluation component, where evaluating intermediate reasoning steps, tool calls, and workflow reliability is just as important as the final response. In addition, I have experience with Python and backend development, which would allow me to contribute to building evaluation pipelines, integrating model APIs, and developing backend components for running and managing evaluation tasks. I have also built an Agentic AI project, and my work is available on my GitHub. I would love to learn from the community, join the weekly discussions, and actively contribute to improving the evaluation framework. Looking forward to collaborating! |
-
|
Hi @animator, this is my cv link, and I am a 3rd-year computer science student.
Proposed 'Lite' Asynchronous Architecture:
Backend: Instead of Redis/Celery, I suggest using Python's native subprocess or multiprocessing modules to trigger benchmarks like lm-harness. This keeps the installation footprint small.
Real-time Logs: If a heavy WebSocket setup is not wanted, we could use Server-Sent Events (SSE). This allows the Python backend to 'push' benchmarking logs directly to the React/TypeScript frontend over a standard HTTP connection.
State Management: I plan to use a local SQLite instance (zero-config) to track the history and status of evaluation jobs.
My goal is to ensure that a developer can get the framework running with a simple pip install and npm install without needing to configure external services. I am currently mapping out the TypeScript interfaces for the 'Multimodal Input' objects (Text/Image/Audio) to ensure we have a unified data contract. Does this 'Dependency-Lite' direction align with the vision for the shipped product? |
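As a rough illustration of the 'Dependency-Lite' backend described above, the sketch below runs a command with `subprocess` and records its status in SQLite, using only the standard library. The `jobs` table layout, the status strings, and the function names are assumptions; the command would be whatever benchmark invocation the project settles on, and a trivial Python one-liner can stand in for it while testing.

```python
import sqlite3
import subprocess
from typing import List


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) job-tracking table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "command TEXT NOT NULL, "
        "status TEXT NOT NULL, "
        "exit_code INTEGER, "
        "output TEXT)"
    )
    conn.commit()


def run_job(conn: sqlite3.Connection, command: List[str]) -> int:
    """Run a command to completion and persist its status; returns the job id."""
    cur = conn.execute(
        "INSERT INTO jobs (command, status) VALUES (?, ?)",
        (" ".join(command), "running"),
    )
    job_id = cur.lastrowid
    conn.commit()

    # Blocking call for simplicity; a real runner would likely stream output.
    proc = subprocess.run(command, capture_output=True, text=True)
    status = "succeeded" if proc.returncode == 0 else "failed"

    conn.execute(
        "UPDATE jobs SET status = ?, exit_code = ?, output = ? WHERE id = ?",
        (status, proc.returncode, proc.stdout, job_id),
    )
    conn.commit()
    return job_id
```

This version blocks until the process exits, which keeps the example short. Streaming logs while the job runs would need `subprocess.Popen` and incremental reads instead.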
-
|
Building on the architecture discussion, I have a specific query regarding Multimodal Data Handling for Project #2. Since we are supporting Voice and Image AI models, the framework will need to handle larger data inputs than standard text. Should we prioritize:
Local File Uploads: Allowing users to upload datasets directly from their machine to the local Python evaluator?
Remote URL Support: Prioritizing datasets hosted on cloud storage or Hugging Face to keep the local installation 'lite'?
I've worked with NASA Kepler data parsing and understand the importance of efficient data pipelines, so I want to ensure the UI handles these assets without slowing down the user experience. |
-
|
Hi @animator, I have a question regarding the backend implementation for Project #2 - Multimodal AI and Agent API Eval Framework. From the project description, it seems that the backend is expected to be implemented in Python, and the evaluation framework itself will likely be built from scratch. I'm wondering whether it would be acceptable to implement the backend in Rust instead, mainly for better performance, memory safety, and concurrency, while still exposing the required interfaces for the evaluation framework (for example via REST/SSE or other compatible APIs). In other words, as long as the backend integrates correctly with the framework's expected APIs and supports running the evaluation pipelines and benchmarks, would using Rust instead of Python be acceptable? I'd appreciate clarification on whether the language choice is flexible or if Python is a strict requirement for this project. |
-
|
I have a question about Agent Logic: |
-
|
Hi everyone @animator 👋 I explored the idea of building an end-to-end AI and Agent API evaluation framework, and I'd like to share some thoughts on how such a system could be designed.
The main goal seems to be creating a unified platform where developers can benchmark and evaluate AI models and agents across multiple providers and modalities (text, image, voice). Instead of manually writing scripts for evaluation, users could configure tests through a UI and run standardized benchmarks. One approach would be to structure the system around three main layers: evaluation engine, API orchestration layer, and visualization dashboard.
For the backend, Python would be well suited since most evaluation tools and ML ecosystems already exist there. The evaluation engine could integrate with frameworks such as lm-harness and lighteval to run standard benchmarks. The system would load datasets, send prompts to different model APIs, collect responses, and compute metrics such as accuracy, BLEU, ROUGE, latency, cost, and pass/fail rates.
Above this, an API orchestration layer could handle interactions with different AI providers. This layer would normalize API requests so that the same evaluation task can run across multiple services (for example OpenAI-style APIs, local models, or hosted inference services). It would also support agent evaluations by allowing tool-call simulations and multi-step workflows.
For the frontend, a React + TypeScript interface could provide a dashboard where users can configure evaluation runs. The UI might include sections for selecting a benchmark, uploading or defining custom datasets, configuring model parameters (temperature, max tokens, etc.), and selecting which APIs or models to test. Users could then launch evaluation runs directly from the interface.
The results dashboard would visualize metrics in a clear way: for example, leaderboards comparing models across benchmarks, charts showing accuracy or latency trends, detailed logs of model outputs, and side-by-side comparisons of responses. For agent evaluations, it could also display tool usage traces or step-by-step reasoning flows.
To support multimodal evaluation, the framework could define different evaluation adapters. Text models would run standard NLP benchmarks, image models could be evaluated using captioning or classification tasks, and voice models could be tested using speech-to-text or text-to-speech benchmarks. Each modality would share the same core pipeline but use specialized metrics.
Another useful feature would be a plugin-based benchmark system. This would allow users to add new tasks or datasets without modifying the core framework. Benchmarks could define their dataset, evaluation logic, and scoring metrics, making the system extensible.
A typical workflow might look like this: the user selects a benchmark or uploads a dataset, chooses one or more model APIs, configures parameters, and starts an evaluation run. The backend distributes the requests, collects responses, calculates metrics, and sends the results back to the UI, where they are visualized in dashboards and comparison tables.
Some possible future improvements could include experiment tracking, dataset versioning, reproducibility settings, and support for automated regression testing of models.
I would be interested in prototyping a minimal evaluation pipeline (dataset → API requests → scoring → visualization) to validate the architecture and gradually extend it with more benchmarks and modalities. Looking forward to feedback and suggestions on this approach. |
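The plugin-based benchmark idea and the dataset → requests → scoring pipeline described above could look roughly like the sketch below. All names here (`Benchmark`, `REGISTRY`, `register`, `run`) are hypothetical, and the "model" is any callable from prompt to response, standing in for a real API client.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]          # (prompt, expected answer)
ModelFn = Callable[[str], str]     # stand-in for a call to a model API


@dataclass
class Benchmark:
    """A plugin bundles its dataset with its own scoring function."""
    name: str
    examples: List[Example]
    score: Callable[[str, str], float]  # (prediction, expected) -> value in [0, 1]


# Hypothetical in-process registry; the core never needs to know plugin details.
REGISTRY: Dict[str, Benchmark] = {}


def register(benchmark: Benchmark) -> None:
    """Add a benchmark plugin, refusing duplicate names."""
    if benchmark.name in REGISTRY:
        raise ValueError(f"benchmark already registered: {benchmark.name}")
    REGISTRY[benchmark.name] = benchmark


def run(name: str, model: ModelFn) -> float:
    """Send every prompt to the model and return the mean score."""
    bench = REGISTRY[name]
    scores = [bench.score(model(prompt), expected)
              for prompt, expected in bench.examples]
    return sum(scores) / len(scores) if scores else 0.0
```

Because each plugin carries its own `score` function, a new task or modality can be added by calling `register` without touching `run`.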
-
|
Hi @animator, I'm James from Nigeria, ML engineer with experience in LLM evaluation pipelines. Based on the thread I understand the core goal is making multimodal API output testing intuitive — not agent evaluation. My approach is a Flutter-native eval tab with a Python backend wrapping lm-harness/lighteval, supporting local and remote datasets. One question: for image evaluation specifically, are you thinking captioning/VQA benchmarks or custom user-defined test cases? |
-
|
Hi! I'm really interested in this idea. I’ve recently been working on building a benchmarking and evaluation system for eye-tracking APIs, where I designed an evaluation pipeline, implemented statistical metrics, and added reporting (including PDF-based summaries). This gave me hands-on experience with structuring evaluation workflows and handling model performance metrics. For this project, I’m particularly interested in:
I was thinking of structuring the system into:
I’d love to know:
Looking forward to contributing! |
-
|
Hi @animator and team, |
-
|
Hi everyone, I am Balu Kodeboyina, a Computer Science student from India. I am interested in the Idea #2 – Multimodal AI & Agent API Eval Framework project for GSoC 2026. My background is in AI/ML and LLM systems, and I have experience with Python. I have started exploring the repository and contribution guide. Could you please suggest beginner-friendly issues to get started? Looking forward to contributing to API Dash. |
-
|
Hi @animator, I'm Abdelrahman, a 3rd-year Communication and Information Engineering student at Zewail City. I've submitted my proposal PR and wanted to share my technical thinking here. I went with a Flutter/Dart frontend + Python (FastAPI) backend for this. A few specific decisions I landed on after reading through this thread:
On streaming: SSE over WebSocket for progress updates from the eval runner. It is simpler to implement, and no bidirectional channel is needed since the client only receives run progress and does not send anything back mid-run.
On dataset handling: Supporting both local file uploads and remote URLs (glad to see that's confirmed ✅). For the data model I'm thinking a simple normalized schema:
On metrics: Starting with exact match, BLEU, and ROUGE for text. For image captioning tasks, BLEU/ROUGE still apply. Custom metric scripts as a plugin so users aren't locked into the defaults.
One question: For the eval runner integration with lm-harness/lighteval, are you expecting contributors to wrap the existing CLI tools as subprocesses, or build a proper Python API integration against their library interfaces? The subprocess approach is simpler to get working, but the library approach would give cleaner programmatic control over results.
Proposal PR: #1430 |
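For the text metrics mentioned above, a minimal sketch of exact match plus a simplified unigram-overlap F1 (in the spirit of ROUGE-1, not a replacement for an established BLEU/ROUGE implementation) might look like this. The normalization (lowercasing and whitespace tokenization) and the function names are assumptions chosen for brevity.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def unigram_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall (ROUGE-1-style, simplified)."""
    pred_counts = Counter(normalize(prediction))
    ref_counts = Counter(normalize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Both functions share the `(prediction, reference) -> float` shape, so a user-supplied custom metric could be dropped in alongside them with the same signature.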
-
|
Hi @animator, I'm an IT student currently building AI-integrated Flutter apps. I've developed projects like 'VibeCheck AI' (real-time multimodal mood detection using Groq/MediaPipe) and 'AI Sentinel Ultra'. I'm very interested in Idea #2. Since API Dash is already in Flutter, I'm planning to propose a Flutter-native UI for the Eval Framework with a Python/FastAPI backend to handle benchmarks. Working on my proposal doc now! |
-
|
Hi! I’m Shivam Tyagi. I’m also very interested in Idea #2 (Multimodal AI & Agent API Eval Framework). I wanted to ask:
Also, I’d love to start contributing — could you suggest some beginner-friendly issues aligned with this idea? Looking forward to contributing! |
-
|
Hi @animator and the API Dash team 👋 At EduSpine (Technical Lead), I built production Python/Flask backends handling 500+ concurrent AI API requests and designed custom evaluation pipelines using RAGAS to measure hallucination across RAG configurations.
Specific questions before I finalize my proposal: For the UI layer, does the team prefer the eval framework UI to be built inside the existing Flutter/Dart codebase, or is a separate React/TypeScript companion web app acceptable?
I've already drafted a detailed 12-week proposal and would love feedback before submission. Happy to share a draft here or discuss on Discord in the #gsoc-foss-apidash channel. |
-
|
Agent API evaluation needs different dimensions than traditional API testing. A few additions to consider:
1. Cost-per-task as a first-class metric: Two agents might both complete a task correctly, but one costs $0.003 and the other costs $0.15. The eval framework should report cost alongside accuracy. We track costs in millicents (1/100,000 of a dollar) for precision.
2. Retry rate and self-correction efficiency: An agent that gets it right on the first try at $0.01 is better than one that retries 3 times to get the same result at $0.03 total. Track both first-attempt accuracy and cost-adjusted accuracy.
3. Delegation chain depth: For multi-agent evaluations, measure how deep the delegation chain goes. An agent that solves a task by delegating to 5 sub-agents creates more latency, more cost, and more failure points than one that handles it directly. Deeper is not better unless the task genuinely requires specialization.
4. Context efficiency: How much of the context window does the agent actually use productively? An agent that stuffs 100K tokens of context but only references 5K tokens is wasting money. We measure "context density": the ratio of referenced tokens to total context tokens.
5. Behavioral consistency: Run the same evaluation multiple times and measure variance. An agent that scores 95% one run and 60% the next is less useful than one that consistently scores 80%.
More on these metrics in practice: https://blog.kinthai.ai/agent-wallet-economic-models-autonomous-agents |
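Several of the metrics listed above reduce to simple arithmetic, sketched below with the standard library. The `TaskRun` record, its field names, and the choice of "cost per successful task" as one possible reading of cost-adjusted accuracy are assumptions for illustration. The comment does not define the exact formulas beyond context density being referenced tokens over total context tokens.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import List

MILLICENTS_PER_DOLLAR = 100_000  # 1 millicent = 1/100,000 of a dollar


@dataclass
class TaskRun:
    """Hypothetical record of one agent attempt at one task."""
    succeeded: bool
    attempts: int
    cost_millicents: int
    context_tokens: int
    referenced_tokens: int


def dollars_to_millicents(dollars: float) -> int:
    """Convert to integer millicents to avoid accumulating float error."""
    return round(dollars * MILLICENTS_PER_DOLLAR)


def first_attempt_accuracy(runs: List[TaskRun]) -> float:
    """Share of runs that succeeded without any retry."""
    return sum(r.succeeded and r.attempts == 1 for r in runs) / len(runs)


def cost_per_success(runs: List[TaskRun]) -> float:
    """Total spend divided by successful tasks (assumed cost-adjusted measure)."""
    successes = sum(r.succeeded for r in runs)
    total = sum(r.cost_millicents for r in runs)
    return total / successes if successes else float("inf")


def context_density(run: TaskRun) -> float:
    """Ratio of referenced tokens to total context tokens."""
    return run.referenced_tokens / run.context_tokens if run.context_tokens else 0.0


def score_spread(scores: List[float]) -> float:
    """Population standard deviation of scores across repeated runs."""
    return pstdev(scores)
```

With the figures from the comment, $0.003 and $0.15 become 300 and 15,000 millicents, and an agent that references 5K of 100K context tokens has a context density of 0.05.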

-
Develop an end-to-end AI and Agent API eval framework which should (list is suggestive, not exhaustive):
Provide an intuitive interface to run AI benchmarks on tools (like lm-harness, lighteval).
Provide a UI interface for configuring AI API requests, where users can input test/custom datasets, configure request parameters, send queries to various AI API services and view the eval results.
Support evaluation of voice, image, text AI Models and AI Agents (via API interface) across various task benchmarks.
Backend: Python
Frontend: React/Node/TypeScript or Dart/Flutter
Please feel free to discuss your ideas/research below.