Discussion on #2 - Multimodal AI and Agent API Eval Framework #1226

animator · 2026-03-01T02:19:36Z

animator
Mar 1, 2026
Maintainer

Develop an end-to-end AI and Agent API eval framework which should (list is suggestive, not exhaustive):

Provide an intuitive interface to run AI benchmarks on tools (like lm-harness, lighteval).
Provide a UI for configuring AI API requests, where users can input test/custom datasets, configure request parameters, send queries to various AI API services, and view the eval results.
Support evaluation of voice, image, text AI Models and AI Agents (via API interface) across various task benchmarks.

Backend: Python
Frontend: React/Node/TypeScript or Dart/Flutter

Please feel free to discuss your ideas/research below.

Spark960 · 2026-03-01T13:49:17Z

Spark960
Mar 1, 2026

Hi everyone!

I’ve been thinking about the architectural constraints for this framework, specifically focusing on how to handle the massive data rendering and long-running Python execution without freezing the UI or bloating the core Flutter application.

Based on the repo's history (especially the GSoC 2025 SDUI limits and the flutter_rust_bridge iOS/Android compilation issues), I've put together a proposal for a decoupled Tauri + React (Vite) companion subsystem, backed by a FastAPI execution engine.

Some things I came across:

Why Tauri over Electron/Next.js: It keeps the RAM overhead extremely low to respect API Dash's lightweight desktop philosophy, while still giving us React's virtual DOM to render massive JSON evaluation traces safely.

SSE Streaming: Running 1,000-prompt benchmarks via lm-harness will time out standard REST calls. Wrapping the Python scripts in FastAPI allows us to use Server-Sent Events (SSE) to stream the logs to the React UI in real-time.

Tabbed UX: I've separated the UI into an "Execution Tab" (for live streaming logs) and an "Analysis Tab" (for the heavy charts/tables) so the app stays responsive.
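To make the SSE point above concrete, here is a minimal sketch using only the Python standard library. It shows how log lines from a long benchmark run could be framed in the `text/event-stream` wire format. The names `format_sse` and `stream_logs` and the `log`/`done` event names are hypothetical, not part of the proposal. In a FastAPI app, a generator like this would typically be handed to a streaming response with the `text/event-stream` media type rather than printed.

```python
import json
from typing import Iterable, Iterator, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Frame one message in the text/event-stream wire format."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A multi-line payload is sent as one "data:" field per line.
    for part in data.splitlines() or [""]:
        lines.append(f"data: {part}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def stream_logs(log_lines: Iterable[str]) -> Iterator[str]:
    """Yield each log line as a 'log' event, then a final 'done' event."""
    for seq, line in enumerate(log_lines):
        yield format_sse(json.dumps({"seq": seq, "line": line}), event="log")
    yield format_sse("{}", event="done")


if __name__ == "__main__":
    # Stand-in log lines; a real run would read them from the benchmark process.
    for chunk in stream_logs(["loading dataset", "prompt 1/1000 ok"]):
        print(chunk, end="")
```

Because each event is self-delimiting, the UI can append log lines as they arrive instead of waiting for the whole run to finish.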

I've drawn up the sequence diagrams and UI mockups and submitted them in my draft PR here: #1136

Would love to hear any critique on this strategy if anyone sees any flaws or anything that can be done better.

1 reply

animator Mar 7, 2026
Maintainer Author

The PR has been reviewed and feedback has been provided.

mohamedahmedsalah002 · 2026-03-01T14:18:46Z

mohamedahmedsalah002
Mar 1, 2026

My name is Mohamed Salah, and I am an AI Engineering student with a strong software engineering background. I am writing to express my strong interest in contributing to the “End-to-End AI and Agent API Evaluation Framework” as part of Google Summer of Code (GSoC).
This project strongly aligns with my technical experience and interests. Building a unified framework to evaluate AI models and agent-based systems across text, image, and voice modalities—while supporting multiple benchmarking tools and real-world API workflows—is an area I am deeply motivated to work on.
Based on my background, I believe I can contribute effectively to this project:
Strong proficiency in Python and backend system design
Hands-on experience with LLM evaluation, RAG pipelines, and agent-based architectures
Experience integrating and orchestrating AI tools and APIs, including request configuration, dataset-driven testing, and result analysis
Familiarity with LangChain/LangGraph, multi-step AI workflows, and agent evaluation concepts
Practical experience building REST APIs, dashboards, and developer-facing tooling
Strong focus on clean architecture, extensibility, testing, and documentation
I am particularly interested in:
Designing a modular backend that can integrate existing evaluation tools (e.g., lm-harness, lighteval)
Building an intuitive UI for configuring AI/Agent API requests, datasets, and evaluation parameters
Supporting multi-modal evaluations (text, image, voice) through a unified API abstraction
Enabling consistent benchmarking and comparison across different AI providers and agent workflows
Delivering production-quality documentation and test coverage suitable for long-term community adoption
I am fully committed to a Google Summer of Code project and motivated to invest the required time and effort to deliver a high-quality, end-to-end evaluation framework. I would greatly appreciate the opportunity to discuss this project further and understand how I can best contribute.
Thank you very much for your time and consideration.

1 reply

animator Mar 7, 2026
Maintainer Author

I was unable to get much clarity on what you are proposing from this comment. I recommend sending an idea doc with more details.

Armaansaxena · 2026-03-01T14:57:56Z

Armaansaxena
Mar 1, 2026

Hey everyone! 👋 I'm Armaan Saxena, a 3rd Year IT student with experience in React, TypeScript, Node.js, and Python.

I'm interested in GSoC 2026 Idea #2 — Multimodal AI & Agent API Eval Framework (350 hrs).

I've gone through the codebase, contribution guide, and the idea discussion thread (#1226). My thinking so far:

→ The eval framework can be split into a Python orchestration layer (benchmark runners like lm-harness/lighteval) + a React/TypeScript dashboard UI for configuring requests, uploading datasets, and visualizing results in real time.

→ For multimodal support, I'd start with text + image evaluation first, then extend to voice and agents.

A quick architecture question before I go deeper: Are you envisioning the Python benchmark runner as a separate microservice that the React UI polls via REST/WebSocket, or should it be more tightly coupled with the core app?

I'll be joining the Weekly Connect call to discuss further. Looking forward to contributing! 🚀

1 reply

animator Mar 7, 2026
Maintainer Author

You can send an idea doc with more details for the current approach.

Saad-Mallebhari · 2026-03-05T16:38:23Z

Saad-Mallebhari
Mar 5, 2026

Hi @animator!

I'm a 2nd year IT student with experience in Python, REST APIs, LLM evaluation, and agent-based architectures. I'm very interested in GSoC 2026 Idea #2 - Multimodal AI and Agent API Eval Framework.

After going through the thread and the project scope, here's where I think I can contribute most effectively:

Backend evaluation layer - integrating benchmark runners like lm-harness and lighteval into a Python orchestration backend, with REST/SSE endpoints to stream results to the frontend in real time.

Agent evaluation - from working with agent-based architectures, I understand the challenge of evaluating multi-step workflows where intermediate steps matter as much as final outputs. I'd like to focus on making agent eval a first-class citizen in the framework, not just an afterthought.

LLM API abstraction - building a unified request layer that handles text, image, and voice modalities consistently across different providers

Looking forward to the weekly calls and contributing!

1 reply

animator Mar 7, 2026
Maintainer Author

Agent evaluation for multimodal models would be good, but currently the focus is to make it easier for people to test their multimodal model/API outputs.

minaamulhaq · 2026-03-07T07:09:18Z

minaamulhaq
Mar 7, 2026

Hi @animator,

I'm a 2nd year Artificial Intelligence student and I'm very interested in contributing to GSoC 2026 Idea #2 – Multimodal AI and Agent API Evaluation Framework.

My primary interest is in Agentic AI systems. I have been working on agent-based architectures and LLM-powered workflows, focusing on how autonomous agents plan tasks, use tools, and produce multi-step outputs. Because of this experience, I’m particularly interested in contributing to the agent evaluation component, where evaluating intermediate reasoning steps, tool calls, and workflow reliability is just as important as the final response.

In addition, I have experience with Python and backend development, which would allow me to contribute to building evaluation pipelines, integrating model APIs, and developing backend components for running and managing evaluation tasks.

I have also built an Agentic AI project, and my work is available on my GitHub.

I would love to learn from the community, join the weekly discussions, and actively contribute to improving the evaluation framework.

Looking forward to collaborating!

1 reply

animator Mar 7, 2026
Maintainer Author

Agentic AI is a separate project. This thread is to discuss Project #2. You can send in an idea doc with more details of your approach for review.

YahyaElKawas · 2026-03-07T23:38:08Z

YahyaElKawas
Mar 7, 2026

Hi @animator, this is my CV link, and I am a 3rd-year computer science student.
Following up from my previous query (#1137), I’ve refined the proposed architecture for Project #2 to align with the goal of minimizing end-user dependencies.

Proposed 'Lite' Asynchronous Architecture:

Backend: Instead of Redis/Celery, I suggest using Python's native subprocess or multiprocessing modules to trigger benchmarks like lm-harness. This keeps the installation footprint small.

Real-time Logs: To avoid a heavy WebSocket setup if preferred, we could use Server-Sent Events (SSE). This allows the Python backend to 'push' benchmarking logs directly to the React/TypeScript frontend over a standard HTTP connection.

State Management: I plan to use a local SQLite instance (zero-config) to track the history and status of evaluation jobs.
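The three points above can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not the proposed implementation: the `jobs` table layout and the helper names `init_db`, `run_job`, and `job_status` are hypothetical, and a short Python one-liner stands in for a real benchmark command.

```python
import sqlite3
import subprocess
import sys
from typing import List


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) job-history table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "command TEXT NOT NULL, "
        "status TEXT NOT NULL, "
        "output TEXT)"
    )
    conn.commit()


def run_job(conn: sqlite3.Connection, command: List[str]) -> int:
    """Run one command as a child process and record its status in SQLite."""
    cur = conn.execute(
        "INSERT INTO jobs (command, status) VALUES (?, ?)",
        (" ".join(command), "running"),
    )
    job_id = cur.lastrowid
    conn.commit()
    # Blocking call for simplicity; a server would run this off the request path.
    result = subprocess.run(command, capture_output=True, text=True)
    status = "succeeded" if result.returncode == 0 else "failed"
    conn.execute(
        "UPDATE jobs SET status = ?, output = ? WHERE id = ?",
        (status, result.stdout + result.stderr, job_id),
    )
    conn.commit()
    return job_id


def job_status(conn: sqlite3.Connection, job_id: int) -> str:
    row = conn.execute("SELECT status FROM jobs WHERE id = ?", (job_id,)).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # a file path would persist history
    init_db(conn)
    # Stand-in for a benchmark CLI invocation.
    jid = run_job(conn, [sys.executable, "-c", "print('benchmark finished')"])
    print(jid, job_status(conn, jid))
```

Nothing here needs an external service, which is the point of the "lite" direction. The trade-off is that queuing, retries, and cancellation would have to be written by hand.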

My goal is to ensure that a developer can get the framework running with a simple pip install and npm install without needing to configure external services. I am currently mapping out the TypeScript interfaces for the 'Multimodal Input' objects (Text/Image/Audio) to ensure we have a unified data contract.

Does this 'Dependency-Lite' direction align with the vision for the shipped product?

1 reply

animator Mar 22, 2026
Maintainer Author

@YahyaElKawas Please follow the application guide and send proposal doc PR.

YahyaElKawas · 2026-03-08T23:07:48Z

YahyaElKawas
Mar 8, 2026

Building on the architecture discussion, I have a specific query regarding Multimodal Data Handling for Project #2.

Since we are supporting Voice and Image AI models, the framework will need to handle larger data inputs than standard text. Should we prioritize:

Local File Uploads: Allowing users to upload datasets directly from their machine to the local Python evaluator?

Remote URL Support: Prioritizing datasets hosted on cloud storage or Hugging Face to keep the local installation 'lite'?

I've worked with NASA Kepler data parsing and understand the importance of efficient data pipelines, so I want to ensure the UI handles these assets without slowing down the user experience.

1 reply

animator Mar 12, 2026
Maintainer Author

Local File Uploads ✅
Remote URL support ✅

KarimmYasser · 2026-03-09T22:05:42Z

KarimmYasser
Mar 9, 2026

Hi @animator ,

I have a question regarding the backend implementation for Project #2 - Multimodal AI and Agent API Eval Framework.

From the project description, it seems that the backend is expected to be implemented in Python, and the evaluation framework itself will likely be built from scratch.

I’m wondering whether it would be acceptable to implement the backend in Rust instead, mainly for better performance, memory safety, and concurrency, while still exposing the required interfaces for the evaluation framework (for example via REST/SSE or other compatible APIs).

In other words, as long as the backend integrates correctly with the framework’s expected APIs and supports running the evaluation pipelines and benchmarks, would using Rust instead of Python be considered acceptable?

I’d appreciate clarification on whether the language choice is flexible or if Python is a strict requirement for this project.

1 reply

animator Mar 12, 2026
Maintainer Author

Please join our weekly discussion calls to ask the mentors your questions directly.

YahyaElKawas · 2026-03-10T23:51:22Z

YahyaElKawas
Mar 10, 2026

I have a question about Agent Logic:
Since AI Agents often require multi-turn interactions, how should the framework handle 'Session State'? I propose an architecture that captures the full conversation trace for evaluation, rather than just the final output. Is that the right direction?
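As a rough illustration of what capturing the full trace could mean, the sketch below records every turn of a session, including tool calls, and can still return the final output on its own. The `Turn` and `Trace` classes and their fields are hypothetical and only show the shape of the data.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class Turn:
    role: str      # e.g. "user", "assistant", "tool"
    content: str
    tool_call: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Trace:
    session_id: str
    turns: List[Turn] = field(default_factory=list)

    def add(self, role: str, content: str, **tool_call: Any) -> None:
        """Append one turn; keyword arguments describe an optional tool call."""
        self.turns.append(Turn(role, content, dict(tool_call)))

    def final_output(self) -> str:
        """Last assistant message, i.e. what a final-output-only eval would see."""
        for turn in reversed(self.turns):
            if turn.role == "assistant":
                return turn.content
        return ""

    def to_json(self) -> str:
        """Serialize the whole session so it can be stored and scored later."""
        return json.dumps(asdict(self))


if __name__ == "__main__":
    trace = Trace("session-1")
    trace.add("user", "What is the weather in Paris?")
    trace.add("assistant", "Calling the weather tool.", name="get_weather", city="Paris")
    trace.add("tool", "18C")
    trace.add("assistant", "It is 18C in Paris.")
    print(trace.final_output())
    print(trace.to_json())
```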

1 reply

animator Mar 12, 2026
Maintainer Author

Kindly post all your thoughts in a single thread instead of adding a new comment every time.

VanshKaushal · 2026-03-12T09:28:23Z

VanshKaushal
Mar 12, 2026

Hi everyone @animator 👋

I explored the idea of building an end-to-end AI and Agent API evaluation framework, and I’d like to share some thoughts on how such a system could be designed.

The main goal seems to be creating a unified platform where developers can benchmark and evaluate AI models and agents across multiple providers and modalities (text, image, voice). Instead of manually writing scripts for evaluation, users could configure tests through a UI and run standardized benchmarks.

One approach would be to structure the system around three main layers: evaluation engine, API orchestration layer, and visualization dashboard.

For the backend, Python would be well suited since most evaluation tools and ML ecosystems already exist there. The evaluation engine could integrate with frameworks such as lm-harness and lighteval to run standard benchmarks. The system would load datasets, send prompts to different model APIs, collect responses, and compute metrics such as accuracy, BLEU, ROUGE, latency, cost, and pass/fail rates.

Above this, an API orchestration layer could handle interactions with different AI providers. This layer would normalize API requests so that the same evaluation task can run across multiple services (for example OpenAI-style APIs, local models, or hosted inference services). It would also support agent evaluations by allowing tool-call simulations and multi-step workflows.

For the frontend, a React + TypeScript interface could provide a dashboard where users can configure evaluation runs. The UI might include sections for selecting a benchmark, uploading or defining custom datasets, configuring model parameters (temperature, max tokens, etc.), and selecting which APIs or models to test. Users could then launch evaluation runs directly from the interface.

The results dashboard would visualize metrics in a clear way. For example, leaderboards comparing models across benchmarks, charts showing accuracy or latency trends, detailed logs of model outputs, and side-by-side comparisons of responses. For agent evaluations, it could also display tool usage traces or step-by-step reasoning flows.

To support multi-modal evaluation, the framework could define different evaluation adapters. Text models would run standard NLP benchmarks, image models could be evaluated using captioning or classification tasks, and voice models could be tested using speech-to-text or text-to-speech benchmarks. Each modality would share the same core pipeline but use specialized metrics.

Another useful feature would be a plugin-based benchmark system. This would allow users to add new tasks or datasets without modifying the core framework. Benchmarks could define their dataset, evaluation logic, and scoring metrics, making the system extensible.
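A plugin-style benchmark registry like the one described above might look roughly like this. It is a sketch under assumptions: the names `register_benchmark`, `run_benchmark`, and `exact_match` are hypothetical, a benchmark is reduced to a list of (prompt, expected) pairs plus a scoring function, and the "model" is any callable from prompt to output.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Sample = Tuple[str, str]               # (prompt, expected output)
Scorer = Callable[[str, str], float]   # (model output, expected) -> score

_REGISTRY: Dict[str, Tuple[List[Sample], Scorer]] = {}


def register_benchmark(name: str, samples: Iterable[Sample], scorer: Scorer) -> None:
    """Add a benchmark without touching the core runner."""
    if name in _REGISTRY:
        raise ValueError(f"benchmark already registered: {name}")
    _REGISTRY[name] = (list(samples), scorer)


def run_benchmark(name: str, model: Callable[[str], str]) -> float:
    """Return the mean score of `model` over the named benchmark."""
    samples, scorer = _REGISTRY[name]
    if not samples:
        return 0.0
    scores = [scorer(model(prompt), expected) for prompt, expected in samples]
    return sum(scores) / len(scores)


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0


if __name__ == "__main__":
    register_benchmark("toy-arithmetic", [("1+1", "2"), ("2+2", "4")], exact_match)
    # Stub model that gets one of the two answers wrong.
    canned = {"1+1": "2", "2+2": "5"}
    print(run_benchmark("toy-arithmetic", lambda p: canned.get(p, "")))
```

New tasks are added by calling `register_benchmark`, so the runner itself never changes. A real system would likely discover plugins from files or package entry points instead of explicit calls.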

A typical workflow might look like this: the user selects a benchmark or uploads a dataset, chooses one or more model APIs, configures parameters, and starts an evaluation run. The backend distributes the requests, collects responses, calculates metrics, and sends the results back to the UI where they are visualized in dashboards and comparison tables.

Some possible future improvements could include experiment tracking, dataset versioning, reproducibility settings, and support for automated regression testing of models.

I would be interested in prototyping a minimal evaluation pipeline (dataset → API requests → scoring → visualization) to validate the architecture and gradually extend it with more benchmarks and modalities.

Looking forward to feedback and suggestions on this approach.

1 reply

animator Mar 12, 2026
Maintainer Author

@VanshKaushal You can send an idea doc PR.

James-ezechinyere · 2026-03-13T16:44:33Z

James-ezechinyere
Mar 13, 2026

Hi @animator, I'm James from Nigeria, ML engineer with experience in LLM evaluation pipelines. Based on the thread I understand the core goal is making multimodal API output testing intuitive — not agent evaluation. My approach is a Flutter-native eval tab with a Python backend wrapping lm-harness/lighteval, supporting local and remote datasets. One question: for image evaluation specifically, are you thinking captioning/VQA benchmarks or custom user-defined test cases?

1 reply

animator Mar 22, 2026
Maintainer Author

@James-ezechinyere Please follow the application guide and send proposal doc PR for your proposed solution.

Sahil-aka · 2026-03-18T17:13:59Z

Sahil-aka
Mar 18, 2026

Hi! I'm really interested in this idea.

I’ve recently been working on building a benchmarking and evaluation system for eye-tracking APIs, where I designed an evaluation pipeline, implemented statistical metrics, and added reporting (including PDF-based summaries). This gave me hands-on experience with structuring evaluation workflows and handling model performance metrics.

For this project, I’m particularly interested in:

Designing a modular evaluation backend (supporting text, image, and potentially voice models)
Integrating benchmark tools like lm-harness/lighteval
Building a flexible API layer for running evaluations across different AI services

I was thinking of structuring the system into:

Evaluation Engine (metrics + benchmark runners)
API Layer (handling requests, datasets, and model calls)
Result Processing + Visualization layer

I’d love to know:

Is there any existing evaluation-related work already in the repo?
What would be a good first contribution to get started?

Looking forward to contributing!

1 reply

animator Mar 22, 2026
Maintainer Author

@Sahil-aka Please follow the application guide and send proposal doc PR for your proposed solution.

Devil-nkp · 2026-03-19T06:51:58Z

Devil-nkp
Mar 19, 2026

Hi @animator and team,
I am Naveen Kumar,
I'm really interested in the Multimodal AI and Agent API Eval Framework project for GSoC 2026.
I’ve been working with Python to build evaluation pipelines, RAG systems, and multimodal workflows (voice, image, and text). The idea of creating an intuitive interface to run benchmarks like lm-harness and lighteval while supporting agent evaluation really appeals to me.
I’ve set up the repo locally and would love to start contributing. Is there a good first issue or small task I can pick up to get familiar with the codebase?
Happy to follow any guidelines and share progress regularly.
Thanks!

1 reply

animator Mar 22, 2026
Maintainer Author

@Devil-nkp Please follow the application guide and send proposal doc PR for your proposed solution.

Balukodeboyina · 2026-03-19T15:21:47Z

Balukodeboyina
Mar 19, 2026

Hi everyone,

I am Balu Kodeboyina, a Computer Science student from India.

I am interested in the Idea #2 – Multimodal AI & Agent API Eval Framework project for GSoC 2026.

My background is in AI/ML and LLM systems. I have experience with Python,
LangChain, LangGraph, Llama models, OpenAI API, HuggingFace Transformers,
RAG architecture, vector embeddings, FAISS, and PyTorch/TensorFlow.

I have started exploring the repository and contribution guide,
and I would like to begin contributing to issues related to this idea.

Could you please suggest beginner-friendly issues to get started?

Looking forward to contributing to API Dash.

1 reply

animator Mar 22, 2026
Maintainer Author

@Balukodeboyina Please follow the application guide and send proposal doc PR for your proposed solution.

KERDAWY-2 · 2026-03-23T23:23:46Z

KERDAWY-2
Mar 23, 2026

Hi @animator, I'm Abdelrahman, a 3rd-year Communication and Information Engineering student at Zewail City. I've submitted my proposal PR and wanted to share my technical thinking here.

I went with Flutter/Dart frontend + Python (FastAPI) backend for this. A few specific decisions I landed on after reading through this thread:

On streaming: SSE over WebSocket for progress updates from the eval runner — simpler to implement, no bidirectional overhead needed since the client only needs to receive run progress, not send anything back mid-run.

On dataset handling: Supporting both local file uploads and remote URLs (glad to see that's confirmed ✅). For the data model I'm thinking a simple normalized schema: {id, input_type, input_payload, expected_output} that works across text, image (base64 or URL), and audio.

On metrics: Starting with exact match, BLEU, and ROUGE for text. For image captioning tasks, BLEU/ROUGE still apply. Custom metric scripts as a plugin so users aren't locked into the defaults.
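The normalized schema mentioned above could be expressed as a small dataclass. The four field names come from the comment; everything else is an assumption made for illustration, including the `EvalSample` and `load_samples` names, the allowed `input_type` values, and the choice of JSON Lines as the dataset format.

```python
import json
from dataclasses import dataclass
from typing import List

# Assumed set of modalities; the comment names text, image, and audio.
ALLOWED_INPUT_TYPES = {"text", "image", "audio"}


@dataclass(frozen=True)
class EvalSample:
    id: str
    input_type: str
    input_payload: str    # raw text, a URL, or base64-encoded bytes
    expected_output: str

    def __post_init__(self) -> None:
        if self.input_type not in ALLOWED_INPUT_TYPES:
            raise ValueError(f"unsupported input_type: {self.input_type!r}")


def load_samples(jsonl_text: str) -> List[EvalSample]:
    """Parse one JSON object per non-empty line into EvalSample records."""
    samples = []
    for line in jsonl_text.splitlines():
        if line.strip():
            samples.append(EvalSample(**json.loads(line)))
    return samples


if __name__ == "__main__":
    data = (
        '{"id": "1", "input_type": "text", "input_payload": "2+2?", "expected_output": "4"}\n'
        '{"id": "2", "input_type": "image", "input_payload": "https://example.com/cat.png", '
        '"expected_output": "a cat"}\n'
    )
    for sample in load_samples(data):
        print(sample.id, sample.input_type)
```

Keeping one record shape for every modality lets the same loader and runner serve text, image, and audio datasets, with only the scoring step differing per task.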

One question: For the eval runner integration with lm-harness/lighteval — are you expecting contributors to wrap the existing CLI tools as subprocesses, or build a proper Python API integration against their library interfaces? The subprocess approach is simpler to get working but the library approach would give cleaner programmatic control over results.

Proposal PR: #1430

1 reply

animator Apr 1, 2026
Maintainer Author

@KERDAWY-2 go through this resource to study if MCP Apps can be used to show Eval UI directly inside AI Agent Chat.

Awaisranahmad · 2026-03-25T06:08:52Z

Awaisranahmad
Mar 25, 2026

Hi @animator, I'm an IT student currently building AI-integrated Flutter apps. I've developed projects like 'VibeCheck AI' (real-time multimodal mood detection using Groq/MediaPipe) and 'AI Sentinel Ultra'. I'm very interested in Idea#2. Since API Dash is already in Flutter, I'm planning to propose a Flutter-native UI for the Eval Framework with a Python/FastAPI backend to handle benchmarks. Working on my proposal doc now!

3 replies

animator Apr 1, 2026
Maintainer Author

@Awaisranahmad go through this resource to study if MCP Apps can be used to show Eval UI directly inside AI Agent Chat.

Awaisranahmad Apr 2, 2026

Hi @animator, I've already started exploring the MCP Apps resource you shared. I'm planning to use the Python MCP SDK with FastAPI to build the backend. This way, we can expose the evaluation as 'Tools' that the AI Agent can trigger. For the Agentic UI, I'm looking into how to use Artifacts to display real-time charts and benchmark tables directly in the chat, similar to the sales analytics sample.

Awaisranahmad Apr 2, 2026

Hi @animator, I've thoroughly analyzed the MCP Apps resource. Based on it, I've refined the architecture for Idea #2.

My plan is to use FastAPI as an MCP Server to handle evaluations asynchronously. This will allow the AI Agent to trigger tools and receive real-time logs via Resources. For the UI, I'll use Artifacts to render interactive charts/tables directly in the chat, ensuring a seamless 'Agentic' experience.

I've already designed the Sequence Diagram for this flow (attached below/link here) to ensure a robust implementation. Although my proposal PDF is already submitted, I'll be following this MCP-based approach during the coding period.

Shivamtyagi179 · 2026-03-25T19:02:57Z

Shivamtyagi179
Mar 25, 2026

Hi! I’m Shivam Tyagi.

I’m also very interested in Idea #2 (Multimodal AI & Agent API Eval Framework).
I have experience working with LLM-based assistants, API integrations, and evaluation workflows.

I wanted to ask:

Are there any existing components or partial implementations in the repo that we should build upon?
For evaluation pipelines, is there a preferred approach (e.g., integrating existing frameworks like lm-eval/lighteval vs building custom modules)?

Also, I’d love to start contributing — could you suggest some beginner-friendly issues aligned with this idea?

Looking forward to contributing!

2 replies

animator Apr 1, 2026
Maintainer Author

@Shivamtyagi179 go through this resource to study if MCP Apps can be used to show Eval UI directly inside AI Agent Chat.

Shivamtyagi179 Apr 1, 2026

Thanks for the guidance!

I’ll go through the MCP Apps resource and explore how it can be used to integrate the evaluation UI directly within the AI Agent Chat.

I’ll try to come up with a small approach or prototype around this and share my understanding here.

Also, please let me know if there are any specific parts of the repo or issues I should explore while working on this.

Looking forward to contributing!

PrajwaL-N-TECHIE · 2026-03-29T08:09:31Z

PrajwaL-N-TECHIE
Mar 29, 2026

Hi @animator and the API Dash team 👋
I'm Prajwal N, a B.Tech AI & Data Science student from SNS College of Engineering, India. I've been following API Dash for a while — the decision to build it in Flutter as a true Postman/Insomnia alternative that runs cross-platform from a single codebase is exactly the kind of thoughtful open-source engineering I want to contribute to.
I'm applying for Idea #2 — Multimodal AI and Agent API Eval Framework for GSoC 2026.
Why this idea excites me specifically:
API Dash already excels at helping developers send API requests and inspect responses. The eval framework is the natural next step — helping developers systematically measure how AI APIs perform across benchmarks, custom datasets, and modalities. The gap between "I can query GPT-4o" and "I can objectively compare GPT-4o vs Gemini on my specific use case" is exactly where this project lives, and I think it fits perfectly inside API Dash's existing API-testing philosophy.
Relevant background:

At EduSpine (Technical Lead), I built production Python/Flask backends handling 500+ concurrent AI API requests and designed custom evaluation pipelines using RAGAS to measure hallucination across RAG configurations
Built a multimodal AI pipeline (OpenCV + Flask) for medical image classification — directly relevant to the image eval layer
Built a real-time React + WebSocket dashboard for federated learning — same architecture as the eval run monitor
Worked with lm-evaluation-harness and lighteval in personal projects for LLM benchmarking
GitHub: github.com/PrajwaL-N-TECHIE

Specific questions before I finalize my proposal:

For the UI layer — does the team prefer the eval framework UI to be built inside the existing Flutter/Dart codebase, or is a separate React/TypeScript companion web app acceptable?
For agent evaluation — is there a preferred agent framework (OpenAI Agents SDK, LangGraph, custom) you'd like the trajectory capture to support first?
Are there specific benchmarks or AI providers the team considers highest priority for the first milestone?

I've already drafted a detailed 12-week proposal and would love feedback before submission. Happy to share a draft here or discuss on Discord in the #gsoc-foss-apidash channel.
Thanks for building something genuinely useful for the developer community 🙏
— Prajwal N

2 replies

animator Apr 1, 2026
Maintainer Author

@PrajwaL-N-TECHIE go through this resource to study if MCP Apps can be used to show Eval UI directly inside AI Agent Chat.

PrajwaL-N-TECHIE Apr 4, 2026

sure ...!

kinthaiofficial · 2026-04-29T00:00:35Z

kinthaiofficial
Apr 29, 2026

Agent API evaluation needs different dimensions than traditional API testing. A few additions to consider:

1. Cost-per-task as a first-class metric: Two agents might both complete a task correctly, but one costs $0.003 and the other costs $0.15. The eval framework should report cost alongside accuracy. We track costs in millicents (1/100,000 of a dollar) for precision.

2. Retry rate and self-correction efficiency: An agent that gets it right on the first try at $0.01 is better than one that retries 3 times to get the same result at $0.03 total. Track both first-attempt accuracy and cost-adjusted accuracy.

3. Delegation chain depth: For multi-agent evaluations, measure how deep the delegation chain goes. An agent that solves a task by delegating to 5 sub-agents creates more latency, more cost, and more failure points than one that handles it directly. Deeper is not better unless the task genuinely requires specialization.

4. Context efficiency: How much of the context window does the agent actually use productively? An agent that stuffs 100K tokens of context but only references 5K tokens is wasting money. We measure "context density" — the ratio of referenced tokens to total context tokens.

5. Behavioral consistency: Run the same evaluation multiple times and measure variance. An agent that scores 95% one run and 60% the next is less useful than one that consistently scores 80%.
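Several of these metrics reduce to short calculations. The sketch below is one possible reading, not the commenter's implementation: the `Attempt` record and the function names are hypothetical, "cost per solved task" is just one way to combine cost with accuracy, and consistency is measured as the population standard deviation of scores across repeated runs.

```python
from statistics import mean, pstdev
from typing import List, NamedTuple


class Attempt(NamedTuple):
    correct: bool
    cost_millicents: int     # integer millicents (1/100,000 of a dollar)
    referenced_tokens: int
    context_tokens: int


def first_attempt_accuracy(tasks: List[List[Attempt]]) -> float:
    """Fraction of tasks whose first attempt was correct (tasks must be non-empty)."""
    return mean(1.0 if attempts[0].correct else 0.0 for attempts in tasks)


def cost_per_solved_task(tasks: List[List[Attempt]]) -> float:
    """Total spend over all attempts, divided by the number of solved tasks."""
    total = sum(a.cost_millicents for attempts in tasks for a in attempts)
    solved = sum(1 for attempts in tasks if any(a.correct for a in attempts))
    return total / solved if solved else float("inf")


def context_density(attempt: Attempt) -> float:
    """Ratio of referenced tokens to total context tokens."""
    if attempt.context_tokens == 0:
        return 0.0
    return attempt.referenced_tokens / attempt.context_tokens


def score_spread(run_scores: List[float]) -> float:
    """Population standard deviation of scores across repeated runs."""
    return pstdev(run_scores)


if __name__ == "__main__":
    one_shot = [Attempt(True, 1000, 5_000, 100_000)]            # $0.01, first try
    retried = [Attempt(False, 1000, 5_000, 100_000)] * 2 + [
        Attempt(True, 1000, 5_000, 100_000)                     # $0.03 in total
    ]
    tasks = [one_shot, retried]
    print(first_attempt_accuracy(tasks))
    print(cost_per_solved_task(tasks))
    print(context_density(one_shot[0]))
    print(score_spread([0.95, 0.60]), score_spread([0.80, 0.80]))
```

Storing cost as integer millicents avoids floating-point drift when many small charges are summed.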

More on these metrics in practice: https://blog.kinthai.ai/agent-wallet-economic-models-autonomous-agents

0 replies

Replies: 19 comments · 22 replies
