|
| 1 | +# Vision Understanding Tool Implementation Plan |
| 2 | + |
| 3 | +> **For Claude:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. |
| 4 | +
|
| 5 | +**Goal:** Add a vision understanding tool that integrates ModelScope vision models, allowing users to paste image paths in chat and have the AI analyze them via tool calling. |
| 6 | + |
| 7 | +**Architecture:** Add a new `analyze_image` tool to the existing tools system. Create a vision model config file at `~/.chat/vision_model.json`. The tool sends the image (base64) + user prompt to the ModelScope OpenAI-compatible vision API with fallback support. |
| 8 | + |
| 9 | +**Tech Stack:** httpx (async HTTP), base64 (image encoding), LangChain @tool, OpenAI-compatible chat completions API |
| 10 | + |
| 11 | +--- |
| 12 | + |
| 13 | +### Task 1: Create Vision Model Config Module |
| 14 | + |
| 15 | +**Files:** |
| 16 | +- Create: `chcode/vision_config.py` |
| 17 | + |
| 18 | +**Step 1:** Create `chcode/vision_config.py` with: |
| 19 | +- Vision model presets (default: Kimi-K2.5, backups: Qwen3-VL series, Intern-S1) |
| 20 | +- Load/save vision config from `~/.chat/vision_model.json` |
| 21 | +- Auto-detect ModelScope token from env var or existing model config |
| 22 | +- Default vision config generation |
| 23 | + |
| 24 | +### Task 2: Add `analyze_image` Tool |
| 25 | + |
| 26 | +**Files:** |
| 27 | +- Modify: `chcode/utils/tools.py` — add `analyze_image` tool + register in `ALL_TOOLS` |
| 28 | + |
| 29 | +**Step 1:** Add `analyze_image` async tool that: |
| 30 | +- Accepts `image_path` and `prompt` params |
| 31 | +- Validates the image file exists and is a supported format (png/jpg/jpeg/gif/bmp/webp) |
| 32 | +- Reads the image file, base64-encodes it |
| 33 | +- Calls the ModelScope vision API (OpenAI-compatible chat completions with image content) |
| 34 | +- Falls back through backup vision models on failure |
| 35 | +- Returns the model's analysis text |
| 36 | + |
| 37 | +### Task 3: Update System Prompt |
| 38 | + |
| 39 | +**Files:** |
| 40 | +- Modify: `chcode/agent_setup.py` — update `load_skills` middleware to mention `analyze_image` |
| 41 | + |
| 42 | +**Step 1:** Add `analyze_image` to the system prompt tool list so the LLM knows to use it when users provide image paths. |
| 43 | + |
| 44 | +### Task 4: Update `/tools` Command Display |
| 45 | + |
| 46 | +**Files:** |
| 47 | +- Modify: `chcode/chat.py` — no changes needed (it reads from `ALL_TOOLS` dynamically) |
| 48 | + |
| 49 | +### Task 5: Add Vision Config Slash Command |
| 50 | + |
| 51 | +**Files:** |
| 52 | +- Modify: `chcode/chat.py` — add `/vision` command to configure vision models |
| 53 | +- Modify: `chcode/prompts.py` — add vision model configuration prompt |
| 54 | + |
| 55 | +**Step 1:** Add `/vision` slash command that lets users: |
| 56 | +- View current vision model config |
| 57 | +- Reconfigure vision models (pick default, set API key) |
| 58 | +- Test vision model connection |
| 59 | + |
| 60 | +--- |
| 61 | + |
| 62 | +## Verification |
| 63 | + |
| 64 | +1. Run `chcode` and type `/tools` — `analyze_image` should appear in the list |
| 65 | +2. Type `/vision` — should show current vision config |
| 66 | +3. In chat, paste an image path like `./test.png` with a question — the LLM should call `analyze_image` |
0 commit comments