A technical utility for localized screen-region OCR and translation, leveraging high-speed inference on the Groq platform.
This tool performs zero-shot OCR and context-aware translation by pipelining screen-captured buffers directly into a Multimodal Large Language Model (MLLM).
- Inference Server: Groq Cloud (LPU™ Inference Engine)
- Primary Model: `meta-llama/llama-4-scout-17b-16e-instruct`
- Frontend: Python `tkinter` (region selection & result rendering)
- Image Processing: `pyautogui` + `PIL` (Pillow)
- Standard: JSON-mode structured output for precise UI mapping
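The JSON-mode request can be sketched roughly as follows. The model name and response shape come from this README; `build_payload` is an illustrative helper (not part of the tool), and the payload follows the OpenAI-style chat-completions format that the `groq` client accepts:

```python
import base64

# Hypothetical helper showing the request structure; the real tool sends
# this payload through the `groq` client's chat-completions API.
def build_payload(png_bytes, target_language="English"):
    b64 = base64.b64encode(png_bytes).decode("ascii")
    prompt = (
        f"Extract all text from this image and translate it to {target_language}. "
        'Respond with JSON in the form [{"original": str, "translated": str}].'
    )
    return {
        "model": "meta-llama/llama-4-scout-17b-16e-instruct",
        "response_format": {"type": "json_object"},  # JSON-mode structured output
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # Screenshot travels inline as a Base64 data URL
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

Keeping the prompt and image in one multimodal message is what lets the model do OCR and translation in a single pass.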
- Platform: Windows 10/11
- Python: 3.8+
- Dependencies:
  - `groq`: API client
  - `pyautogui`: screenshot capture
  - `pillow`: image manipulation
  - `python-dotenv`: environment configuration
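The dependency list above maps to a minimal `requirements.txt` (unpinned here; pin versions as your deployment requires):

```
groq
pyautogui
pillow
python-dotenv
```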
Create a `.env` file in the root directory:

```
GROQ_API_KEY=gsk_your_api_key_here
GROQ_MODEL=meta-llama/llama-4-scout-17b-16e-instruct
TARGET_LANGUAGE=English
```
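In code, these variables are read after `python-dotenv` populates the environment. The sketch below uses only `os.getenv` with the same keys; `load_config` and its fallback defaults are illustrative, not the tool's actual function names:

```python
import os

# In the real tool, `from dotenv import load_dotenv; load_dotenv()` runs first,
# copying the .env file's values into os.environ. Reading then looks like:
def load_config():
    return {
        "api_key": os.getenv("GROQ_API_KEY", ""),
        "model": os.getenv("GROQ_MODEL",
                           "meta-llama/llama-4-scout-17b-16e-instruct"),
        "target_language": os.getenv("TARGET_LANGUAGE", "English"),
    }
```

Defaulting `GROQ_MODEL` and `TARGET_LANGUAGE` keeps the app usable with only the API key set.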
- Region Acquisition: `RegionSelector` (extending `tk.Tk`) creates a transparent fullscreen overlay to capture `(x, y, w, h)` coordinates.
- Buffer Processing: `pyautogui.screenshot` generates an image buffer, which is encoded to Base64 (PNG format).
- Inference: A vision-language request is dispatched to Groq; the model performs concurrent OCR, sentence reconstruction, and translation.
- Structured Parsing: The MLLM returns a JSON array: `[{ "original": str, "translated": str }]`.
- Display: Results are rendered in a non-modal `ScrolledText` widget for readability.
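The structured-parsing step can be sketched as below. The JSON shape is taken from this README; `parse_translation` and `render_lines` are hypothetical helper names, and the object-unwrapping fallback is an assumption (JSON mode guarantees valid JSON, and some models wrap a list in an object):

```python
import json

def parse_translation(raw):
    """Parse the MLLM's JSON reply into a list of original/translated pairs."""
    data = json.loads(raw)
    if isinstance(data, dict):
        # Tolerate a wrapped reply such as {"results": [...]} (assumed key).
        data = data.get("results", [data])
    return [
        {"original": item.get("original", ""),
         "translated": item.get("translated", "")}
        for item in data
    ]

def render_lines(pairs):
    """Format pairs for display in the ScrolledText widget."""
    return "\n".join(f"{p['original']} -> {p['translated']}" for p in pairs)
```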
- Install dependencies: `pip install -r requirements.txt`
- Run the application: execute `Run_Translator.bat`, or run `python translate_screenshot.py` directly.
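Since the batch file and the script are interchangeable entry points, `Run_Translator.bat` is presumably a thin wrapper; a minimal sketch, assuming a plain Windows setup with `python` on `PATH` and no virtual environment:

```bat
@echo off
python translate_screenshot.py
pause
```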
