Release/translate minimal #224
This commit creates a staged minimal release of OmniPDF focusing exclusively on PDF translation functionality.

Changes:
- Removed unnecessary services: chat_service, embedder_service, image_captioner_service, metadata_service, cleaner
- Removed ChromaDB dependency (not needed for translation)
- Updated docker-compose.yml to only include translation pipeline services
- Simplified frontend to show only Upload and Translate pages
- Removed unnecessary frontend pages (images, tables, wordcloud, metadata, settings)
- Updated README.md to reflect minimal translate-only scope
- Updated CLAUDE.md documentation for new architecture

Services retained for translation pipeline:
- pdf_processor_service (orchestrator)
- pdf_extraction_service (docling extraction)
- docling_translation_service (LLM translation)
- pdf_renderer_service (PDF generation)
- frontend (Streamlit UI)
- nginx (API gateway)
- redis (session management)
- minio (file storage)

Future features (metadata, chat, image captioning) remain in dev branch for staged releases.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Removed all references to embedder_service and metadata_service to complete the minimal translate-only release.

Changes in pdf_processor_service:
- Removed embed, metadata, and wordcloud routers from main.py
- Deleted routers: embed.py, metadata.py, wordcloud.py
- Updated utils/process.py to only call extraction (removed embedder and metadata calls)
- Cleaned utils/proxy.py to remove load_or_create_*_embedder_job and load_or_create_metadata_job functions
- Removed EMBED_URL and METADATA_URL from .env and example.env

Changes in frontend:
- Simplified upload_UI.py processing pipeline to only show: Extraction → Translation → Rendering
- Removed embedding and metadata stages from status display
- Updated processing flow to linear pipeline instead of multi-stage concurrent processing

This completes the minimal release with only the essential translation pipeline services.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Changed the app icon from 🦸 (superman) to 🌐 (globe), which better represents the translation functionality.

Updated in:
- frontend/main.py: page_icon
- frontend/my_pages/1_upload_UI.py: main header

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Updated the C4 container diagram to reflect the minimal translate-only release architecture.

Changes:
- Removed Istio Service Mesh components (simplified for minimal release)
- Removed embedder, image_captioner, metadata_service, cleaner, and chromadb
- Updated title to "OmniPDF Translate - Minimal Release"
- Simplified to core translation pipeline: nginx → frontend → pdf_processor → extraction → translation → renderer
- Added note explaining the 6-step translation workflow
- Listed removed services for clarity
- Updated relationships to show only essential data flows

The diagram now accurately represents the minimal 8-service architecture:
- 4 core services (pdf_processor, pdf_extraction, docling_translation, pdf_renderer)
- 2 UI/gateway (frontend, nginx)
- 2 data stores (redis, minio)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Removed Helm charts for services not needed in the minimal translate-only release.

Deleted charts:
- chromadb (no vector storage needed)
- cleaner (no automatic cleanup)
- embedder-service (no embedding generation)
- image-captioner-service (no image captioning)
- metadata-service (no metadata generation)
- istio-gateway (simplified deployment without service mesh)

Updated helm/rbac/values.yaml:
- Removed service accounts for deleted services
- Updated pdf-processor-service to only call: pdf-extraction, docling-translation, pdf-renderer
- Removed pdf-extraction-service call to image-captioner
- Removed all chromadb access permissions
- Simplified RBAC matrix to match minimal architecture

Remaining Helm charts (8 services):
- frontend
- nginx
- pdf-processor-service
- pdf-extraction-service
- docling-translation-service
- pdf-renderer-service
- minio
- redis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Update image repositories to omnipdf-prestaging namespace
- Add environment variable configurations for all services
- Disable pod disruption budgets for pre-staging flexibility
- Update images.txt to reflect minimal translate-only release
- Add OpenShift-compatible security contexts for nginx
- Configure service endpoints and MinIO/Redis connections

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Summary of Changes

Hello @NotYuSheng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents a strategic pivot for the OmniPDF project, consolidating its capabilities into a minimal viable product centered solely on PDF translation. This focused release aims to deliver a robust core translation experience by eliminating non-essential features and their underlying infrastructure, simplifying development, deployment, and maintenance. The changes are comprehensive, touching nearly every aspect of the project from documentation and architecture to code and deployment configurations.
Code Review
This pull request effectively refactors the project into a minimal, translation-focused release by removing several features and their corresponding services, such as embedding, metadata generation, and image captioning. The documentation, C4 diagram, and service configurations have been updated to reflect this smaller scope, which is a great step towards clarity. My review focuses on some configuration and security improvements in the updated Helm charts. The most critical issue is the hardcoding of secrets in the values.yaml files, which poses a security risk. I've also noted some potential configuration errors with empty secret names and a placeholder in the README that needs to be addressed.

    - name: MINIO_ACCESS_KEY
      value: "minioadmin"
    - name: MINIO_SECRET_KEY
      value: "minioadmin"
    - name: LLM_API_TOKEN
      value: "token-abc123"

Hardcoding secrets like MINIO_ACCESS_KEY, MINIO_SECRET_KEY, and LLM_API_TOKEN in Helm values is a significant security risk, as these credentials will be stored in version control. It is strongly recommended to manage these secrets using Kubernetes Secrets. The deployment should then reference these secrets using valueFrom.secretKeyRef.
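
As a minimal sketch of the `valueFrom.secretKeyRef` approach recommended above: the fragment below assumes a Kubernetes Secret named `omnipdf-credentials` with keys `minio-access-key`, `minio-secret-key`, and `llm-api-token`. The Secret name and key names are hypothetical and do not come from this PR; only the environment variable names are taken from the reviewed values.

```yaml
# Sketch only: the Secret name and keys below are assumptions, not part of this PR.
env:
  - name: MINIO_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: omnipdf-credentials   # hypothetical Secret name
        key: minio-access-key       # hypothetical key inside the Secret
  - name: MINIO_SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: omnipdf-credentials
        key: minio-secret-key
  - name: LLM_API_TOKEN
    valueFrom:
      secretKeyRef:
        name: omnipdf-credentials
        key: llm-api-token
```

With this layout the credentials stay out of version control; the Secret itself would be created separately, for example with `kubectl create secret generic omnipdf-credentials --from-literal=minio-access-key=<value>` (plus the other keys), or through whatever secret management the cluster already uses.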

    - name: MINIO_ACCESS_KEY
      value: "minioadmin"
    - name: MINIO_SECRET_KEY
      value: "minioadmin"

Hardcoding secrets like MINIO_ACCESS_KEY and MINIO_SECRET_KEY in Helm values is a significant security risk, as these credentials will be stored in version control. It is strongly recommended to manage these secrets using Kubernetes Secrets. The deployment should then reference these secrets using valueFrom.secretKeyRef.

    - name: MINIO_ACCESS_KEY
      value: "minioadmin"
    - name: MINIO_SECRET_KEY
      value: "minioadmin"

Hardcoding secrets like MINIO_ACCESS_KEY and MINIO_SECRET_KEY in Helm values is a significant security risk, as these credentials will be stored in version control. It is strongly recommended to manage these secrets using Kubernetes Secrets. The deployment should then reference these secrets using valueFrom.secretKeyRef.

```bash
make help
```
[Add your license information here]

    create: false
    name: docling-translation-service-secrets
    name: ""

    service:
      type: ClusterIP
      port: 9000
      consolePort: 9001

    create: false
    name: nginx-secrets
    name: ""

    create: false
    name: pdf-extraction-service-secrets
    name: ""

    create: false
    name: pdf-processor-service-prestaging-secrets
    name: ""

    create: false
    name: pdf-renderer-service-secrets
    name: ""

Note: this is the latest branch before the project was discontinued.