Release/translate minimal #224

Open
NotYuSheng wants to merge 6 commits into dev from release/translate-minimal

@NotYuSheng (Owner)

Note: this is the latest branch before the project was discontinued.

NotYuSheng and others added 6 commits December 10, 2025 09:18
This commit creates a staged minimal release of OmniPDF focusing exclusively on PDF translation functionality.

Changes:
- Removed unnecessary services: chat_service, embedder_service, image_captioner_service, metadata_service, cleaner
- Removed ChromaDB dependency (not needed for translation)
- Updated docker-compose.yml to only include translation pipeline services
- Simplified frontend to show only Upload and Translate pages
- Removed unnecessary frontend pages (images, tables, wordcloud, metadata, settings)
- Updated README.md to reflect minimal translate-only scope
- Updated CLAUDE.md documentation for new architecture

Services retained for translation pipeline:
- pdf_processor_service (orchestrator)
- pdf_extraction_service (docling extraction)
- docling_translation_service (LLM translation)
- pdf_renderer_service (PDF generation)
- frontend (Streamlit UI)
- nginx (API gateway)
- redis (session management)
- minio (file storage)

Future features (metadata, chat, image captioning) remain in dev branch for staged releases.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Removed all references to embedder_service and metadata_service to complete the minimal translate-only release.

Changes in pdf_processor_service:
- Removed embed, metadata, and wordcloud routers from main.py
- Deleted routers: embed.py, metadata.py, wordcloud.py
- Updated utils/process.py to only call extraction (removed embedder and metadata calls)
- Cleaned utils/proxy.py to remove load_or_create_*_embedder_job and load_or_create_metadata_job functions
- Removed EMBED_URL and METADATA_URL from .env and example.env

Changes in frontend:
- Simplified upload_UI.py processing pipeline to only show: Extraction → Translation → Rendering
- Removed embedding and metadata stages from status display
- Updated processing flow to linear pipeline instead of multi-stage concurrent processing

This completes the minimal release with only the essential translation pipeline services.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
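
The linear flow described in this commit (Extraction → Translation → Rendering) can be sketched roughly as follows. This is a minimal illustration, not the code in upload_UI.py: the stage functions `extract`, `translate`, and `render`, the `run_pipeline` helper, and the status strings are all hypothetical stand-ins for the real service calls.

```python
from typing import Callable

# Hypothetical stage functions. In the real project each stage is a separate
# service; here each one just marks the document dict as processed.
def extract(doc: dict) -> dict:
    return {**doc, "extracted": True}

def translate(doc: dict) -> dict:
    return {**doc, "translated": True}

def render(doc: dict) -> dict:
    return {**doc, "rendered": True}

# Stage order mirrors the status display named in the commit message.
STAGES: list[tuple[str, Callable[[dict], dict]]] = [
    ("Extraction", extract),
    ("Translation", translate),
    ("Rendering", render),
]

def run_pipeline(doc: dict, on_status: Callable[[str], None] = print) -> dict:
    """Run the stages strictly in order; each starts only after the previous one returns."""
    for name, stage in STAGES:
        on_status(f"{name}: running")
        doc = stage(doc)
        on_status(f"{name}: done")
    return doc
```

Calling `run_pipeline({"doc_id": "demo"})` reports each stage in turn. Because there is only one path through the stages, a failure in any stage stops the run at a single, well-defined point.
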
Changed the app icon from 🦸 (superman) to 🌐 (globe) which better represents the translation functionality.

Updated in:
- frontend/main.py: page_icon
- frontend/my_pages/1_upload_UI.py: main header

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Updated the C4 container diagram to reflect the minimal translate-only release architecture.

Changes:
- Removed Istio Service Mesh components (simplified for minimal release)
- Removed embedder, image_captioner, metadata_service, cleaner, and chromadb
- Updated title to "OmniPDF Translate - Minimal Release"
- Simplified to core translation pipeline: nginx → frontend → pdf_processor → extraction → translation → renderer
- Added note explaining the 6-step translation workflow
- Listed removed services for clarity
- Updated relationships to show only essential data flows

The diagram now accurately represents the minimal 8-service architecture:
- 4 core services (pdf_processor, pdf_extraction, docling_translation, pdf_renderer)
- 2 UI/gateway (frontend, nginx)
- 2 data stores (redis, minio)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Removed Helm charts for services not needed in the minimal translate-only release.

Deleted charts:
- chromadb (no vector storage needed)
- cleaner (no automatic cleanup)
- embedder-service (no embedding generation)
- image-captioner-service (no image captioning)
- metadata-service (no metadata generation)
- istio-gateway (simplified deployment without service mesh)

Updated helm/rbac/values.yaml:
- Removed service accounts for deleted services
- Updated pdf-processor-service to only call: pdf-extraction, docling-translation, pdf-renderer
- Removed pdf-extraction-service call to image-captioner
- Removed all chromadb access permissions
- Simplified RBAC matrix to match minimal architecture

Remaining Helm charts (8 services):
- frontend
- nginx
- pdf-processor-service
- pdf-extraction-service
- docling-translation-service
- pdf-renderer-service
- minio
- redis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

- Update image repositories to omnipdf-prestaging namespace
- Add environment variable configurations for all services
- Disable pod disruption budgets for pre-staging flexibility
- Update images.txt to reflect minimal translate-only release
- Add OpenShift-compatible security contexts for nginx
- Configure service endpoints and MinIO/Redis connections

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@gemini-code-assist (Contributor)

Summary of Changes

Hello @NotYuSheng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents a strategic pivot for the OmniPDF project, consolidating its capabilities into a minimal viable product centered solely on PDF translation. This focused release aims to deliver a robust core translation experience by eliminating non-essential features and their underlying infrastructure, simplifying development, deployment, and maintenance. The changes are comprehensive, touching nearly every aspect of the project from documentation and architecture to code and deployment configurations.

Highlights

  • Project Scope Reduction: The project has been significantly streamlined to focus exclusively on PDF translation functionality, removing features such as image captioning, metadata generation, and document embedding.
  • Architecture Simplification: The microservices architecture has been simplified, with several services (embedder, image captioner, metadata, cleaner, ChromaDB) and their associated infrastructure (Docker Compose, Helm charts, RBAC) being removed.
  • Frontend UI Updates: The frontend user interface has been updated to reflect the narrowed scope, removing navigation links and processing status indicators for the discontinued features.
  • Documentation Overhaul: The README.md and C4 architecture diagram have been completely rewritten to accurately describe the minimal translation-focused release, including a new roadmap for potential future feature re-introductions.
  • Helm Chart and Environment Configuration Cleanup: Numerous Helm charts for removed services have been deleted, and existing charts have been updated to remove dependencies and adjust environment variables, reflecting the streamlined deployment.

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request effectively refactors the project into a minimal, translation-focused release by removing several features and their corresponding services, such as embedding, metadata generation, and image captioning. The documentation, C4 diagram, and service configurations have been updated to reflect this smaller scope, which is a great step towards clarity. My review focuses on some configuration and security improvements in the updated Helm charts. The most critical issue is the hardcoding of secrets in the values.yaml files, which poses a security risk. I've also noted some potential configuration errors with empty secret names and a placeholder in the README that needs to be addressed.

Comment on lines +47 to +52
- name: MINIO_ACCESS_KEY
  value: "minioadmin"
- name: MINIO_SECRET_KEY
  value: "minioadmin"
- name: LLM_API_TOKEN
  value: "token-abc123"

high

Hardcoding secrets like MINIO_ACCESS_KEY, MINIO_SECRET_KEY, and LLM_API_TOKEN in Helm values is a significant security risk, as these credentials will be stored in version control. It is strongly recommended to manage these secrets using Kubernetes Secrets. The deployment should then reference these secrets using valueFrom.secretKeyRef.
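
To make the suggested shape concrete, here is a small sketch that rewrites the hardcoded entries into `valueFrom.secretKeyRef` references. The `valueFrom.secretKeyRef` structure (`name`, `key`) is the standard Kubernetes env var form; everything else is an assumption for illustration: the helper `to_secret_refs`, the Secret name `omnipdf-credentials`, and the choice to reuse each variable name as the key inside the Secret.

```python
import json

# Variable names flagged in the review comment above.
SENSITIVE = {"MINIO_ACCESS_KEY", "MINIO_SECRET_KEY", "LLM_API_TOKEN"}

def to_secret_refs(env: list[dict], secret_name: str) -> list[dict]:
    """Replace literal values of sensitive variables with secretKeyRef references."""
    out = []
    for entry in env:
        if entry["name"] in SENSITIVE and "value" in entry:
            out.append({
                "name": entry["name"],
                # Assumption: the Secret stores each credential under a key
                # equal to the environment variable name.
                "valueFrom": {"secretKeyRef": {"name": secret_name, "key": entry["name"]}},
            })
        else:
            out.append(dict(entry))  # non-sensitive entries pass through unchanged
    return out

hardcoded = [
    {"name": "MINIO_ACCESS_KEY", "value": "minioadmin"},
    {"name": "MINIO_SECRET_KEY", "value": "minioadmin"},
    {"name": "LLM_API_TOKEN", "value": "token-abc123"},
]

# "omnipdf-credentials" is a hypothetical Secret name, not one from this repository.
converted = to_secret_refs(hardcoded, "omnipdf-credentials")
print(json.dumps(converted, indent=2))
```

The Secret itself would be created outside version control, for example with `kubectl create secret generic`, so that the values file only ever carries references.
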

Comment on lines +47 to +50
- name: MINIO_ACCESS_KEY
  value: "minioadmin"
- name: MINIO_SECRET_KEY
  value: "minioadmin"

high

Hardcoding secrets like MINIO_ACCESS_KEY and MINIO_SECRET_KEY in Helm values is a significant security risk, as these credentials will be stored in version control. It is strongly recommended to manage these secrets using Kubernetes Secrets. The deployment should then reference these secrets using valueFrom.secretKeyRef.

Comment on lines +47 to +50
- name: MINIO_ACCESS_KEY
  value: "minioadmin"
- name: MINIO_SECRET_KEY
  value: "minioadmin"

high

Hardcoding secrets like MINIO_ACCESS_KEY and MINIO_SECRET_KEY in Helm values is a significant security risk, as these credentials will be stored in version control. It is strongly recommended to manage these secrets using Kubernetes Secrets. The deployment should then reference these secrets using valueFrom.secretKeyRef.

Comment thread README.md
```bash
make help
```
[Add your license information here]

medium

The license information is currently a placeholder. To avoid legal ambiguity and clarify the terms of use for this project, please replace [Add your license information here] with the project's actual license (e.g., MIT, Apache 2.0) or a link to the license file.

Comment on lines 24 to +25
   create: false
-  name: docling-translation-service-secrets
+  name: ""

medium

Setting secrets.name to an empty string is likely to cause deployment errors if the Helm template expects a valid secret name. If this service no longer requires a Kubernetes Secret, it would be clearer to remove this secrets block entirely.
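
One way to catch this before deployment is a small pre-flight check over the parsed values. The sketch below is an assumption-laden illustration: `check_secrets_block` is a hypothetical helper, and it takes the values as an already-parsed dict (loading values.yaml is not shown). It does not inspect the chart templates, so it cannot tell whether an empty name would actually fail at render time.

```python
def check_secrets_block(values: dict) -> list[str]:
    """Return human-readable problems found in a chart's `secrets` block."""
    problems = []
    secrets = values.get("secrets")
    if secrets is None:
        # No secrets block at all: this is the cleaner option the review suggests.
        return problems
    if not secrets.get("name"):
        problems.append("secrets.name is empty; remove the block or set a real Secret name")
    return problems
```

For the values shown above, `check_secrets_block({"secrets": {"create": False, "name": ""}})` returns one problem, while a values dict with no `secrets` key returns an empty list.
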

Comment thread helm/minio/values.yaml
service:
  type: ClusterIP
  port: 9000
  consolePort: 9001

medium

Exposing the MinIO console port is a helpful addition for debugging and management in development and prestaging environments. Good catch!

Comment thread helm/nginx/values.yaml
Comment on lines 21 to +22
   create: false
-  name: nginx-secrets
+  name: ""

medium

Setting secrets.name to an empty string is likely to cause deployment errors if the Helm template expects a valid secret name. If this service no longer requires a Kubernetes Secret, it would be clearer to remove this secrets block entirely.

Comment on lines 24 to +25
   create: false
-  name: pdf-extraction-service-secrets
+  name: ""

medium

Setting secrets.name to an empty string is likely to cause deployment errors if the Helm template expects a valid secret name. If this service no longer requires a Kubernetes Secret, it would be clearer to remove this secrets block entirely.

Comment on lines 24 to +25
   create: false
-  name: pdf-processor-service-prestaging-secrets
+  name: ""

medium

Setting secrets.name to an empty string is likely to cause deployment errors if the Helm template expects a valid secret name. If this service no longer requires a Kubernetes Secret, it would be clearer to remove this secrets block entirely.

Comment on lines 21 to +22
   create: false
-  name: pdf-renderer-service-secrets
+  name: ""

medium

Setting secrets.name to an empty string is likely to cause deployment errors if the Helm template expects a valid secret name. If this service no longer requires a Kubernetes Secret, it would be clearer to remove this secrets block entirely.
