Release/translate minimal #224
Open
NotYuSheng wants to merge 6 commits into dev from release/translate-minimal
Commits (6):
7a29b66 feat: Create minimal translate-only release (NotYuSheng)
8f81e8a fix: Remove embedder and metadata service dependencies (NotYuSheng)
709e6f4 style: Change app icon from superman to globe (NotYuSheng)
4ed8f14 docs: Update C4 diagram for minimal translate-only architecture (NotYuSheng)
c16c220 chore: Remove unnecessary Helm charts for minimal release (NotYuSheng)
49172e7 chore: Update Helm values for pre-staging deployment (NotYuSheng)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,262 +1,104 @@ | ||
| # OmniPDF | ||
| # OmniPDF Translate | ||
|
|
||
| > [!NOTE] | ||
| > Thank you for visiting! This project is currently a work in progress. Features, documentation, and deployment configurations are actively being developed and may change frequently. | ||
| > [!NOTE] | ||
| > This is a minimal staged release of OmniPDF focused exclusively on PDF translation functionality. Additional features (metadata, chat, image captioning) are available in other branches and will be released in future stages. | ||
|
|
||
| OmniPDF is a PDF analyzer capable of translation, summarization, and captioning. | ||
| OmniPDF Translate is a microservices-based PDF translation application that preserves document layout and formatting while translating content to multiple languages. | ||
|
|
||
| ## Architecture | ||
| ## Features | ||
|
|
||
| - **PDF Upload**: Simple web interface for uploading PDF documents | ||
| - **AI-Powered Translation**: Leverages LLM models for accurate, context-aware translation | ||
| - **Layout Preservation**: Maintains original document structure, fonts, and formatting | ||
| - **Multi-Language Support**: Translate to various languages through configurable LLM endpoints | ||
| - **Session Management**: Redis-backed sessions for tracking document processing state | ||
| - **Scalable Architecture**: Microservices design ready for container orchestration | ||
|
|
||
|  | ||
| ## Architecture | ||
|
|
||
| OmniPDF follows a **microservices architecture** with **centralized orchestration**: | ||
| OmniPDF Translate follows a **microservices architecture** with **centralized orchestration**: | ||
|
|
||
| - **pdf-processor-service**: Main hub that coordinates all processing workflows | ||
| - **Processing services**: Specialized services for extraction, translation, rendering, and embedding | ||
| - **Data layer**: Redis (sessions), ChromaDB (vectors), MinIO (files) | ||
| - **AI/ML layer**: vLLM text and vision-language models | ||
| - **Service mesh layer**: Istio for mTLS, traffic management, and observability (prestaging/staging/production) | ||
| ### Core Services | ||
| - **pdf-processor-service**: Central coordinator that orchestrates PDF translation workflows | ||
| - **pdf-extraction-service**: Extracts text and structure from PDFs using docling | ||
| - **docling-translation-service**: Translates docling-format JSON structures using LLM | ||
| - **pdf-renderer-service**: Overlays translated content onto original PDFs | ||
|
|
||
| ## Deployment Environments | ||
| ### Frontend & Gateway | ||
| - **frontend**: Streamlit web interface with upload and translate pages | ||
| - **nginx**: API gateway that routes requests and handles file uploads | ||
|
|
||
| OmniPDF supports multiple deployment environments with **Kubernetes + Helm**: | ||
| ### Data Services | ||
| - **redis**: Session storage and caching | ||
| - **minio**: S3-compatible object storage for PDFs and intermediate files | ||
|
|
||
| - **Development**: Docker Compose for local development | ||
| - **Pre-staging**: CodeReady Containers (CRC) with **Istio Service Mesh** + Helm charts | ||
| - **Staging**: Offline OpenShift Container Platform (OCP) with **organization's Istio** + Helm | ||
| - **Production**: Offline OpenShift Container Platform (OCP) with **organization's Istio** + Helm | ||
| ## Translation Pipeline | ||
|
|
||
| **Container Registry Patterns**: | ||
| - **Development**: Local Docker images | ||
| - **Pre-staging**: `default-route-openshift-image-registry.apps-crc.testing/omnipdf/SERVICE_NAME` | ||
| - **Staging/Production**: Internal/disconnected registries (images must be pre-mirrored) | ||
| 1. **Upload**: User uploads PDF via frontend → nginx → pdf_processor_service → MinIO | ||
| 2. **Extraction**: pdf_processor_service → pdf_extraction_service (docling extracts document structure) | ||
| 3. **Translation**: docling JSON → docling_translation_service (translates text content using LLM) | ||
| 4. **Rendering**: Translated content → pdf_renderer_service (generates translated PDF) | ||
| 5. **Download**: Frontend retrieves translated PDF via presigned URL from MinIO | ||
|
|
||
| ## Quick Start | ||
|
|
||
| ### Development (Docker Compose) | ||
|
|
||
| ```bash | ||
| # Start all services | ||
| docker compose up --build | ||
|
|
||
| # Start with GPU support (for LLM services) | ||
| docker compose -f docker-compose.gpu.yml up --build | ||
| # Access the application | ||
| # Open browser to http://localhost:8080 | ||
| ``` | ||
|
|
||
| ### Kubernetes/OpenShift (Helm) | ||
| ```bash | ||
| # Deploy individual service with explicit environment | ||
| helm install pdf-extraction-service ./helm/pdf-extraction-service \ | ||
| --values ./helm/pdf-extraction-service/values-prestaging.yaml \ | ||
| --namespace omnipdf | ||
|
|
||
| # Deploy all services using deployment script | ||
| ./scripts/deploy-helm-charts.sh --all --env prestaging | ||
|
|
||
| # Deploy RBAC only (13 individual service roles - should be deployed first) | ||
| ./scripts/deploy-helm-charts.sh --service rbac --env prestaging | ||
| ``` | ||
|
|
||
| ### Prestaging with Istio Service Mesh | ||
|
|
||
| For prestaging environment in CRC with full service mesh capabilities: | ||
|
|
||
| ```bash | ||
| # 1. Install Istio control plane | ||
| ./istio-1.27.1/bin/istioctl install --set values.defaultRevision=default -y | ||
|
|
||
| # 2. Create namespace with sidecar injection | ||
| oc create namespace omnipdf-prestaging | ||
| oc label namespace omnipdf-prestaging istio-injection=enabled | ||
|
|
||
| # 3. Deploy Istio Gateway and routing | ||
| helm install istio-gateway ./helm/istio-gateway \ | ||
| --namespace omnipdf-prestaging \ | ||
| --values ./helm/istio-gateway/values-prestaging.yaml | ||
|
|
||
| # 4. Deploy RBAC first (individual service roles) | ||
| helm install rbac ./helm/rbac \ | ||
| --namespace omnipdf-prestaging | ||
|
|
||
| # 5. Deploy services with Istio sidecars | ||
| for service in frontend pdf-processor-service embedder-service chromadb redis minio cleaner pdf-extraction-service docling-translation-service pdf-renderer-service image-captioner-service metadata-service; do | ||
| helm install $service ./helm/$service \ | ||
| --namespace omnipdf-prestaging \ | ||
| --values ./helm/$service/values-prestaging.yaml | ||
| done | ||
| ``` | ||
|
|
||
| **Istio Features Enabled:** | ||
| - **mTLS**: Automatic mutual TLS between all services | ||
| - **Traffic Management**: Intelligent routing and load balancing | ||
| - **Observability**: Distributed tracing and metrics | ||
| - **Security Policies**: Fine-grained access control | ||
|
|
||
| See [`helm/istio-gateway/INSTALL.md`](helm/istio-gateway/INSTALL.md) for detailed setup instructions. | ||
|
|
||
| ## Security Features | ||
|
|
||
| OmniPDF implements **defense-in-depth security** with multiple layers: | ||
|
|
||
| ### Service Account & RBAC | ||
| - **Individual service accounts** for each service with per-service secret isolation | ||
| - **13 individual RBAC roles** - one role per service aligned with C4 architecture: | ||
| - `pdf-processor-service-role`, `pdf-extraction-service-role`, `docling-translation-service-role` | ||
| - `embedder-service-role`, `pdf-renderer-service-role` | ||
| - `image-captioner-service-role`, `metadata-service-role` | ||
| - `minio-role`, `chromadb-role`, `redis-role` | ||
| - `frontend-role`, `nginx-gateway-role`, `cleaner-role` | ||
| - **Zero-trust security** - each service accesses only required services per C4 diagram | ||
| - **Complete audit trail** for inter-service communication | ||
|
|
||
| ### NetworkPolicy (Zero-Trust) | ||
|
|
||
| OmniPDF implements comprehensive zero-trust network policies with explicit service-to-service communication rules: | ||
|
|
||
| #### Service Communication Matrix | ||
|
|
||
| | Service | **Ingress (Who can call this service)** | **Egress (What this service can call)** | | ||
| |---------|----------------------------------------|----------------------------------------| | ||
| | **nginx** | • External traffic (users) | • istio-gateway:80/443<br>• DNS resolution | | ||
| | **istio-gateway** | • nginx | • frontend:8501<br>• pdf-processor-service:8000<br>• DNS resolution | | ||
| | **frontend** | • istio-gateway | • pdf-processor-service:8000<br>• DNS resolution | | ||
| | **pdf-processor-service** | • istio-gateway<br>• frontend | • pdf-extraction-service:8000<br>• docling-translation-service:8000<br>• pdf-renderer-service:8000<br>• embedder-service:8000<br>• metadata-service:8000<br>• minio:9000<br>• redis:6379<br>• DNS resolution | | ||
| | **pdf-extraction-service** | • pdf-processor-service | • image-captioner-service:8000<br>• minio:9000<br>• redis:6379<br>• DNS resolution | | ||
| | **docling-translation-service** | • pdf-processor-service | • minio:9000<br>• redis:6379<br>• DNS resolution<br>• HTTP/HTTPS (external vLLM text model) | | ||
| | **pdf-renderer-service** | • pdf-processor-service | • minio:9000<br>• redis:6379<br>• DNS resolution | | ||
| | **embedder-service** | • pdf-processor-service | • chromadb:8000<br>• minio:9000<br>• redis:6379<br>• DNS resolution | | ||
| | **image-captioner-service** | • pdf-extraction-service | • DNS resolution<br>• HTTP/HTTPS (external vLLM vision model) | | ||
| | **metadata-service** | • pdf-processor-service | • chromadb:8000<br>• minio:9000<br>• redis:6379<br>• DNS resolution<br>• HTTP/HTTPS (external vLLM text model) | | ||
| | **cleaner** | *No ingress (background service)* | • minio:9000<br>• chromadb:8000<br>• redis:6379<br>• DNS resolution | | ||
| | **chromadb** | • embedder-service<br>• metadata-service<br>• cleaner | • DNS resolution<br>*No outbound calls* | | ||
| | **redis** | • pdf-processor-service<br>• pdf-extraction-service<br>• docling-translation-service<br>• embedder-service<br>• pdf-renderer-service<br>• metadata-service<br>• cleaner | • DNS resolution<br>*No outbound calls* | | ||
| | **minio** | • pdf-processor-service<br>• pdf-extraction-service<br>• docling-translation-service<br>• pdf-renderer-service<br>• embedder-service<br>• metadata-service<br>• cleaner | • DNS resolution<br>*No outbound calls* | | ||
|
|
||
| #### Network Policy Configuration | ||
|
|
||
| | Environment | NetworkPolicy | Service Mesh | Description | | ||
| |-------------|---------------|--------------|-------------| | ||
| | **Development** | Disabled | None | Docker Compose - no network restrictions for local dev | | ||
| | **Prestaging** | Enabled | **Own Istio** | Zero-trust + mTLS within service mesh | | ||
| | **Staging** | Enabled | **Org Istio** | Zero-trust policies + organization's service mesh | | ||
| | **Production** | Enabled | **Org Istio** | Strict segmentation + organization's service mesh | | ||
|
|
||
| #### Key Architecture Patterns | ||
| ### Environment Setup | ||
|
|
||
| - **Service Mesh Gateway**: Istio Gateway handles external traffic in prestaging/staging/production | ||
| - **API Gateway**: nginx provides application-level routing (development) or internal routing (with Istio) | ||
| - **Orchestration Hub**: pdf-processor-service coordinates workflows across processing services | ||
| - **Data Layer Security**: Restricted access to chromadb (vectors), redis (sessions), and minio (files) | ||
| - **mTLS Communication**: Automatic mutual TLS between all services in service mesh environments | ||
| - **Background Services**: cleaner operates with minimal network permissions for cleanup tasks | ||
| - **External Connectivity**: Managed external vLLM/AI API access through ServiceEntry (Istio) or HTTPS egress | ||
|
|
||
| ### HPA (Horizontal Pod Autoscaler) | ||
| - **8 services** with auto-scaling enabled across 3 tiers: | ||
| - **Tier 1 (Critical)**: nginx, pdf-processor-service - aggressive scaling (60-70% thresholds) | ||
| - **Tier 2 (Processing)**: pdf-extraction, docling-translation, pdf-renderer - standard scaling (70% thresholds) | ||
| - **Tier 3 (Burst)**: embedder-service, image-captioner-service, metadata-service - conservative scaling (70% thresholds) | ||
| - **High availability**: Minimum 1-2 replicas with scaling up to 5-15 replicas based on service tier | ||
| - **Resource optimization**: Proactive scaling for user-facing services, workload-responsive for processing services | ||
|
|
||
| ## Security Configuration | ||
| Each service requires environment variables configured in `.env` files. Copy the `example.env` files: | ||
|
|
||
| ```bash | ||
| # Enable NetworkPolicy for production | ||
| helm upgrade pdf-extraction-service ./helm/pdf-extraction-service \ | ||
| --set networkPolicy.enabled=true \ | ||
| --namespace omnipdf | ||
|
|
||
| # Check service account permissions | ||
| kubectl auth can-i get secrets \ | ||
| --as=system:serviceaccount:omnipdf:pdf-extraction-service \ | ||
| -n omnipdf | ||
|
|
||
| # Monitor HPA status | ||
| kubectl get hpa -n omnipdf | ||
| # For each service directory | ||
| cp service_name/example.env service_name/.env | ||
| # Edit .env files with your configuration (LLM endpoints, credentials, etc.) | ||
| ``` | ||
|
|
||
| ## CRC (OpenShift Local) Setup | ||
| Key configuration: | ||
| - **LLM Configuration**: Set OPENAI_BASE_URL, OPENAI_MODEL for translation service | ||
| - **MinIO Storage**: Configure MINIO_ENDPOINT, MINIO_ACCESS_KEY, MINIO_SECRET_KEY | ||
| - **Redis**: Configure REDIS_URL for session management | ||
|
|
||
| OmniPDF uses Red Hat CodeReady Containers (CRC) for local OpenShift development. Due to the resource-intensive nature of running 8+ microservices, CRC requires significant CPU and memory allocation. | ||
|
|
||
| ### Recommended CRC Configuration | ||
| ## Testing | ||
|
|
||
| #### Quick Setup (Recommended) | ||
| ```bash | ||
| # Run the automated setup script | ||
| ./config/crc/setup-crc.sh | ||
|
|
||
| # Start CRC with configured settings | ||
| crc start | ||
|
|
||
| # Set up oc environment | ||
| eval $(crc oc-env) | ||
|
|
||
| # Get login credentials and login | ||
| crc console --credentials | ||
| oc login -u kubeadmin -p <password> https://api.crc.testing:6443 --insecure-skip-tls-verify | ||
| # Run tests for individual services | ||
| ./scripts/test-single-service.sh pdf_extraction_service | ||
| ./scripts/test-single-service.sh pdf_renderer_service | ||
| ./scripts/test-single-service.sh docling_translation_service | ||
| ``` | ||
|
|
||
| #### Manual Configuration | ||
| Alternatively, configure CRC manually: | ||
|
|
||
| ```bash | ||
| # Stop CRC if running | ||
| crc stop | ||
|
|
||
| # Configure CRC resources (adjust based on your system) | ||
| crc config set memory 32768 # 32GB RAM (adjust based on your system) | ||
| crc config set cpus 12 # 12 CPU cores (adjust based on your system) | ||
| crc config set disk-size 120 # 120GB disk (increased for ML workloads) | ||
| ## Project Structure | ||
|
|
||
| # Start CRC with new configuration | ||
| crc start | ||
| ``` | ||
|
|
||
| ### Configuration Notes | ||
|
|
||
| - **Memory**: 256GB recommended for running all microservices without constraints | ||
| - **CPU**: 32 cores provides abundant processing power for OpenShift + services | ||
| - **Disk**: 120GB recommended for container images, ML models, and persistent data | ||
| - **Configuration saved**: Current settings stored in `config/crc/crc-config.txt` | ||
|
|
||
| ### Verify Setup | ||
|
|
||
| ```bash | ||
| # Check CRC status | ||
| crc status | ||
|
|
||
| # Check node resources | ||
| oc describe node crc | grep -A 10 "Allocated resources" | ||
|
|
||
| # View current configuration | ||
| crc config view | ||
| service_name/ | ||
| ├── main.py # FastAPI app entry point | ||
| ├── Dockerfile | ||
| ├── requirements.txt | ||
| ├── .env # Environment configuration (create from example.env) | ||
| ├── example.env # Environment template | ||
| ├── models/ # Pydantic models and business logic | ||
| ├── routers/ # API route handlers | ||
| └── tests/ # Unit tests | ||
| ``` | ||
|
|
||
| ## Documentation | ||
|
|
||
| ## Testing | ||
|
|
||
| ```bash | ||
| # Run all service unit tests (180+ tests across 6 services) | ||
| ./scripts/test-all-services.sh | ||
|
|
||
| # Run tests for individual service | ||
| ./scripts/test-single-service.sh pdf-extraction-service | ||
| ## Roadmap | ||
|
|
||
| # Security scanning with Trivy | ||
| ./scripts/scan_with_trivy.sh | ||
|
|
||
| # Lint all Helm charts | ||
| find helm -maxdepth 1 -type d ! -name 'assets' ! -name 'helm' -exec helm lint {} \; | ||
| ``` | ||
| This minimal release focuses on core translation functionality. Future stages will include: | ||
|
|
||
| ## Development Workflow | ||
| - **Stage 2**: Metadata extraction and word cloud generation | ||
| - **Stage 3**: Image captioning and visual content analysis | ||
| - **Stage 4**: Chat/Q&A with RAG over translated documents | ||
| - **Stage 5**: Kubernetes/OpenShift deployment with Helm charts | ||
|
|
||
| This project uses a `Makefile` to simplify common Helm and Kubernetes operations. | ||
| ## License | ||
|
|
||
| To get started, run: | ||
|
|
||
| ```bash | ||
| make help | ||
| ``` | ||
| [Add your license information here] | ||
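The five-step Translation Pipeline described in the README diff above reads as a linear hand-off between stages coordinated by pdf-processor-service. The following is a minimal Python sketch of that coordination pattern only, not the project's actual code: `PipelineContext`, `run_pipeline`, and the three stage functions are hypothetical names, and each stage is a local stand-in for what the README describes as a separate service call (no docling, LLM, or MinIO access happens here).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class PipelineContext:
    """State handed from stage to stage (field names are illustrative only)."""
    pdf_bytes: bytes
    target_language: str
    docling_json: dict = field(default_factory=dict)
    translated_json: dict = field(default_factory=dict)
    translated_pdf: bytes = b""
    completed: List[str] = field(default_factory=list)


Stage = Callable[[PipelineContext], None]


def extract(ctx: PipelineContext) -> None:
    # Stand-in for pdf_extraction_service: pretend the PDF is plain text.
    text = ctx.pdf_bytes.decode("utf-8", errors="replace")
    ctx.docling_json = {"texts": [text]}


def translate(ctx: PipelineContext) -> None:
    # Stand-in for docling_translation_service: tag text instead of calling an LLM.
    ctx.translated_json = {
        "texts": [f"[{ctx.target_language}] {t}" for t in ctx.docling_json["texts"]]
    }


def render(ctx: PipelineContext) -> None:
    # Stand-in for pdf_renderer_service: join the translated text into bytes.
    ctx.translated_pdf = "\n".join(ctx.translated_json["texts"]).encode("utf-8")


# Order mirrors steps 2-4 of the README pipeline; upload and download are omitted.
STAGES: List[Tuple[str, Stage]] = [
    ("extraction", extract),
    ("translation", translate),
    ("rendering", render),
]


def run_pipeline(ctx: PipelineContext, stages: List[Tuple[str, Stage]]) -> PipelineContext:
    """Run each stage in order and record which ones finished."""
    for name, stage in stages:
        stage(ctx)
        ctx.completed.append(name)
    return ctx
```

Calling `run_pipeline(PipelineContext(b"Hello", "fr"), STAGES)` runs the three stand-in stages in order. Recording completed stage names is one way the coordinator could expose processing state, which the README says is tracked in Redis-backed sessions; how the real service does this is not shown in the diff.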
The license information is currently a placeholder. To avoid legal ambiguity and clarify the terms of use for this project, please replace `[Add your license information here]` with the project's actual license (e.g., MIT, Apache 2.0) or a link to the license file.
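On the Environment Setup section of the same README: it lists six variables (`OPENAI_BASE_URL`, `OPENAI_MODEL`, `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, `REDIS_URL`) to be set in each service's `.env` file. A small startup check along the following lines could make a missing value fail fast. This is a sketch under assumptions: `Settings` and `load_settings` are hypothetical names, and treating all six variables as required by a single service is a simplification, since the README only ties the `OPENAI_*` variables to the translation service.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variable names are taken from the README's "Key configuration" list.
REQUIRED_VARS = (
    "OPENAI_BASE_URL",
    "OPENAI_MODEL",
    "MINIO_ENDPOINT",
    "MINIO_ACCESS_KEY",
    "MINIO_SECRET_KEY",
    "REDIS_URL",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; one field per required variable."""
    openai_base_url: str
    openai_model: str
    minio_endpoint: str
    minio_access_key: str
    minio_secret_key: str
    redis_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ), failing on gaps."""
    source = os.environ if env is None else env
    # Treat unset and empty values the same way: both count as missing.
    missing = [name for name in REQUIRED_VARS if not source.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Settings(**{name.lower(): source[name] for name in REQUIRED_VARS})
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the check easy to unit-test without touching the real environment.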