Best practices for deploying and scaling Pathway LLM app templates on Kubernetes clusters #122
Replies: 4 comments 1 reply
Adding a bit more context on why I asked these questions: I'm in the middle of shaping the cluster layout for a real-time RAG/LLM setup, and before I lock in the pod sizes, node pool mix, and autoscaling logic, I want to understand what has actually worked for people running Pathway in production. I'm not after exact tuning numbers; mostly I'm trying to avoid going down a path that's known to cause trouble later. If there are a few "rules of thumb" you've seen teams rely on (how they normally size their first worker pods, what they watch for when enabling HPA, or common mistakes in splitting CPU-heavy vs memory-heavy components), that would already help me line things up properly from the start. I really appreciate any practical notes you can share; getting these foundations right usually saves a lot of rework once traffic ramps up. Thank you!
Hello @nholuongut, While we'd be interested in hearing more about your case (collection size, number of users, etc.), I hope that the following answers will give you a general intuition.
Please let us know if you have any follow-up questions.
Great question on K8s scaling! At RevolutionAI (https://revolutionai.io) we run similar RAG pipelines in production. One key learning: for Pathway specifically, the streaming architecture plays nicely with K8s, but make sure your readiness probes account for model loading time. We typically set a 60-90 s initial delay for LLM pods. Happy to share our Helm charts if useful!
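To illustrate the probe settings mentioned above, here is a minimal pod-spec fragment; the container name, image, port, and health endpoint are all placeholders, and the delay should be tuned to your actual model load time:

```yaml
# Illustrative container spec; image, port, and /health path are placeholders.
containers:
  - name: llm-worker
    image: registry.example.com/llm-app:latest
    ports:
      - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 90   # allow time for model loading before serving traffic
      periodSeconds: 10
      failureThreshold: 3
```

A `startupProbe` is another option if load time varies widely, since it suppresses the readiness checks until the container first reports healthy.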
These are the right questions to ask before locking in a cluster layout. The guidance about starting with larger pods and watching CPU during indexing is useful, especially the reminder that the worker count cannot be reduced once the program has started. For autoscaling, I would watch both infrastructure metrics and application metrics like queue depth or processing delay. CPU tells part of the story, but in real-time RAG systems it often misses the moments where latency is climbing before utilization looks alarming.
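To make the custom-metrics point concrete, here is a sketch of an `autoscaling/v2` HPA that combines CPU utilization with a per-pod queue-depth metric. The names and target values are illustrative, and the `queue_depth` metric assumes something like prometheus-adapter is exposing it through the custom metrics API:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rag-workers            # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rag-workers          # placeholder target deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: queue_depth    # illustrative custom metric from a metrics adapter
        target:
          type: AverageValue
          averageValue: "100"  # scale out when backlog per pod exceeds ~100 items
```

With multiple metrics, the HPA takes the highest replica count any of them proposes, so the queue-depth signal can trigger scale-out even while CPU still looks calm.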
Hi Pathway team 👋,
I’ve been going through the llm-app templates and exploring how they behave in a Kubernetes setup (AKS/EKS/GKE). I’m planning to run them as part of a larger ML/DevOps cluster, so I wanted to check a few practical things from your experience.
For the worker pods, does Pathway generally scale better with many small pods (horizontal scaling) or with fewer, larger pods (vertical scaling)? I've seen both patterns work depending on the engine.
Do you have any rough CPU/memory ranges that tend to be stable for real workloads (e.g., RAG pipelines with fairly heavy traffic)? I’m trying to avoid guessing too much during initial sizing.
For autoscaling, is CPU usually enough, or do people rely more on custom metrics like queue depth / processing delay?
When running on mixed node pools (compute-optimized vs memory-optimized), is there any recommended placement strategy for the Pathway workers vs vector DB or storage components?
And finally, any common issues when putting llm-app behind an API Gateway or Ingress (timeouts, concurrency limits, pooling, etc.)?
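For context on the mixed-pool question, the kind of split I have in mind looks roughly like this; the pool labels are placeholders for whatever the cluster actually defines:

```yaml
# Pathway workers pinned to the compute-optimized pool (label key/values are placeholders)
spec:
  nodeSelector:
    pool: compute-optimized
---
# Vector DB / storage components pinned to the memory-optimized pool
spec:
  nodeSelector:
    pool: memory-optimized
```

Taints and tolerations on top of this would keep unrelated workloads off the memory-optimized nodes, if that matters for your setups.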
I’m asking because I’m designing a real-time RAG/LLM deployment and want to align with what tends to work well in practice for Pathway users.
Thanks a lot!
Best,
Nho Luong