Best practices for deploying and scaling Pathway LLM app templates on Kubernetes clusters #122

nholuongut · 2025-11-17T10:31:59Z

nholuongut
Nov 17, 2025

Hi Pathway team 👋,

I’ve been going through the llm-app templates and exploring how they behave in a Kubernetes setup (AKS/EKS/GKE). I’m planning to run them as part of a larger ML/DevOps cluster, so I wanted to check a few practical things from your experience.

For the worker pods, does Pathway generally scale better with more small pods (horizontal) or with fewer but larger pods? I’ve seen both patterns work depending on the engine.
Do you have any rough CPU/memory ranges that tend to be stable for real workloads (e.g., RAG pipelines with fairly heavy traffic)? I’m trying to avoid guessing too much during initial sizing.
For autoscaling, is CPU usually enough, or do people rely more on custom metrics like queue depth / processing delay?
When running on mixed node pools (compute-optimized vs memory-optimized), is there any recommended placement strategy for the Pathway workers vs vector DB or storage components?
And finally, any common issues when putting llm-app behind an API Gateway or Ingress (timeouts, concurrency limits, pooling, etc.)?

I’m asking because I’m designing a real-time RAG/LLM deployment and want to align with what tends to work well in practice for Pathway users.

Thanks a lot!

Best,
Nho Luong

nholuongut · 2025-11-17T10:35:19Z

nholuongut
Nov 17, 2025
Author

Adding a bit more context on why I asked these questions:

I’m in the middle of shaping the cluster layout for a real-time RAG/LLM setup, and before I lock in the pod sizes, node pool mix, and autoscaling logic, I want to understand what has actually worked for people running Pathway in production. I’m not after exact tuning numbers — mostly trying to avoid going down a path that’s known to cause trouble later.

If there are a few “rules of thumb” you’ve seen teams rely on (like how they normally size their first worker pods, what they watch for when enabling HPA, or common mistakes in splitting CPU-heavy vs memory-heavy components), that would already help me line things up properly from the start.

Really appreciate any practical notes you can share. Getting these foundations right usually saves a lot of rework once traffic ramps up.

Thank you,
Nho Luong

0 replies

zxqfd555 · 2025-11-18T14:33:06Z

zxqfd555
Nov 18, 2025
Maintainer

Hello @nholuongut,

While we'd be interested in hearing more about your case (collection size, number of users, etc.), I hope that the following answers will give you a general intuition.

At the start, I suggest using the larger pods. The reason for that is that the current version of llm-app parses the collection within a machine (although it can be parallelized to several CPUs), and the constructed index of document embeddings will also be stored on a single machine. I would estimate whether this kind of deployment serves your needs; otherwise, you can probably give a shot at the solution with several shards, each served by a smaller pod, but that would be more complex to maintain.
When starting, you specify the number of Pathway workers to perform the indexing. It would be impractical to add more CPUs than the workers in this kind of deployment, so you can approximate the number of CPUs as the number of workers you define. When the indexing is done, it makes sense to expect that your CPU usage goes down - this metric can be tracked and then downscaled. Speaking of memory, the safe bet is the size of your collection. This can be greatly optimized with a bit harder pipeline that we can discuss - it's hard for me to conclude if it is needed right now.
As mentioned above, the indexing is the most CPU-intensive part; after it's done, you'll probably consider downscaling the pipeline. It must be trackable from the CPU usage. Please keep in mind that once you've started the program, you can't downscale the number of workers, but if idle, their resource consumption will already be low.
The base setup in this repository doesn't include the Vector DB (it's an option, but would be harder). For llm-app, we haven't had measurements on the different mixed node pools. Given the nature of the computation, I'd say you can try with a memory-optimized pool. Please also note that the current implementation consists of a single component: the index is a part of the Pathway program, so there's no need to reconcile Pathway with other components. But if you have a more complex setup, please let us know the details.
The llm-app uses a regular web server that doesn't have any peculiarities, at least at a glance. However, as a practical consideration, I suggest setting timeouts several times higher than the autocommit_duration_ms parameter in the respective input connectors. The standard server implementation doesn't include any limits concerning concurrency and pooling.

Please let us know if you have any follow-up questions.

1 reply

nholuongut Nov 19, 2025
Author

Hi @zxqfd555

Thanks a lot for the super detailed reply — this is really helpful for the setup I’m designing.

Let me share a bit of my perspective from actually deploying similar workloads on AKS/EKS, and you can let me know if I’m interpreting the current Pathway behavior correctly:

About starting with larger pods
This lines up well with what I’ve seen on CPU-heavy index builders. The initial “scan + index-build” stage is usually the biggest bottleneck, so packing more CPU into fewer big pods tends to finish the init phase faster. Horizontal scaling doesn’t help much there because the workload inside Pathway is still mostly a linear job.
→ So yeah, using bigger pods at the beginning makes sense.

Worker count = actual usable CPUs for indexing
I understand your point that Pathway doesn’t split workers into separate processes yet, so the number of workers basically defines how many CPU threads can be used. This makes it easy to predict CPU usage during the initial indexing phase — just set workers equal to the number of CPUs you want Pathway to consume.
→ That’s a very straightforward tuning approach for the first run.

CPU-heavy during indexing, then downscale when idle
From past real-time RAG deployments, this behavior sounds accurate: the moment indexing is done, the pipeline becomes event-driven and usually stays “low CPU”. I’ll track CPU usage directly for HPA thresholds.
→ The key thing I’m taking away is: since worker count can’t be reduced after the program starts, sizing it correctly upfront really matters.

Memory-optimized node pools for llm-app
I noticed your point about no benchmarks on mixed node pools, but it does sound like llm-app becomes more memory-heavy during runtime (when indexing is done), while CPU only spikes in the beginning. I’m splitting the vector DB out anyway, so trying llm-app on a memory-optimized pool sounds reasonable.
→ This aligns nicely with what I’m planning: CPU spike upfront, then memory-bound during normal operation.

Timeouts behind API Gateway / Ingress
What you said about setting timeouts higher than autocommit_duration_ms makes sense. From experience, a lot of “fake timeouts” come from the gateway layer, not the service itself.
→ I’ll bump AGIC / Nginx / API Gateway timeouts a few times above that value to be safe.

If you also have any rules of thumb around sharding the indexing stage (for horizontal scaling during initialization) or tips for long-running index updates, I’d love to hear them.

Thanks again — your reply really clarified the direction I should go.

xXMrNidaXx · 2026-02-23T13:03:08Z

xXMrNidaXx
Feb 23, 2026

Great question on K8s scaling! At RevolutionAI (https://revolutionai.io) we run similar RAG pipelines in production. Key learnings:

Horizontal pod autoscaling - scale on custom metrics like queue depth, not just CPU
Separate retrieval and generation - different scaling characteristics, deploy as separate services
Connection pooling - especially for vector DBs, connection overhead kills performance
Warm pools - keep some replicas pre-warmed to handle burst traffic

For Pathway specifically, the streaming architecture plays nicely with K8s - just make sure your readiness probes account for model loading time. We typically set 60-90s initial delay for LLM pods.

Happy to share our Helm charts if useful!

0 replies

aniruddhaadak80 · 2026-03-10T06:22:17Z

aniruddhaadak80
Mar 10, 2026

These are the right questions to ask before locking in a cluster layout. The guidance about starting with larger pods and watching CPU during indexing is useful, especially the reminder that worker count cannot be reduced mid-run once the program starts.

For autoscaling, I would watch both infrastructure metrics and application metrics like queue depth or processing delay. CPU tells part of the story, but in real-time RAG systems it often misses the moments where latency is climbing before utilization looks alarming.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Best practices for deploying and scaling Pathway LLM app templates on Kubernetes clusters #122

Uh oh!

{{title}}

Uh oh!

Replies: 4 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Best practices for deploying and scaling Pathway LLM app templates on Kubernetes clusters #122

Uh oh!

nholuongut Nov 17, 2025

Replies: 4 comments · 1 reply

Uh oh!

nholuongut Nov 17, 2025 Author

Uh oh!

zxqfd555 Nov 18, 2025 Maintainer

Uh oh!

nholuongut Nov 19, 2025 Author

Uh oh!

xXMrNidaXx Feb 23, 2026

Uh oh!

aniruddhaadak80 Mar 10, 2026

nholuongut
Nov 17, 2025

Replies: 4 comments 1 reply

nholuongut
Nov 17, 2025
Author

zxqfd555
Nov 18, 2025
Maintainer

nholuongut Nov 19, 2025
Author

xXMrNidaXx
Feb 23, 2026

aniruddhaadak80
Mar 10, 2026