MCP Horizontal Scaling Template (Docker Swarm)

Production-Ready Template: Stateless MCP server template for horizontal scaling in Docker Swarm

This template demonstrates how to build and deploy MCP servers that can scale horizontally across multiple containers in Docker Swarm. It uses FastMCP's stateless_http=True mode to enable true load balancing without session affinity.

🔬 Research Summary

The Problem

MCP servers are stateful by design:

SSE (Server-Sent Events) connections stay open
Session IDs are tracked in server memory
If a request lands on a different replica → 400 Bad Request

What We Tried

Approach	Result	Why
Redis Session Store	❌ Failed	FastMCP has internal `StreamableHTTP session manager` that keeps sessions in memory. Our middleware works, but FastMCP blocks cross-replica requests internally.
Sticky Sessions	✅ Works	Load balancer routes same client to same replica. Not true scaling but works.
Stateless HTTP Mode	✅ Works	Official solution - no sessions at all!

The Solution: `stateless_http=True`

The official Python SDK has a stateless mode specifically designed for horizontal scaling:

from fastmcp import FastMCP

# Enable stateless mode for multi-node deployment
mcp = FastMCP(
    name="my-server",
    stateless_http=True,   # No session tracking
    json_response=True     # JSON instead of SSE
)

What stateless mode does:

❌ No Mcp-Session-Id header issued or required
❌ No session state between requests
✅ Each request is completely independent
✅ Any replica can handle any request
✅ True round-robin load balancing works!

📊 Mode Comparison

Feature	Stateful (Default)	Stateless Mode
Session tracking	✅ In-memory	❌ None
Multi-replica	❌ Needs sticky sessions	✅ Native support
Server-push notifications	✅ Supported	❌ Not available
SSE streaming	✅ Supported	⚠️ Limited
Progress updates	✅ Real-time	❌ Must poll
Load balancing	Sticky/affinity	Round-robin
Best for	Long-running, interactive	Simple tools, APIs

🎯 When to Use This Template

✅ USE This Template When:

High-Concurrency Scenarios:

Serving 100+ concurrent users
API gateway for multiple LLM clients
Multi-tenant SaaS applications
Public-facing MCP services

Stateless Operations:

Database queries (read-only or simple CRUD)
REST API calls and data transformations
File system operations (read files, list directories)
Simple calculations or data processing
Short-lived tool calls (<30 seconds)

Infrastructure Requirements:

Docker Swarm or Kubernetes deployment
Need for high availability (HA)
Auto-scaling requirements
Load balancing across multiple nodes

Example Use Cases:

Database MCP Server: Query company databases for sales data, customer info
API Integration MCP: Call external APIs (weather, stock prices, CRM data)
File System MCP: Read configuration files, search logs, list directories
Analytics MCP: Run analytics queries and return aggregated results
Utilities MCP: Data transformation, format conversion, validation

❌ DON'T Use This Template When:

Stateful Operations:

Long-running tasks (>1 minute) with progress updates
Streaming responses with real-time updates
Interactive workflows with multiple steps
Server-initiated notifications

Session-Dependent Features:

Elicitation with callbacks
Sampling (LLM completions from server)
Multi-step wizards with state
WebSocket-style bidirectional communication

Alternative Solution: For stateful operations, use standard MCP with sticky sessions or a single-replica deployment.

💡 Real-World Use Case Examples

Example 1: Analytics MCP for Sales Dashboard

Scenario: 100 sales reps querying company database simultaneously
Traffic: 50-100 concurrent requests
Tools: get_sales_by_region, get_top_customers, get_revenue_trend
Why Swarm: High concurrency, simple queries, stateless operations
Result: 54+ req/sec throughput, 100% success rate

Example 2: Multi-Tenant API Gateway MCP

Scenario: SaaS product serving 50 companies, each with multiple AI agents
Traffic: 200+ concurrent requests across tenants
Tools: call_external_api, transform_data, validate_schema
Why Swarm: Multi-tenancy, horizontal scaling, load distribution
Result: Even distribution across replicas, no single point of failure

Example 3: File System MCP for DevOps

Scenario: Multiple CI/CD pipelines querying deployment configs
Traffic: Bursty (10-50 concurrent during deployments)
Tools: read_config, list_deployments, check_service_status
Why Swarm: Bursty traffic, HA requirement, simple read operations
Result: Auto-scaling handles traffic spikes

🎯 When to Use Each Mode

Use Stateless Mode (`stateless_http=True`) for:

✅ High-concurrency APIs with many users
✅ Simple tool calls (query database, call APIs)
✅ Kubernetes/Swarm deployments
✅ Serverless/Lambda deployments
✅ When you don't need server-to-client push

Use Stateful Mode (default) for:

✅ Long-running tasks with progress updates
✅ Server-initiated notifications
✅ Sampling (LLM completions from server)
✅ Elicitation with callbacks
✅ Single-replica deployments

🚀 Quick Start

1. Build the Image

cd template_mcp_swarm_poc
docker build -t template_mcp_swarm:latest -f server/Dockerfile server/

2. Deploy to Swarm

# Initialize swarm (if not already)
docker swarm init

# Deploy with 3 replicas
docker stack deploy -c stack.yml mcp-swarm

3. Test Load Balancing

# Each request may hit a different replica
for i in {1..5}; do
  curl -s http://localhost:8150/version | jq .hostname
done

4. Test Tool Calls

# All replicas can handle any request
curl -X POST http://localhost:8150/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'

5. Run Load Test (Proof of Concept)

# Run with 100 requests, 20 concurrent
py -3.12 test_swarm_load.py --requests 100 --concurrent 20

# Heavy load test: 300 requests, 50 concurrent
py -3.12 test_swarm_load.py --requests 300 --concurrent 50

✅ Proof of Concept: Load Test Results

Test Configuration

Date: January 2, 2026
MCP Version: 1.16.0, FastMCP 2.12.4
Replicas: 3 Docker containers in Swarm mode
Transport: Streamable HTTP with JSON responses

Test 1: Standard Load (100 requests, 20 concurrent)

🏋️  MCP SWARM LOAD TEST REPORT
======================================================================
📊 SUMMARY
   Total Requests:     100
   Successful:         100
   Failed:             0
   Success Rate:       100.0%
   Test Duration:      2.27 seconds
   Throughput:         44.0 req/sec

⏱️  RESPONSE TIMES
   Min:                43.7 ms
   Max:                274.2 ms
   Average:            146.7 ms

🐳 REPLICA DISTRIBUTION (Proof of Load Balancing)
   Replicas Used:      3
--------------------------------------------------
   ad1b23ba0c29 |   32 requests ( 32.0%) | avg 144ms | ████████████████
   d0b72722c047 |   33 requests ( 33.0%) | avg 147ms | ████████████████
   fac34020dd98 |   35 requests ( 35.0%) | avg 148ms | █████████████████

✅ VERDICT: Load balancing is WORKING! Multiple replicas handled requests.

Test 2: Heavy Load (300 requests, 50 concurrent)

🏋️  MCP SWARM LOAD TEST REPORT
======================================================================
📊 SUMMARY
   Total Requests:     300
   Successful:         300
   Failed:             0
   Success Rate:       100.0%
   Test Duration:      5.53 seconds
   Throughput:         54.2 req/sec

⏱️  RESPONSE TIMES
   Min:                90.6 ms
   Max:                828.4 ms
   Average:            433.8 ms

🐳 REPLICA DISTRIBUTION (Proof of Load Balancing)
   Replicas Used:      3
--------------------------------------------------
   ad1b23ba0c29 |  123 requests ( 41.0%) | avg 432ms | ██████████████████████
   d0b72722c047 |   52 requests ( 17.3%) | avg 408ms | █████████
   fac34020dd98 |  125 requests ( 41.7%) | avg 446ms | ██████████████████████

✅ VERDICT: Load balancing is WORKING! Multiple replicas handled requests.

Key Findings

Metric	Standard Load	Heavy Load
Total Requests	100	300
Concurrent	20	50
Success Rate	100%	100%
Throughput	44 req/sec	54 req/sec
Avg Response	147ms	434ms
Replicas Used	3/3	3/3

Conclusion: Horizontal scaling with stateless_http=True works perfectly. All 3 replicas handled requests successfully with even distribution.

🔬 Detailed Validation with Container Logs

Docker Logs Show True Load Balancing

Container logs confirm that all 3 replicas processed requests during load testing:

# Replica 1 (mcp-swarm_mcp.1.ievzl6i4a8wm)
2026-01-02 06:33:03,969 - utils.request_logging - INFO - [c565931a] ✓ 200 POST /mcp (72ms)
2026-01-02 06:33:04,136 - utils.request_logging - INFO - [43299087] ✓ 200 POST /mcp (71ms)
2026-01-02 06:33:04,136 - utils.request_logging - INFO - [659ca36f] ✓ 200 POST /mcp (68ms)
INFO: 10.0.0.2:43442 - "POST /mcp HTTP/1.1" 200 OK

# Replica 2 (mcp-swarm_mcp.2.fa38bizm88po)
2026-01-02 06:33:04,122 - utils.request_logging - INFO - [86ccb76d] ✓ 200 POST /mcp (50ms)
2026-01-02 06:33:04,149 - utils.request_logging - INFO - [e0cf56ea] ✓ 200 POST /mcp (68ms)
2026-01-02 06:33:04,149 - utils.request_logging - INFO - [fed7b42a] ✓ 200 POST /mcp (73ms)
INFO: 10.0.0.2:43464 - "POST /mcp HTTP/1.1" 200 OK

# Replica 3 (mcp-swarm_mcp.3.lwphmixr2bps)
2026-01-02 06:33:04,128 - utils.request_logging - INFO - [235902b1] ✓ 200 POST /mcp (67ms)
2026-01-02 06:33:04,146 - utils.request_logging - INFO - [26ca2308] ✓ 200 POST /mcp (67ms)
2026-01-02 06:33:04,146 - utils.request_logging - INFO - [7a710a21] ✓ 200 POST /mcp (65ms)
INFO: 10.0.0.2:43456 - "POST /mcp HTTP/1.1" 200 OK

What This Proves:

✅ All 3 replicas are active and processing requests
✅ Load balancing works - requests distributed across containers
✅ No session errors - stateless mode allows any replica to handle any request
✅ Concurrent processing - multiple requests handled simultaneously per replica
✅ Consistent performance - 50-73ms response times across all replicas

X-Served-By Header Validation

Each response includes the X-Served-By header showing which container handled the request:

# Test multiple requests
for i in {1..5}; do
  curl -s http://localhost:8150/version | jq '.hostname'
done

# Output shows different hostnames (load balancing in action):
"ad1b23ba0c29"
"fac34020dd98"
"d0b72722c047"
"ad1b23ba0c29"
"fac34020dd98"

This header is added by our HostnameHeaderMiddleware in server.py and enables accurate tracking of replica distribution during load tests.

🏗️ Architecture

Stateless Architecture (Recommended for Scaling)

                    ┌─────────────────┐
                    │  Load Balancer  │
                    │  (Round Robin)  │
                    └────────┬────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
        ▼                    ▼                    ▼
  ┌──────────┐        ┌──────────┐        ┌──────────┐
  │ MCP #1   │        │ MCP #2   │        │ MCP #3   │
  │ stateless│        │ stateless│        │ stateless│
  └──────────┘        └──────────┘        └──────────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │  Database/APIs  │
                    │ (Shared State)  │
                    └─────────────────┘

Key Points:

No session affinity needed
Any replica handles any request
Shared state goes to database, not memory
True horizontal scaling achieved

📁 Project Structure

template_mcp_swarm_poc/
├── stack.yml               # Swarm deployment (3 replicas)
├── docker-compose.yml      # Local development
├── test_mcp_client.py      # MCP client test (official SDK)
├── test_swarm_load.py      # Heavy load test for swarm validation
├── TODO.md                 # Future enhancements roadmap
├── server/
│   ├── Dockerfile
│   ├── server.py           # Main ASGI app + hostname middleware
│   ├── mcp_app.py          # FastMCP instance (stateless_http=True)
│   ├── config.py           # Configuration loader
│   ├── tools/              # Auto-discovered MCP tools
│   ├── resources/          # Auto-discovered MCP resources
│   └── prompts/            # Auto-discovered MCP prompts
└── README.md

🧪 Test Scripts

test_mcp_client.py

Tests MCP server using the official MCP Python SDK client:

Session initialization
Tool listing and invocation
Resource and prompt listing
Error handling

py -3.12 test_mcp_client.py

test_swarm_load.py

Heavy concurrent load test for swarm validation:

Configurable requests and concurrency
Tracks which replica handled each request
Reports distribution statistics
Response time metrics

# Default: 50 requests, 10 concurrent
py -3.12 test_swarm_load.py

# Custom load
py -3.12 test_swarm_load.py --requests 200 --concurrent 30

# Options
py -3.12 test_swarm_load.py --help

🔧 Configuration

Environment Variables

Variable	Default	Description
`MCP_PORT`	`8000`	Server port
`MCP_NAME`	`template-mcp`	Server name
`AUTH_ENABLED`	`false`	Enable Bearer token auth
`AUTH_TOKEN`	-	Required token (if auth enabled)

📚 Key Learnings

1. MCP is Stateful by Default

The protocol uses session IDs and SSE connections that tie clients to specific servers.

2. FastMCP Internal Session Manager

Even with Redis middleware, FastMCP's internal StreamableHTTPSessionManager blocks cross-replica sessions.

3. Stateless Mode is the Answer

Using stateless_http=True completely removes session tracking, enabling true horizontal scaling.

4. Trade-offs Exist

Stateless mode sacrifices some features (server push, real-time progress) for scalability.

5. X-Served-By Header

We added a middleware that returns the container hostname in the X-Served-By response header, enabling load test tracking.

🔗 References

📝 Notes

This template was created to demonstrate MCP horizontal scaling in production environments. The stateless mode is the officially recommended approach for multi-node deployments.

Bottom Line: For horizontal scaling, use stateless_http=True + json_response=True.

🎉 Template Status: PRODUCTION READY ✅

Validated on January 2, 2026:

✅ 3 replicas running in Docker Swarm
✅ 100% success rate under heavy load (300 requests, 50 concurrent)
✅ All replicas handling requests (true load balancing)
✅ 54+ requests/second throughput
✅ Official MCP client test passed
✅ Container logs confirm distributed processing

📋 TODO & Future Enhancements

See TODO.md for detailed implementation plans and tracking

Priority 1: Message Queue Integration

Implement Queue Mechanism for long-running tasks
- Problem: Stateless mode doesn't support server-push for progress updates
- Solution: Use message queue (Redis/RabbitMQ) for async task processing
- Architecture:
```
MCP Tool Call → Queue Job → Worker Pool → Poll for Results
```
- Benefits:
  - Handle long-running tasks (>30 sec)
  - Provide job status and progress tracking
  - Enable async workflows while keeping stateless HTTP
- Example Use Cases:
  - Video processing
  - Large data exports
  - ML model inference
  - Report generation

Priority 2: Monitoring & Observability

Add Prometheus metrics endpoint
Implement distributed tracing (OpenTelemetry)
Add Grafana dashboards for replica health
Request rate limiting per replica

Priority 3: Advanced Features

Implement circuit breaker for downstream services
Add request caching layer (Redis)
Auto-scaling based on CPU/memory metrics
Health check improvements with liveness/readiness probes

Priority 4: Security Enhancements

Add rate limiting per API key/user
Implement request signing/verification
Add audit logging for compliance
Secret rotation for auth tokens

🚀 Getting Started with Queue Integration

When you're ready to add queue support for long-running tasks:

Add Redis/RabbitMQ service to stack.yml
Create queue worker service (separate from MCP)
Update MCP tools to submit jobs instead of blocking
Add polling endpoint: GET /jobs/{job_id}/status
Keep stateless HTTP for instant queries, use queue for slow tasks

Hybrid Architecture:

Fast Queries (< 5 sec)  → Stateless MCP (instant response)
Slow Tasks (> 5 sec)    → Queue Worker (async + polling)

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
server		server
tests		tests
.env.example		.env.example
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
SPEC.md		SPEC.md
TODO.md		TODO.md
docker-compose.yml		docker-compose.yml
stack.yml		stack.yml
test_mcp_client.py		test_mcp_client.py
test_swarm_load.py		test_swarm_load.py

Folders and files

Latest commit

History

Repository files navigation

MCP Horizontal Scaling Template (Docker Swarm)

🔬 Research Summary

The Problem

What We Tried

The Solution: stateless_http=True

📊 Mode Comparison

🎯 When to Use This Template

✅ USE This Template When:

❌ DON'T Use This Template When:

💡 Real-World Use Case Examples

Example 1: Analytics MCP for Sales Dashboard

Example 2: Multi-Tenant API Gateway MCP

Example 3: File System MCP for DevOps

🎯 When to Use Each Mode

Use Stateless Mode (stateless_http=True) for:

Use Stateful Mode (default) for:

🚀 Quick Start

1. Build the Image

2. Deploy to Swarm

3. Test Load Balancing

4. Test Tool Calls

5. Run Load Test (Proof of Concept)

✅ Proof of Concept: Load Test Results

Test Configuration

Test 1: Standard Load (100 requests, 20 concurrent)

Test 2: Heavy Load (300 requests, 50 concurrent)

Key Findings

🔬 Detailed Validation with Container Logs

Docker Logs Show True Load Balancing

What This Proves:

X-Served-By Header Validation

🏗️ Architecture

Stateless Architecture (Recommended for Scaling)

📁 Project Structure

🧪 Test Scripts

test_mcp_client.py

test_swarm_load.py

🔧 Configuration

Environment Variables

📚 Key Learnings

1. MCP is Stateful by Default

2. FastMCP Internal Session Manager

3. Stateless Mode is the Answer

4. Trade-offs Exist

5. X-Served-By Header

🔗 References

📝 Notes

🎉 Template Status: PRODUCTION READY ✅

📋 TODO & Future Enhancements

Priority 1: Message Queue Integration

Priority 2: Monitoring & Observability

Priority 3: Advanced Features

Priority 4: Security Enhancements

🚀 Getting Started with Queue Integration

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

The Solution: `stateless_http=True`

Use Stateless Mode (`stateless_http=True`) for:

Packages