IMPORTANT NOTICE
This is a personal open-source project created for community knowledge sharing and educational purposes only. It is not affiliated with, endorsed by, supported by, or maintained by Red Hat, Inc., its subsidiaries, or any of its employees in an official capacity. All opinions, recommendations, and content are solely those of the individual contributor(s) and do not represent the views of any employer or organization.
This project:
- Is NOT official Red Hat documentation or tooling
- Is NOT a substitute for Red Hat support, professional services, or official product documentation
- Does NOT come with any warranty, guarantee, or support commitment
- May contain inaccuracies, outdated information, or recommendations that do not apply to your environment
- References publicly available documentation and open-source upstream project sources only
Use at your own risk. Before executing any command on a production or customer cluster, you are solely responsible for understanding the impact, validating against official Red Hat documentation, ensuring change management approvals, and maintaining rollback plans. The author(s) and contributor(s) expressly disclaim all liability for any damage, data loss, service disruption, or other harm resulting from the use of this toolkit.
Red Hat and OpenShift are trademarks or registered trademarks of Red Hat, Inc. in the United States and other countries. All other trademarks are the property of their respective owners.
See the full disclaimer in each runbook for additional terms.
A collection of structured diagnostic runbooks for troubleshooting OpenShift Container Platform 4.20 EgressIP issues with OVN-Kubernetes. Designed to be used with Claude Code (Anthropic's AI coding assistant) as an interactive guide — Claude reads the runbooks and knowledge base, logs into the cluster, runs read-only diagnostic commands, and walks the operator through findings and remediation options.
What this is: Curated runbooks + a knowledge base of publicly available OCP/OVN documentation. Claude acts as an interactive assistant that follows the runbooks step-by-step — it does not make autonomous changes. All cluster modifications require explicit operator approval.
What this is NOT: An automated tool, a product, or a replacement for expertise. The operator remains in control at all times. Claude provides structure, references, and analysis — the human makes the decisions.
cd /path/to/OCP_4.20_egressIP_check
claude
Before Claude runs any commands, it needs to understand your environment. Start by telling Claude:
Required information:
- Platform: AWS, bare metal, or vSphere?
- OCP version: Which version are you running? (e.g., 4.20.5, 4.20.21)
- What you're trying to do: Set up EgressIP for the first time, or troubleshoot a broken EgressIP?
- Cluster API URL: The API endpoint for oc login
- Credentials: Username/password or token (Claude uses these interactively and never saves them)
Helpful context (if available):
- Which namespace(s) need EgressIP?
- What EgressIP address are you planning to use (or already using)?
- What external service are you trying to reach?
- What symptom are you seeing? (e.g., "traffic stops", "wrong source IP", "intermittent drops")
- Has EgressIP ever worked on this cluster, or is this the first attempt?
- Were there any recent changes? (node reboot, upgrade, config change)
Example prompts:
Setting up EgressIP: "I need to set up EgressIP on our OCP 4.20.5 cluster on AWS. The API URL is https://api.mycluster.example.com:6443. We want namespace production to use EgressIP 10.0.1.100 for external traffic to our partner API."
Troubleshooting: "Our EgressIP stopped working after a node reboot. We're on OCP 4.20.5 on bare metal. The EgressIP is 192.168.1.50 assigned to worker-3. Pods in the app namespace can't reach external services but internal traffic is fine. API URL is https://api.ocp.internal:6443."
Based on your input, Claude will:
- Present a disclaimer — you must acknowledge before proceeding
- Select the right runbook:
  - Setting up → Runbook 1 (pre-requisite check + guided setup)
  - Troubleshooting → Runbook 2 (diagnostic + remediation)
- Log into the cluster and discover the actual network topology
- Walk through each step, explaining what it's checking and what the results mean
- Ask for approval before any action that modifies the cluster (marked with ⛔)
- Save all findings to collected-data/ for an audit trail
OCP_4.20_egressIP_check/
│
├── CLAUDE.md # Instructions for Claude — safety rules, how to use runbooks
├── README.md # This file
│
├── runbooks/
│ ├── 01-pre-requisite-check.md # Before enabling EgressIP — validates cluster readiness
│ └── 02-diagnostic.md # When EgressIP is broken — finds root cause + fix
│
├── knowledge-base/ # Offline reference docs (for air-gapped clusters)
│ ├── official-docs/ # Red Hat product docs (OCP 4.16–4.20, ROSA, OKD)
│ ├── troubleshooting/ # Red Hat KB articles + comprehensive bug matrix
│ ├── upstream/ # OVN-Kubernetes design docs (GitHub)
│ └── community/ # Hands-on blog posts with debugging commands
│
├── templates/
│ ├── report-styles.css # Shared CSS for HTML reports
│ ├── pre-requisite-report-template.html # Runbook 1 report template
│ ├── diagnosis-summary-template.html # Runbook 2 report template
│ └── samples/ # Example outputs (open in browser)
│ ├── sample-pre-requisite-report.html
│ ├── sample-diagnosis-summary.html
│ ├── sample-current-state-diagram.html
│ └── sample-egressip-enabled-diagram.html
│
├── diagrams/ # Generated HTML topology diagrams (gitignored)
└── collected-data/ # Cluster diagnostic outputs (gitignored)
Runbook 1 (runbooks/01-pre-requisite-check.md)
Use when: Before enabling EgressIP on a cluster, or to set up EgressIP with guided assistance.
Capabilities:
- Validates RBAC permissions (cluster-admin or minimum required permissions)
- Detects platform (AWS/bare metal/vSphere), CNI (must be OVN-Kubernetes), OCP version
- Version-specific bug assessment — checks OCP z-stream against 20+ known EgressIP bugs, flags risks, recommends upgrades
- Inventories nodes with ENI capacity (AWS), egress-assignable labels, pod subnets
- Validates node reachability probes (TCP port 9, ICMP unreachable), host firewall rules
- Checks for IP conflicts, namespace label matching, EgressFirewall deny rules, UDN interactions
- Stale OVN database inspection — detects leftover SNAT entries from previous configurations
- Generates before/after Mermaid network topology diagrams with actual cluster data
- Produces a pass/fail readiness report with all findings
- External test node setup guide — AWS EC2, existing VM, or public service options with full scripted setup
- Interactive EgressIP setup: asks 4 questions (namespace, IP, nodes, podSelector), generates exact oc commands, applies them with user approval, and validates OVN SNAT + external connectivity
- Bold safety warnings (⛔) on every cluster/infrastructure modification, with reversibility info
Safety: All mutating actions require explicit user approval. Read-only by default. Every oc command is shown before execution.
Steps:
| Step | Name | What It Does |
|---|---|---|
| 0 | RBAC Pre-check | Verifies Claude has permission to run all commands. Halts if insufficient. |
| 1 | Environment Detection | Logs into cluster, detects OCP version, CNI (must be OVN-K), platform (AWS/bare metal) |
| 1b | Version Bug Assessment | Cross-references OCP z-stream against 20+ known bugs. Flags HIGH/MEDIUM/LOW risk. |
| 2 | Node Inventory | Lists all nodes, egress-assignable labels, ENI capacity (AWS), control plane guard |
| 3 | Network Topology | Captures pod/service CIDRs, existing EgressIP/EgressFirewall/UDN objects |
| 4 | Reachability Probes | Checks TCP port 9 probes, host firewall rules, reachabilityTotalTimeoutSeconds |
| 5 | Platform Checks | AWS: ENI subnet match, security groups, route tables. Bare metal: NIC config, ARP, routing |
| 6 | IP & Namespace Validation | Confirms planned EgressIP is available, namespace labels match, no conflicts |
| 7 | Generate Diagrams | Creates Mermaid diagrams of current state and projected EgressIP topology |
| 7b | External Test Node Setup | Guides setup of an EC2 (AWS) or VM (bare metal) to verify source IP changes |
| 8 | Summary Report | Pass/fail checklist, blockers, warnings, ready-to-apply commands |
| 8b | Interactive EgressIP Setup | Asks 4 questions, generates oc commands, applies with approval, validates OVN + connectivity |
| 9 | Checkpoint | Saves progress to collected-data/checkpoint.yaml for session resume |
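The interactive setup in step 8b turns four answers (namespace, IP, nodes, podSelector) into oc commands. The sketch below shows one way those answers could map onto an EgressIP object and node-label commands. It is an illustration, not the runbook's actual generator: the function names, the object name egressip-production, and the node names worker-1 and worker-2 are hypothetical, and the namespace is selected through the kubernetes.io/metadata.name label as an assumption. Nothing here touches a cluster; it only prints what an operator would review.

```python
import json


def build_egressip_manifest(name, egress_ips, namespace_labels, pod_labels=None):
    """Build an EgressIP custom resource (k8s.ovn.org/v1) as a plain dict."""
    spec = {
        "egressIPs": list(egress_ips),
        "namespaceSelector": {"matchLabels": dict(namespace_labels)},
    }
    # podSelector is optional; without it every pod in the namespace matches.
    if pod_labels:
        spec["podSelector"] = {"matchLabels": dict(pod_labels)}
    return {
        "apiVersion": "k8s.ovn.org/v1",
        "kind": "EgressIP",
        "metadata": {"name": name},
        "spec": spec,
    }


def node_label_commands(nodes):
    """Return the oc commands that mark nodes as egress-assignable."""
    return [f'oc label node {node} k8s.ovn.org/egress-assignable=""' for node in nodes]


# Hypothetical answers to the four questions.
manifest = build_egressip_manifest(
    name="egressip-production",
    egress_ips=["10.0.1.100"],
    namespace_labels={"kubernetes.io/metadata.name": "production"},
)
commands = node_label_commands(["worker-1", "worker-2"])

# Printed for operator review; oc apply -f accepts JSON as well as YAML.
print(json.dumps(manifest, indent=2))
print("\n".join(commands))
```

Keeping generation separate from execution mirrors the safety rule above: the operator sees every command and the full manifest before approving anything.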
Runbook 2 (runbooks/02-diagnostic.md)
Use when: EgressIP is enabled but external egress is broken, or traffic uses the wrong source IP.
Symptom: Pods lose external egress when EgressIP is enabled. Internal pod-to-pod traffic still works.
Capabilities:
- Quick Triage fast path (< 2 min): checks 3 most common causes — missing node label, unassigned EgressIP, ENI capacity exceeded
- Full OVN inspection: logical router policies (priority 100/102), NAT entries on gateway routers, OVS flow table
- ovn-nbctl show quick overview: adapted from a community blog, shows NAT state at a glance
- The smoking gun: OVS priority 103 drop rule without priority 105 SNAT = silent traffic drop
- Stale SNAT / conntrack validation — cross-node duplicate detection, dead pod SNAT cleanup, conntrack zone 64000 inspection
- EgressFirewall conflict detection — catches deny rules that silently block traffic after SNAT
- Platform checks: AWS (ENI secondary IP, security groups, route tables) and bare metal (ARP table, stale GARP bug OCPBUGS-62273/65618)
- Version-specific bug correlation — maps symptom + OCP version to specific OCPBUGS with fix versions
- Connectivity comparison testing with external test node (or fallback chain: curl → wget → /dev/tcp → oc debug)
- Root cause decision tree with 11 known causes and exact remediation commands
- Cross-references knowledge base for each root cause
- OCP 4.20 pod architecture documentation: notes that ovnkube-master does not exist (replaced by ovnkube-control-plane + ovnkube-node)
- Correct container names: ovnkube-controller for OVN commands, ovn-controller for OVS commands
- Correct bridge name: br-ex on AWS (not breth0 from upstream docs)
Safety: Read-only diagnostics. Remediation commands (pod restart, conntrack flush) require explicit user approval with bold ⛔ warnings.
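The Quick Triage fast path listed above checks three causes: a missing node label, an unassigned EgressIP, and exceeded ENI capacity. A minimal sketch of that decision logic follows, assuming the EgressIP object and node list have already been fetched as JSON (for example with oc get ... -o json) and parsed into dicts. The function name and the capacity mapping (node name to maximum number of assignable IPs) are hypothetical; how the capacity figure is obtained on AWS is out of scope here.

```python
from collections import Counter

EGRESS_LABEL = "k8s.ovn.org/egress-assignable"


def quick_triage(egressip, nodes, capacity):
    """Return findings for the three most common EgressIP failure causes."""
    findings = []

    # Cause 1: no node carries the egress-assignable label.
    labeled = [
        n["metadata"]["name"]
        for n in nodes
        if EGRESS_LABEL in n["metadata"].get("labels", {})
    ]
    if not labeled:
        findings.append("missing node label")

    # Cause 2: an IP requested in spec has no entry in status.items.
    items = egressip.get("status", {}).get("items", [])
    assigned = {item["egressIP"] for item in items}
    unassigned = [ip for ip in egressip["spec"]["egressIPs"] if ip not in assigned]
    if unassigned:
        findings.append("unassigned EgressIP: " + ", ".join(unassigned))

    # Cause 3: every labeled node already hosts as many IPs as it can hold.
    # 'capacity' is a hypothetical, pre-extracted {node: max_ips} mapping.
    used = Counter(item["node"] for item in items)
    full = [n for n in labeled if n in capacity and used[n] >= capacity[n]]
    if unassigned and labeled and len(full) == len(labeled):
        findings.append("ENI capacity exhausted on every egress-assignable node")

    return findings


# Hypothetical inputs modeled on the troubleshooting prompt earlier in this README.
broken = quick_triage(
    {"spec": {"egressIPs": ["192.168.1.50"]}, "status": {}},
    [{"metadata": {"name": "worker-3", "labels": {}}}],
    capacity={},
)
print(broken)
```

An empty result means none of the three fast-path causes matched, which is the point at which the full OVN inspection would start.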
Steps:
| Step | Name | What It Does |
|---|---|---|
| 0 | RBAC Pre-check | Verifies permissions including pods/exec and pods/log in openshift-ovn-kubernetes |
| 1 | Environment Detection | Logs in, detects OCP version, CNI, platform, reachability timeout |
| 2 | Quick Triage (< 2 min) | Checks 3 most common causes: missing label, unassigned EgressIP, ENI capacity. Skips to fix if hit. |
| 3 | EgressIP CR Inspection | Dumps CR state, validates selectors, detects dual-assignment bug (OCPBUGS-59531) |
| 4 | Node Validation | Checks assigned node is Ready, not cordoned, labeled, reachability timeout |
| 5 | Controller Logs | Searches ovnkube-controller logs for reconciliation errors, allocation failures |
| 6-quick | ovn-nbctl show | Quick NAT overview: shows all SNAT entries for the EgressIP at a glance |
| 6a | Logical Router Policies | Checks priority 102 (east-west exemption) and priority 100 (egress redirect) |
| 6b | NAT on Gateway Router | Checks SNAT entry on GR — missing SNAT is the #1 cause of egress failure |
| 6c | UDN Branch | If user-defined networks exist, checks alternate router GR_<network>_<node> |
| 6d | Southbound Flows | Verifies NB entries committed to SB database |
| 6e | OVS Flow Table | The smoking gun — checks priority 103 drop / 105 SNAT on br-ex |
| 6f | Stale SNAT / Conntrack | Cross-node SNAT duplicates, dead pod entries, conntrack zone 64000 inspection |
| 7 | EgressFirewall Check | Detects deny rules that silently block traffic after EgressIP SNAT |
| 8 | Platform Checks | AWS: ENI secondary IP, security groups. Bare metal: ARP table, stale GARP bug |
| 9 | Connectivity Testing | External test node verification with fallback chain. Comparison test table. |
| 10 | Diagnosis & Remediation | Decision tree, confidence levels, 11 root causes with exact oc fix commands |
| 11 | Checkpoint | Saves root cause, confidence, remediation status for session resume |
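Step 6e looks for the pattern this README calls the smoking gun: a priority 103 drop rule on br-ex with no priority 105 SNAT rule. The sketch below shows how that check could be expressed over a saved flow dump. It assumes the dump is text in which each flow carries a priority=N field and an actions=... suffix. The sample lines are simplified, made-up stand-ins rather than captured output, and the function name is hypothetical. A real check would also confirm that the priority 105 flow refers to the specific EgressIP under investigation.

```python
import re

PRIORITY_RE = re.compile(r"priority=(\d+)")


def classify_flows(dump):
    """Classify a flow dump as 'silent-drop', 'snat-present', or 'no-egressip-flows'."""
    has_drop_103 = False
    has_105 = False
    for line in dump.splitlines():
        match = PRIORITY_RE.search(line)
        if not match:
            continue  # header or blank line
        priority = int(match.group(1))
        if priority == 103 and line.rstrip().endswith("actions=drop"):
            has_drop_103 = True
        elif priority == 105:
            has_105 = True
    if has_drop_103 and not has_105:
        return "silent-drop"
    if has_105:
        return "snat-present"
    return "no-egressip-flows"


# Simplified, hypothetical flow lines for illustration only.
BROKEN_DUMP = """\
 table=0, priority=103,ip,in_port=2 actions=drop
 table=0, priority=100,ip,in_port=2 actions=NORMAL
"""
HEALTHY_DUMP = """\
 table=0, priority=105,ip,in_port=2,nw_src=10.0.1.100 actions=ct(commit,zone=64000,nat(src=10.0.1.100))
 table=0, priority=103,ip,in_port=2 actions=drop
"""

print(classify_flows(BROKEN_DUMP))
print(classify_flows(HEALTHY_DUMP))
```

Working from a saved dump keeps the check read-only and lets the result be stored alongside the other files in collected-data/.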
Pre-fetched documentation for offline/air-gapped use:
| Directory | Contents | Files |
|---|---|---|
| official-docs/ | Red Hat product docs for OCP 4.16–4.20, ROSA, OKD, API reference | 7 files |
| troubleshooting/ | Red Hat KB articles (7005481, 6247851, 7058538) + comprehensive bug matrix with 20+ known bugs mapped to fix versions | 4 files |
| upstream/ | OVN-Kubernetes EgressIP design doc, DeepWiki source walkthrough | 2 files |
| community/ | Rcarrata end-to-end blogs (OVN-K and SDN) | 2 files |
The bug matrix maps every known EgressIP bug to:
- Affected OCP versions
- Fix version (which z-stream)
- Whether a specific customer version (e.g., 4.20.5) is affected
- Diagnostic commands to detect each bug
- Remediation steps
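Deciding whether a specific z-stream is affected comes down to comparing it with the fix version for the same minor stream. The sketch below illustrates that comparison under stated assumptions: the matrix layout (bug ID mapped to a per-minor fix version) is hypothetical, and the bug ID and fix versions are placeholders, not entries from the real bug matrix.

```python
def parse_version(version):
    """Turn '4.20.5' into (4, 20, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_affected(cluster_version, fixes_by_minor):
    """Return True if cluster_version predates the fix for its minor stream.

    fixes_by_minor maps a minor stream such as '4.20' to the first fixed
    z-stream, or to None when no fix has shipped for that stream.
    """
    major, minor, patch = parse_version(cluster_version)
    stream = f"{major}.{minor}"
    if stream not in fixes_by_minor:
        return False  # this minor stream is not listed as affected
    fix = fixes_by_minor[stream]
    if fix is None:
        return True
    return (major, minor, patch) < parse_version(fix)


# Placeholder entry: the ID and fix versions are invented for illustration.
EXAMPLE_MATRIX = {
    "EXAMPLE-BUG-1": {"4.19": "4.19.12", "4.20": "4.20.8"},
}

for bug_id, fixes in EXAMPLE_MATRIX.items():
    for version in ("4.20.5", "4.20.21", "4.18.3"):
        print(bug_id, version, is_affected(version, fixes))
```

Parsing into integer tuples matters because plain string comparison would sort 4.20.21 before 4.20.5.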
All diagnostic outputs are saved to collected-data/ during runbook execution:
- YAML files for structured data (cluster info, node inventory, OVN internals)
- TXT files for raw logs (ovnkube-controller logs)
- MD files for human-readable reports and diagrams
Security: This directory is gitignored. All files include a sensitivity header. No authentication tokens, passwords, or kubeconfig contents are ever saved.
| Platform | Runbook 1 | Runbook 2 |
|---|---|---|
| AWS | Full support | Full support |
| Bare metal | Full support (via oc debug node/) | Full support + GARP bug detection |
| vSphere | Basic support (similar to bare metal) | Basic support |
Tested on OCP 4.20.21. Compatible with OCP 4.12+ (OVN-Kubernetes required).
The bug matrix covers versions 4.10 through 4.20.21 with specific z-stream fix mappings.
The templates/samples/ directory contains example HTML outputs generated from a real test cluster (scrubbed of real IPs/identifiers). Open them in a browser to see what the runbooks produce:
| Sample | Description |
|---|---|
| sample-pre-requisite-report.html | Runbook 1 output: readiness checks, cluster topology with nodes/EgressIPs/OVN, traffic flow diagrams, connectivity verification |
| sample-diagnosis-summary.html | Runbook 2 output: root cause (EgressFirewall deny), diagnostic timeline, three-plane state, affected flow with failure point, remediation commands |
| sample-current-state-diagram.html | Before EgressIP: cluster topology with default SNAT-to-node-IP egress path |
| sample-egressip-enabled-diagram.html | After EgressIP: multi-namespace multi-node topology with cross-node rerouting, OVN NAT tables, connectivity results |
The templates/ directory also contains:
- pre-requisite-report-template.html: HTML template with {{PLACEHOLDER}} variables for Runbook 1
- diagnosis-summary-template.html: HTML template with {{PLACEHOLDER}} variables for Runbook 2
- report-styles.css: shared CSS reference for the design language
Both runbooks were tested against a live OCP 4.20 cluster on AWS with:
- 3 namespaces, 2 worker nodes, 3 EgressIPs (multi-node multi-namespace)
- External EC2 test node for source IP verification
- Two break/fix scenarios (label removal, EgressFirewall deny)
- OVN internals inspection verified against actual cluster state
- Container names, bridge names, and command syntax corrected from testing