1TommyCheung/OCP_egressIP_check


OCP 4.20 EgressIP — AI-Assisted Diagnostic Runbooks

IMPORTANT NOTICE

This is a personal open-source project created for community knowledge sharing and educational purposes only. It is not affiliated with, endorsed by, supported by, or maintained by Red Hat, Inc., its subsidiaries, or any of its employees in an official capacity. All opinions, recommendations, and content are solely those of the individual contributor(s) and do not represent the views of any employer or organization.

This project:

  • Is NOT official Red Hat documentation or tooling
  • Is NOT a substitute for Red Hat support, professional services, or official product documentation
  • Does NOT come with any warranty, guarantee, or support commitment
  • May contain inaccuracies, outdated information, or recommendations that do not apply to your environment
  • References publicly available documentation and open-source upstream project sources only

Use at your own risk. Before executing any command on a production or customer cluster, you are solely responsible for understanding the impact, validating against official Red Hat documentation, ensuring change management approvals, and maintaining rollback plans. The author(s) and contributor(s) expressly disclaim all liability for any damage, data loss, service disruption, or other harm resulting from the use of this toolkit.

Red Hat and OpenShift are trademarks or registered trademarks of Red Hat, Inc. in the United States and other countries. All other trademarks are the property of their respective owners.

See the full disclaimer in each runbook for additional terms.

A collection of structured diagnostic runbooks for troubleshooting OpenShift Container Platform 4.20 EgressIP issues with OVN-Kubernetes. Designed to be used with Claude Code (Anthropic's AI coding assistant) as an interactive guide — Claude reads the runbooks and knowledge base, logs into the cluster, runs read-only diagnostic commands, and walks the operator through findings and remediation options.

What this is: Curated runbooks + a knowledge base of publicly available OCP/OVN documentation. Claude acts as an interactive assistant that follows the runbooks step-by-step — it does not make autonomous changes. All cluster modifications require explicit operator approval.

What this is NOT: An automated tool, a product, or a replacement for expertise. The operator remains in control at all times. Claude provides structure, references, and analysis — the human makes the decisions.

How to Use

Step 1: Open a Claude Code session in this directory

cd /path/to/OCP_4.20_egressIP_check
claude

Step 2: Provide your environment background

Before Claude runs any commands, it needs to understand your environment. Start by telling Claude:

Required information:

  • Platform: AWS, bare metal, or vSphere?
  • OCP version: Which version are you running? (e.g., 4.20.5, 4.20.21)
  • What you're trying to do: Set up EgressIP for the first time, or troubleshoot a broken EgressIP?
  • Cluster API URL: The API endpoint for oc login
  • Credentials: Username/password or token (Claude uses these interactively and never saves them)

Helpful context (if available):

  • Which namespace(s) need EgressIP?
  • What EgressIP address are you planning to use (or already using)?
  • What external service are you trying to reach?
  • What symptom are you seeing? (e.g., "traffic stops", "wrong source IP", "intermittent drops")
  • Has EgressIP ever worked on this cluster, or is this the first attempt?
  • Were there any recent changes? (node reboot, upgrade, config change)
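
Some of the required items (OCP version, platform, CNI) can be read from the cluster once you are logged in. The following is a minimal read-only sketch, assuming `oc` is on your PATH and already authenticated; the function names `detect_environment` and `require_ovn_kubernetes` are illustrative and not part of the runbooks.

```shell
#!/bin/sh
# Read-only sketch: collect version, platform, and CNI in one pass.
# Assumes `oc` is installed and already logged in to the cluster.

detect_environment() {
  # Desired cluster version, e.g. 4.20.5
  ocp_version=$(oc get clusterversion version -o jsonpath='{.status.desired.version}')
  # Infrastructure platform type, e.g. AWS, BareMetal, VSphere
  platform=$(oc get infrastructure cluster -o jsonpath='{.status.platformStatus.type}')
  # Cluster network plugin, e.g. OVNKubernetes
  cni=$(oc get network.config cluster -o jsonpath='{.status.networkType}')

  printf 'version=%s\nplatform=%s\ncni=%s\n' "$ocp_version" "$platform" "$cni"
}

# The runbooks require OVN-Kubernetes; returns non-zero for any other plugin.
require_ovn_kubernetes() {
  [ "$1" = "OVNKubernetes" ]
}
```

Running `detect_environment` and pasting its three lines into the first prompt gives Claude the same facts it would otherwise have to ask for.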

Example prompts:

Setting up EgressIP: "I need to set up EgressIP on our OCP 4.20.5 cluster on AWS. The API URL is https://api.mycluster.example.com:6443. We want namespace production to use EgressIP 10.0.1.100 for external traffic to our partner API."

Troubleshooting: "Our EgressIP stopped working after a node reboot. We're on OCP 4.20.5 on bare metal. The EgressIP is 192.168.1.50 assigned to worker-3. Pods in the app namespace can't reach external services but internal traffic is fine. API URL is https://api.ocp.internal:6443."

Step 3: Claude drives the runbook

Based on your input, Claude will:

  1. Present a disclaimer — you must acknowledge before proceeding
  2. Select the right runbook:
    • Setting up → Runbook 1 (pre-requisite check + guided setup)
    • Troubleshooting → Runbook 2 (diagnostic + remediation)
  3. Log into the cluster and discover the actual network topology
  4. Walk through each step, explaining what it's checking and what the results mean
  5. Ask for approval before any action that modifies the cluster (marked with ⛔)
  6. Save all findings to collected-data/ for audit trail

Folder Structure

OCP_4.20_egressIP_check/
│
├── CLAUDE.md                    # Instructions for Claude — safety rules, how to use runbooks
├── README.md                    # This file
│
├── runbooks/
│   ├── 01-pre-requisite-check.md   # Before enabling EgressIP — validates cluster readiness
│   └── 02-diagnostic.md            # When EgressIP is broken — finds root cause + fix
│
├── knowledge-base/              # Offline reference docs (for air-gapped clusters)
│   ├── official-docs/           # Red Hat product docs (OCP 4.16–4.20, ROSA, OKD)
│   ├── troubleshooting/         # Red Hat KB articles + comprehensive bug matrix
│   ├── upstream/                # OVN-Kubernetes design docs (GitHub)
│   └── community/               # Hands-on blog posts with debugging commands
│
├── templates/
│   ├── report-styles.css                # Shared CSS for HTML reports
│   ├── pre-requisite-report-template.html  # Runbook 1 report template
│   ├── diagnosis-summary-template.html     # Runbook 2 report template
│   └── samples/                         # Example outputs (open in browser)
│       ├── sample-pre-requisite-report.html
│       ├── sample-diagnosis-summary.html
│       ├── sample-current-state-diagram.html
│       └── sample-egressip-enabled-diagram.html
│
├── diagrams/                    # Generated HTML topology diagrams (gitignored)
└── collected-data/              # Cluster diagnostic outputs (gitignored)

Runbooks

Runbook 1: Pre-requisite Check + EgressIP Setup (runbooks/01-pre-requisite-check.md)

Use when: Before enabling EgressIP on a cluster, or to set up EgressIP with guided assistance.

Capabilities:

  • Validates RBAC permissions (cluster-admin or minimum required permissions)
  • Detects platform (AWS/bare metal/vSphere), CNI (must be OVN-Kubernetes), OCP version
  • Version-specific bug assessment — checks OCP z-stream against 20+ known EgressIP bugs, flags risks, recommends upgrades
  • Inventories nodes with ENI capacity (AWS), egress-assignable labels, pod subnets
  • Validates node reachability probes (TCP port 9, ICMP unreachable), host firewall rules
  • Checks for IP conflicts, namespace label matching, EgressFirewall deny rules, UDN interactions
  • Stale OVN database inspection — detects leftover SNAT entries from previous configurations
  • Generates before/after Mermaid network topology diagrams with actual cluster data
  • Produces a pass/fail readiness report with all findings
  • External test node setup guide — AWS EC2, existing VM, or public service options with full scripted setup
  • Interactive EgressIP setup — asks 4 questions (namespace, IP, nodes, podSelector), generates exact oc commands, applies with user approval, validates OVN SNAT + external connectivity
  • Bold safety warnings (⛔) on every cluster/infrastructure modification with reversibility info

Safety: All mutating actions require explicit user approval. Read-only by default. Every oc command is shown before execution.
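
The approval rule can be pictured as a small gate around every mutating command. This is only a sketch of the idea, not how Claude Code implements its prompts; `confirm_then_run` is a hypothetical helper name.

```shell
# Hypothetical helper: print a mutating command, then run it only after the
# operator types "yes". Any other answer leaves the cluster untouched.
confirm_then_run() {
  printf 'About to run: %s\n' "$*"
  printf 'Type "yes" to approve: '
  answer=""
  read -r answer || true
  if [ "$answer" = "yes" ]; then
    "$@"
  else
    echo "Skipped (not approved)."
    return 1
  fi
}
```

For example, `confirm_then_run oc label node worker-3 k8s.ovn.org/egress-assignable=""` shows the exact command first and does nothing unless the operator answers `yes`.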

Steps:

| Step | Name | What It Does |
| --- | --- | --- |
| 0 | RBAC Pre-check | Verifies Claude has permission to run all commands. Halts if insufficient. |
| 1 | Environment Detection | Logs into cluster, detects OCP version, CNI (must be OVN-K), platform (AWS/bare metal) |
| 1b | Version Bug Assessment | Cross-references OCP z-stream against 20+ known bugs. Flags HIGH/MEDIUM/LOW risk. |
| 2 | Node Inventory | Lists all nodes, egress-assignable labels, ENI capacity (AWS), control plane guard |
| 3 | Network Topology | Captures pod/service CIDRs, existing EgressIP/EgressFirewall/UDN objects |
| 4 | Reachability Probes | Checks TCP port 9 probes, host firewall rules, reachabilityTotalTimeoutSeconds |
| 5 | Platform Checks | AWS: ENI subnet match, security groups, route tables. Bare metal: NIC config, ARP, routing |
| 6 | IP & Namespace Validation | Confirms planned EgressIP is available, namespace labels match, no conflicts |
| 7 | Generate Diagrams | Creates Mermaid diagrams of current state and projected EgressIP topology |
| 7b | External Test Node Setup | Guides setup of an EC2 (AWS) or VM (bare metal) to verify source IP changes |
| 8 | Summary Report | Pass/fail checklist, blockers, warnings, ready-to-apply commands |
| 8b | Interactive EgressIP Setup | Asks 4 questions, generates oc commands, applies with approval, validates OVN + connectivity |
| 9 | Checkpoint | Saves progress to collected-data/checkpoint.yaml for session resume |
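
To make the interactive setup step concrete, here is a minimal sketch of turning the answers into an EgressIP manifest. It assumes the namespace is selected through its automatic `kubernetes.io/metadata.name` label and leaves out `podSelector`, so every pod in the namespace is matched. The helper names `render_egressip` and `label_command` are illustrative; the commands the runbook actually generates may differ.

```shell
# Sketch: print an EgressIP manifest on stdout.
# Arguments: <object-name> <egress-ip> <namespace>
# Assumption: namespace matched by kubernetes.io/metadata.name; no podSelector.
render_egressip() {
  cat <<EOF
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: $1
spec:
  egressIPs:
  - $2
  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: $3
EOF
}

# Print (do not run) the command that marks a node as able to host egress IPs.
label_command() {
  printf 'oc label node %s k8s.ovn.org/egress-assignable=""\n' "$1"
}
```

A typical use is `render_egressip egressip-production 10.0.1.100 production > egressip.yaml`: review the file, then apply it with `oc apply -f egressip.yaml` only after approval. Both helpers only print text, so nothing changes on the cluster until the operator acts.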

Runbook 2: Diagnostic Troubleshooting (runbooks/02-diagnostic.md)

Use when: EgressIP is enabled but external egress is broken, or traffic uses the wrong source IP.

Symptom: Pods lose external egress when EgressIP is enabled. Internal pod-to-pod traffic still works.

Capabilities:

  • Quick Triage fast path (< 2 min): checks 3 most common causes — missing node label, unassigned EgressIP, ENI capacity exceeded
  • Full OVN inspection: logical router policies (priority 100/102), NAT entries on gateway routers, OVS flow table
  • ovn-nbctl show quick overview — adapted from community blog, shows NAT state at a glance
  • The smoking gun: an OVS priority 103 drop rule with no priority 105 SNAT rule means traffic is dropped silently
  • Stale SNAT / conntrack validation — cross-node duplicate detection, dead pod SNAT cleanup, conntrack zone 64000 inspection
  • EgressFirewall conflict detection — catches deny rules that silently block traffic after SNAT
  • Platform checks: AWS (ENI secondary IP, security groups, route tables) and bare metal (ARP table, stale GARP bug OCPBUGS-62273/65618)
  • Version-specific bug correlation — maps symptom + OCP version to specific OCPBUGS with fix versions
  • Connectivity comparison testing with external test node (or fallback chain: curl → wget → /dev/tcp → oc debug)
  • Root cause decision tree with 11 known causes and exact remediation commands
  • Cross-references knowledge base for each root cause
  • OCP 4.20 pod architecture documentation — notes that ovnkube-master does not exist (replaced by ovnkube-control-plane + ovnkube-node)
  • Correct container names: ovnkube-controller for OVN commands, ovn-controller for OVS commands
  • Correct bridge name: br-ex on AWS (not breth0 from upstream docs)
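
The container and bridge names above map onto read-only commands roughly as follows. This is a sketch under assumptions: the `app=ovnkube-node` pod label and the `GR_<node>` gateway router naming are taken to match your cluster, and the helper names are made up for illustration.

```shell
NS=openshift-ovn-kubernetes

# Name of the ovnkube-node pod scheduled on the given node.
ovnkube_pod_on() {
  oc -n "$NS" get pods -l app=ovnkube-node \
    --field-selector "spec.nodeName=$1" -o name
}

# List NAT entries on the node's gateway router (GR_<node>); read-only.
list_gateway_nat() {
  pod=$(ovnkube_pod_on "$1")
  oc -n "$NS" exec "$pod" -c ovnkube-controller -- \
    ovn-nbctl lr-nat-list "GR_$1"
}

# Dump OpenFlow rules on the external bridge br-ex; read-only.
dump_bridge_flows() {
  pod=$(ovnkube_pod_on "$1")
  oc -n "$NS" exec "$pod" -c ovn-controller -- \
    ovs-ofctl dump-flows br-ex
}
```

Both helpers need `pods/exec` permission in `openshift-ovn-kubernetes`, which is what the RBAC pre-check in Step 0 verifies.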

Safety: Read-only diagnostics. Remediation commands (pod restart, conntrack flush) require explicit user approval with bold ⛔ warnings.

Steps:

| Step | Name | What It Does |
| --- | --- | --- |
| 0 | RBAC Pre-check | Verifies permissions including pods/exec and pods/log in openshift-ovn-kubernetes |
| 1 | Environment Detection | Logs in, detects OCP version, CNI, platform, reachability timeout |
| 2 | Quick Triage (< 2 min) | Checks 3 most common causes: missing label, unassigned EgressIP, ENI capacity. Skips to fix if hit. |
| 3 | EgressIP CR Inspection | Dumps CR state, validates selectors, detects dual-assignment bug (OCPBUGS-59531) |
| 4 | Node Validation | Checks assigned node is Ready, not cordoned, labeled, reachability timeout |
| 5 | Controller Logs | Searches ovnkube-controller logs for reconciliation errors, allocation failures |
| 6-quick | ovn-nbctl show | Quick NAT overview: shows all SNAT entries for the EgressIP at a glance |
| 6a | Logical Router Policies | Checks priority 102 (east-west exemption) and priority 100 (egress redirect) |
| 6b | NAT on Gateway Router | Checks SNAT entry on GR; missing SNAT is the #1 cause of egress failure |
| 6c | UDN Branch | If user-defined networks exist, checks alternate router GR_<network>_<node> |
| 6d | Southbound Flows | Verifies NB entries committed to SB database |
| 6e | OVS Flow Table | The smoking gun: checks priority 103 drop / 105 SNAT on br-ex |
| 6f | Stale SNAT / Conntrack | Cross-node SNAT duplicates, dead pod entries, conntrack zone 64000 inspection |
| 7 | EgressFirewall Check | Detects deny rules that silently block traffic after EgressIP SNAT |
| 8 | Platform Checks | AWS: ENI secondary IP, security groups. Bare metal: ARP table, stale GARP bug |
| 9 | Connectivity Testing | External test node verification with fallback chain. Comparison test table. |
| 10 | Diagnosis & Remediation | Decision tree, confidence levels, 11 root causes with exact oc fix commands |
| 11 | Checkpoint | Saves root cause, confidence, remediation status for session resume |
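
The Step 6e check can be approximated with a filter over `ovs-ofctl dump-flows br-ex` output. The sketch below counts rules by priority only, as the step description does; the real flows carry additional match fields that are not inspected here, and `check_egress_flows` is a hypothetical name.

```shell
# Reads `ovs-ofctl dump-flows` output on stdin. Exits non-zero when a
# priority 103 drop rule is present but no priority 105 rule exists.
check_egress_flows() {
  awk '
    /priority=105[, ]/                   { snat++ }
    /priority=103[, ]/ && /actions=drop/ { drop++ }
    END {
      if (drop > 0 && snat == 0) {
        print "SUSPECT: priority 103 drop without priority 105"
        exit 1
      }
      print "OK: no unmatched priority 103 drop"
    }
  '
}
```

Because it reads stdin, the filter also works on flow dumps already saved under collected-data/, so the check can be repeated without touching the cluster again.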

Knowledge Base

Pre-fetched documentation for offline/air-gapped use:

| Directory | Contents | Files |
| --- | --- | --- |
| official-docs/ | Red Hat product docs for OCP 4.16–4.20, ROSA, OKD, API reference | 7 files |
| troubleshooting/ | Red Hat KB articles (7005481, 6247851, 7058538) + comprehensive bug matrix with 20+ known bugs mapped to fix versions | 4 files |
| upstream/ | OVN-Kubernetes EgressIP design doc, DeepWiki source walkthrough | 2 files |
| community/ | Rcarrata end-to-end blogs (OVN-K and SDN) | 2 files |

Key Knowledge Base File: egressip-known-bugs.md

The bug matrix maps every known EgressIP bug to:

  • Affected OCP versions
  • Fix version (which z-stream)
  • Whether a specific customer version (e.g., 4.20.5) is affected
  • Diagnostic commands to detect each bug
  • Remediation steps
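
Deciding whether a running z-stream predates a fix needs a numeric-aware comparison, because plain string comparison ranks 4.20.21 before 4.20.5. A minimal sketch, assuming a `sort` that supports `-V` (GNU coreutils does); the fix version in the usage note is a placeholder, not an entry from the bug matrix.

```shell
# True (exit 0) when version $1 is strictly older than version $2.
# `sort -V` orders version strings numerically, so 4.20.5 < 4.20.21.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Report whether a cluster still lacks a fix.
# $1 = running version, $2 = fix version taken from the bug matrix.
needs_fix() {
  if version_lt "$1" "$2"; then
    echo "affected: $1 is older than fix $2"
  else
    echo "fixed: $1 includes fix $2"
  fi
}
```

For example, `needs_fix 4.20.5 4.20.8` reports the cluster as affected, with 4.20.8 standing in for a real fix version. The comparison only makes sense within one minor stream; a bug fixed separately in several streams needs the fix version for the stream you are running.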

Collected Data

All diagnostic outputs are saved to collected-data/ during runbook execution:

  • YAML files for structured data (cluster info, node inventory, OVN internals)
  • TXT files for raw logs (ovnkube-controller logs)
  • MD files for human-readable reports and diagrams

Security: This directory is gitignored. All files include a sensitivity header. No authentication tokens, passwords, or kubeconfig contents are ever saved.
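
One way to picture the sensitivity header and the no-credentials rule is a small wrapper around every write. This is a sketch only: the header wording, the `COLLECTED_DIR` variable, the `save_finding` name, and the credential patterns are assumptions made for illustration, and the patterns are not exhaustive.

```shell
# Hypothetical helper: write stdin to <dir>/<name> with a sensitivity header,
# refusing input that looks like it contains a credential.
save_finding() {
  out_dir=${COLLECTED_DIR:-collected-data}
  body=$(cat)
  # OpenShift OAuth tokens start with "sha256~"; also catch bearer headers
  # and token/password fields. Illustrative patterns, not a complete scanner.
  if printf '%s\n' "$body" |
    grep -Eqi 'sha256~|bearer [a-z0-9._~-]+|(token|password):'; then
    echo "refused: input looks like it contains a credential" >&2
    return 1
  fi
  mkdir -p "$out_dir"
  {
    echo "# SENSITIVE: cluster diagnostic data, do not commit or share"
    printf '%s\n' "$body"
  } > "$out_dir/$1"
}
```

For example, `oc get nodes -o yaml | save_finding node-inventory.yaml` would store the inventory under collected-data/ with the header on the first line.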

Supported Platforms

| Platform | Runbook 1 | Runbook 2 |
| --- | --- | --- |
| AWS | Full support | Full support |
| Bare metal | Full support (via oc debug node/) | Full support + GARP bug detection |
| vSphere | Basic support (similar to bare metal) | Basic support |

OCP Version Support

Tested on OCP 4.20.21. Compatible with OCP 4.12+ (OVN-Kubernetes required).

The bug matrix covers versions 4.10 through 4.20.21 with specific z-stream fix mappings.

Sample Reports

The templates/samples/ directory contains example HTML outputs generated from a real test cluster (scrubbed of real IPs/identifiers). Open them in a browser to see what the runbooks produce:

| Sample | Description |
| --- | --- |
| sample-pre-requisite-report.html | Runbook 1 output: readiness checks, cluster topology with nodes/EgressIPs/OVN, traffic flow diagrams, connectivity verification |
| sample-diagnosis-summary.html | Runbook 2 output: root cause (EgressFirewall deny), diagnostic timeline, three-plane state, affected flow with failure point, remediation commands |
| sample-current-state-diagram.html | Before EgressIP: cluster topology with default SNAT-to-node-IP egress path |
| sample-egressip-enabled-diagram.html | After EgressIP: multi-namespace multi-node topology with cross-node rerouting, OVN NAT tables, connectivity results |

The templates/ directory also contains:

  • pre-requisite-report-template.html — HTML template with {{PLACEHOLDER}} variables for Runbook 1
  • diagnosis-summary-template.html — HTML template with {{PLACEHOLDER}} variables for Runbook 2
  • report-styles.css — shared CSS reference for the design language

Validated

Both runbooks were tested against a live OCP 4.20 cluster on AWS with:

  • 3 namespaces, 2 worker nodes, 3 EgressIPs (multi-node multi-namespace)
  • External EC2 test node for source IP verification
  • Two break/fix scenarios (label removal, EgressFirewall deny)
  • OVN internals inspection verified against actual cluster state
  • Container names, bridge names, and command syntax corrected from testing

About

AI-Assisted Diagnostic Runbooks for OpenShift 4.20 EgressIP (OVN-Kubernetes). Curated runbooks + knowledge base designed for use with Claude Code.
