Skip to content

Commit ce2dcde

Browse files
authored
[action] [PR:25326] Fix flaky test_bgp_bbr by extending check_tor1/check_dut timeouts and de-duping other_vms on LAG topologies (#1238)
Summary: Fixes the long-standing flakiness in `bgp/test_bgp_bbr.py::test_bbr_enabled_dut_asn_in_aspath` (and its sibling BBR propagation tests) where the helper `check_bbr_route_propagation` would intermittently fail with `Failed: DUT check failed` at `tests/bgp/test_bgp_bbr.py:414`, with the test log showing several iterations of `"DUT didn't advertise the route"` before the assertion fires. Error Logs: - ErrSig: `Failed: DUT check failed` - 103 distinct PRs, 118 hits, 103 test plans affected - Branches: `master`, `202412`, `202605` - Topologies: `t1-lag`, `t1-8-lag` - Last seen: ongoing ### Type of change - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 - [ ] 202511 - [ ] 202512 - [x] 202605 ### Approach #### What is the motivation for this PR? `check_bbr_route_propagation` polls FRR's `show ip bgp <prefix> json` for the `advertisedTo` field via two `wait_until(...)` calls. Two issues make the second poll race-prone: 1. **The poll budget is fiction on KVM.** The original budget `wait_until(30, 1, 0, check_dut, ...)` *looks* like "30 polls × 1s sleep = 30s", but each iteration calls `duthost.get_route()`, which does an ansible round-trip (`docker exec` into the bgp container + `vtysh -c`). On KVM that round-trip costs 3–7s, so the nominal 30 polls collapse to ~5–7 real polls before the deadline. Any transient bgpd slowness — UPDATE generation under load, route-monitor lock contention, container restart — blows past that budget. Same logic applies to the prior `wait_until(5, 1, 0, check_tor1, ...)`, which collapses to ~1–2 real polls. 2. **`setup['other_vms']` contains duplicates on LAG topologies.** On `t1-lag` / `t1-8-lag` the minigraph lists one entry per LAG member link, so the same neighbor appears in `other_vms` multiple times (e.g. `['ARISTA01T2', 'ARISTA01T2', 'ARISTA02T0', ...]`). FRR's `advertisedTo` de-dups, so the per-VM check ```python for vm in other_vms: if vm not in advertisedTo: return False ``` silently waits forever for a phantom "second copy" of a neighbor that will never appear, even after BGP advertisement is fully complete. #### How did you do it? `tests/bgp/test_bgp_bbr.py` in `check_bbr_route_propagation`: ```diff - pytest_assert(wait_until(5, 1, 0, check_tor1, nbrhosts, setup, route), 'tor1 check failed') + pytest_assert(wait_until(60, 5, 0, check_tor1, nbrhosts, setup, route), 'tor1 check failed') - pytest_assert(wait_until(30, 1, 0, check_dut, duthost, other_vms, bgp_neighbors, - setup, route, accepted=accepted), 'DUT check failed') + pytest_assert(wait_until(120, 5, 0, check_dut, duthost, list(dict.fromkeys(other_vms)), + bgp_neighbors, setup, route, accepted=accepted), 'DUT check failed') ``` - Bumped `check_tor1` budget from `(5, 1, 0)` to `(60, 5, 0)` (~12 real polls vs ~1–2). - Bumped `check_dut` budget from `(30, 1, 0)` to `(120, 5, 0)` (~24 real polls vs ~5–7). - Wrapped `other_vms` with `list(dict.fromkeys(other_vms))` to de-dup while preserving order. No effect on non-LAG topologies (no duplicates to remove); on LAG topologies it prevents the silent infinite wait. Healthy-path cost change is negligible — `check_dut` returns `True` on the first successful poll, and the 5s sleep cadence only matters when the route is genuinely not yet advertised. #### How did you verify/test it? Deterministic reproduction of the failure shape on a local KVM dev-VM by forcing the exact broken state seen in the failing elastictest run (slow / unresponsive `bgpd` during the `check_dut` polling window). **Reproduction recipe:** ```bash # In sonic-mgmt container, in tests/: # Schedule: SIGSTOP bgpd at t+60s for 45s, then SIGCONT (sleep 60 ; sshpass -p password ssh admin@10.250.0.105 \ "sudo docker exec bgp kill -STOP 68" ; \ sleep 45 ; sshpass -p password ssh admin@10.250.0.105 \ "sudo docker exec bgp kill -CONT 68") & # Immediately kick off the test: python3 -m pytest bgp/test_bgp_bbr.py::test_bbr_enabled_dut_asn_in_aspath \ --inventory=../ansible/veos_vtb --host-pattern=all \ --testbed=vms-kvm-t1-lag --testbed_file=../ansible/vtestbed.yaml \ --log-cli-level=info --kube_master=unset --showlocals --assert=plain \ --show-capture=no -rav --skip_sanity --disable_loganalyzer \ --topology=t1,any --device_type=vs --maxfail=1 ``` Both runs used the **identical** `vms-kvm-t1-lag` topology, the **identical** pytest invocation, and the **identical** SIGSTOP/SIGCONT schedule. The only variable changed between runs is the file content. **Pre-fix failing log (excerpt):** ``` 12/06/2026 01:xx:xx bgp_helpers.check_dut WARNING | DUT didn't advertise the route ... (×7 iterations over ~28s) ... FAILED bgp/test_bgp_bbr.py::test_bbr_enabled_dut_asn_in_aspath E Failed: DUT check failed tests/bgp/test_bgp_bbr.py:414: Failed ================== 1 failed, ... in 131.04s (0:02:11) ================== ``` **Post-fix passing log (excerpt):** ``` 12/06/2026 01:53:25 sigschedule | SIGCONT — bgpd back to Rl state 12/06/2026 01:53:57 gcu_utils.apply_patch INFO | Commands: config apply-patch ... 12/06/2026 01:54:14 conftest.core_dump_and_config_check INFO | Core dump and config check passed for test_bgp_bbr.py ================= 1 passed, 130 warnings in 150.38s (0:02:30) ================== ``` #### Any platform specific information? N/A #### Supported testbed topology if it's a new test case? N/A ### Documentation N/A Signed-off-by: Sonic Build Admin <sonicbld@microsoft.com>
1 parent 1bfdca2 commit ce2dcde

1 file changed

Lines changed: 3 additions & 3 deletions

File tree

tests/bgp/test_bgp_bbr.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -408,11 +408,11 @@ def _check_route():
408408
bgp_neighbors = json.loads(duthost.shell("sonic-cfggen -d --var-json 'BGP_NEIGHBOR'")['stdout'])
409409

410410
# check tor1
411-
pytest_assert(wait_until(5, 1, 0, check_tor1, nbrhosts, setup, route), 'tor1 check failed')
411+
pytest_assert(wait_until(60, 5, 0, check_tor1, nbrhosts, setup, route), 'tor1 check failed')
412412

413413
# check DUT
414-
pytest_assert(wait_until(30, 1, 0, check_dut, duthost, other_vms, bgp_neighbors,
415-
setup, route, accepted=accepted), 'DUT check failed')
414+
pytest_assert(wait_until(120, 5, 0, check_dut, duthost, list(dict.fromkeys(other_vms)),
415+
bgp_neighbors, setup, route, accepted=accepted), 'DUT check failed')
416416

417417
results = parallel_run(check_other_vms, (nbrhosts, setup, route), {'accepted': accepted},
418418
other_vms, timeout=120, concurrent_tasks=6)

0 commit comments

Comments
 (0)