Commit ce2dcde
authored
[action] [PR:25326] Fix flaky test_bgp_bbr by extending check_tor1/check_dut timeouts and de-duping other_vms on LAG topologies (#1238)
Summary:
Fixes the long-standing flakiness in `bgp/test_bgp_bbr.py::test_bbr_enabled_dut_asn_in_aspath` (and its sibling BBR propagation tests) where the helper `check_bbr_route_propagation` would intermittently fail with `Failed: DUT check failed` at `tests/bgp/test_bgp_bbr.py:414`, with the test log showing several iterations of `"DUT didn't advertise the route"` before the assertion fires.
Error Logs:
- ErrSig: `Failed: DUT check failed`
- 103 distinct PRs, 118 hits, 103 test plans affected
- Branches: `master`, `202412`, `202605`
- Topologies: `t1-lag`, `t1-8-lag`
- Last seen: ongoing
### Type of change
- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
- [ ] Skipped for non-supported platforms
- [ ] Test case improvement
### Back port request
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505
- [ ] 202511
- [ ] 202512
- [x] 202605
### Approach
#### What is the motivation for this PR?
`check_bbr_route_propagation` polls FRR's `show ip bgp <prefix> json` for the `advertisedTo` field via two `wait_until(...)` calls. Two issues make the second poll race-prone:
1. **The poll budget is fiction on KVM.** The original budget `wait_until(30, 1, 0, check_dut, ...)` *looks* like "30 polls × 1s sleep = 30s", but each iteration calls `duthost.get_route()`, which does an ansible round-trip (`docker exec` into the bgp container + `vtysh -c`). On KVM that round-trip costs 3–7s, so the nominal 30 polls collapse to ~5–7 real polls before the deadline. Any transient bgpd slowness — UPDATE generation under load, route-monitor lock contention, container restart — blows past that budget. Same logic applies to the prior `wait_until(5, 1, 0, check_tor1, ...)`, which collapses to ~1–2 real polls.
2. **`setup['other_vms']` contains duplicates on LAG topologies.** On `t1-lag` / `t1-8-lag` the minigraph lists one entry per LAG member link, so the same neighbor appears in `other_vms` multiple times (e.g. `['ARISTA01T2', 'ARISTA01T2', 'ARISTA02T0', ...]`). FRR's `advertisedTo` de-dups, so the per-VM check
```python
for vm in other_vms:
if vm not in advertisedTo:
return False
```
silently waits forever for a phantom "second copy" of a neighbor that will never appear, even after BGP advertisement is fully complete.
#### How did you do it?
`tests/bgp/test_bgp_bbr.py` in `check_bbr_route_propagation`:
```diff
- pytest_assert(wait_until(5, 1, 0, check_tor1, nbrhosts, setup, route), 'tor1 check failed')
+ pytest_assert(wait_until(60, 5, 0, check_tor1, nbrhosts, setup, route), 'tor1 check failed')
- pytest_assert(wait_until(30, 1, 0, check_dut, duthost, other_vms, bgp_neighbors,
- setup, route, accepted=accepted), 'DUT check failed')
+ pytest_assert(wait_until(120, 5, 0, check_dut, duthost, list(dict.fromkeys(other_vms)),
+ bgp_neighbors, setup, route, accepted=accepted), 'DUT check failed')
```
- Bumped `check_tor1` budget from `(5, 1, 0)` to `(60, 5, 0)` (~12 real polls vs ~1–2).
- Bumped `check_dut` budget from `(30, 1, 0)` to `(120, 5, 0)` (~24 real polls vs ~5–7).
- Wrapped `other_vms` with `list(dict.fromkeys(other_vms))` to de-dup while preserving order. No effect on non-LAG topologies (no duplicates to remove); on LAG topologies it prevents the silent infinite wait.
Healthy-path cost change is negligible — `check_dut` returns `True` on the first successful poll, and the 5s sleep cadence only matters when the route is genuinely not yet advertised.
#### How did you verify/test it?
Deterministic reproduction of the failure shape on a local KVM dev-VM by forcing the exact broken state seen in the failing elastictest run (slow / unresponsive `bgpd` during the `check_dut` polling window).
**Reproduction recipe:**
```bash
# In sonic-mgmt container, in tests/:
# Schedule: SIGSTOP bgpd at t+60s for 45s, then SIGCONT
(sleep 60 ; sshpass -p password ssh admin@10.250.0.105 \
"sudo docker exec bgp kill -STOP 68" ; \
sleep 45 ; sshpass -p password ssh admin@10.250.0.105 \
"sudo docker exec bgp kill -CONT 68") &
# Immediately kick off the test:
python3 -m pytest bgp/test_bgp_bbr.py::test_bbr_enabled_dut_asn_in_aspath \
--inventory=../ansible/veos_vtb --host-pattern=all \
--testbed=vms-kvm-t1-lag --testbed_file=../ansible/vtestbed.yaml \
--log-cli-level=info --kube_master=unset --showlocals --assert=plain \
--show-capture=no -rav --skip_sanity --disable_loganalyzer \
--topology=t1,any --device_type=vs --maxfail=1
```
Both runs used the **identical** `vms-kvm-t1-lag` topology, the **identical** pytest invocation, and the **identical** SIGSTOP/SIGCONT schedule. The only variable changed between runs is the file content.
**Pre-fix failing log (excerpt):**
```
12/06/2026 01:xx:xx bgp_helpers.check_dut WARNING | DUT didn't advertise the route
... (×7 iterations over ~28s) ...
FAILED bgp/test_bgp_bbr.py::test_bbr_enabled_dut_asn_in_aspath
E Failed: DUT check failed
tests/bgp/test_bgp_bbr.py:414: Failed
================== 1 failed, ... in 131.04s (0:02:11) ==================
```
**Post-fix passing log (excerpt):**
```
12/06/2026 01:53:25 sigschedule | SIGCONT — bgpd back to Rl state
12/06/2026 01:53:57 gcu_utils.apply_patch INFO | Commands: config apply-patch ...
12/06/2026 01:54:14 conftest.core_dump_and_config_check INFO | Core dump and config check passed for test_bgp_bbr.py
================= 1 passed, 130 warnings in 150.38s (0:02:30) ==================
```
#### Any platform specific information?
N/A
#### Supported testbed topology if it's a new test case?
N/A
### Documentation
N/A
Signed-off-by: Sonic Build Admin <sonicbld@microsoft.com>1 parent 1bfdca2 commit ce2dcde
1 file changed
Lines changed: 3 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
408 | 408 | | |
409 | 409 | | |
410 | 410 | | |
411 | | - | |
| 411 | + | |
412 | 412 | | |
413 | 413 | | |
414 | | - | |
415 | | - | |
| 414 | + | |
| 415 | + | |
416 | 416 | | |
417 | 417 | | |
418 | 418 | | |
| |||
0 commit comments