fix(lorawan): prevent permanent WOULD_BLOCK when duty-cycle backoff_time is zero #15545

Open
hallard wants to merge 2 commits into ARMmbed:master from hallard:lorawan/fix_backoff

Contributor

hallard commented Mar 6, 2026

Summary

In LoRaMac::schedule_tx(), when set_next_channel() returns LORAWAN_STATUS_DUTYCYCLE_RESTRICTED with backoff_time == 0, the original code skipped both starting the backoff timer and setting _can_cancel_tx. However, the caller (process_scheduling_state in LoRaWANStack.cpp) still set tx_ongoing = true and transitioned to DEVICE_STATE_SENDING.

This creates an unrecoverable stuck state:

  • No timer fires → on_backoff_timer_expiry() never called → handle_scheduling_failure() never called → reset_ongoing_tx() never called
  • tx_ongoing stays true permanently
  • All subsequent lorawan.send() calls return LORAWAN_STATUS_WOULD_BLOCK (-1001) forever
  • stop_sending() cannot recover the state either, because _can_cancel_tx is false, making clear_tx_pipe() return LORAWAN_STATUS_BUSY

The backoff_time == 0 case can occur due to sub-millisecond rounding when the remaining duty-cycle time is nearly zero at the moment set_next_channel() runs.
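The PR does not show where the rounding happens. As a minimal sketch, assuming the remaining wait is tracked at a finer resolution than milliseconds and converted with integer division, the truncation could look like this. `remaining_us_to_ms` is a hypothetical helper for illustration, not a function in Mbed OS.

```cpp
#include <cstdint>

// Hypothetical helper (not part of Mbed OS): convert a remaining
// duty-cycle wait from microseconds to whole milliseconds.
// Integer division truncates toward zero, so any wait shorter than
// 1 ms becomes 0 even though the channel is still restricted.
constexpr uint32_t remaining_us_to_ms(uint32_t remaining_us)
{
    return remaining_us / 1000U;
}
```

Under that assumption, a wait of 999 µs is reported as a backoff of 0 ms while the status is still DUTYCYCLE_RESTRICTED, which is the combination the old code did not handle.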

Fix

Enforce a minimum backoff of 1 ms so the timer always fires. This gives on_backoff_timer_expiry() a path to either retry the transmission or invoke handle_scheduling_failure() to clean up the state.

// Before
case LORAWAN_STATUS_DUTYCYCLE_RESTRICTED:
    if (backoff_time != 0) {
        tr_debug("DC enforced: Transmitting in %lu ms", backoff_time);
        _can_cancel_tx = true;
        ...
        _lora_time.start(_params.timers.backoff_timer, backoff_time);
    }
    return LORAWAN_STATUS_OK;  // returns OK even when no timer was started!

// After
case LORAWAN_STATUS_DUTYCYCLE_RESTRICTED:
    if (backoff_time == 0) {
        backoff_time = 1;  // ensure timer always fires
    }
    tr_debug("DC enforced: Transmitting in %lu ms", backoff_time);
    _can_cancel_tx = true;
    ...
    _lora_time.start(_params.timers.backoff_timer, backoff_time);
    return LORAWAN_STATUS_OK;
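The difference between the two versions can be reproduced with a small stand-alone model. This is only a sketch: `ToyMac` and its members are illustrative stand-ins rather than the real LoRaMac or LoRaWANStack classes, and timer expiry is simplified to "the attempt ends and tx_ongoing is cleared".

```cpp
#include <cstdint>

// Toy model of the scheduling path; all names are illustrative
// stand-ins, not the real Mbed OS classes.
struct ToyMac {
    bool tx_ongoing = false;     // set by the caller once schedule_tx() returns OK
    bool can_cancel_tx = false;  // mirrors _can_cancel_tx
    bool timer_armed = false;    // true while a backoff timer is pending
    uint32_t timer_ms = 0;

    // Duty-cycle-restricted branch of schedule_tx().
    // clamp == false reproduces the old behaviour, clamp == true the fix.
    void schedule_tx_restricted(uint32_t backoff_time, bool clamp)
    {
        if (clamp && backoff_time == 0) {
            backoff_time = 1;  // ensure the timer always fires
        }
        if (backoff_time != 0) {
            can_cancel_tx = true;
            timer_armed = true;
            timer_ms = backoff_time;
        }
        tx_ongoing = true;  // the caller treats the OK return as "TX scheduled"
    }

    // Simplified timer expiry: in this model it just ends the attempt.
    // The real handler would retry or call handle_scheduling_failure().
    void fire_timer_if_armed()
    {
        if (timer_armed) {
            timer_armed = false;
            tx_ongoing = false;
        }
    }

    // send() is rejected while a previous TX is still marked ongoing.
    bool send_would_block() const { return tx_ongoing; }
};

// Returns true if a later send() is still blocked after any pending
// timer has had a chance to fire.
bool stuck_after_zero_backoff(bool clamp)
{
    ToyMac mac;
    mac.schedule_tx_restricted(0, clamp);
    mac.fire_timer_if_armed();
    return mac.send_would_block();
}
```

With `clamp == false` no timer is armed, so nothing ever clears `tx_ongoing`; with `clamp == true` the 1 ms timer is armed and the model recovers.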

Test plan

  • Verify normal duty-cycle restricted behaviour (non-zero backoff_time) is unchanged
  • Verify that a simulated backoff_time == 0 / DUTYCYCLE_RESTRICTED case no longer leaves tx_ongoing stuck
  • Run existing LoRaWAN unit tests: connectivity/lorawan/tests/

🤖 Generated with Claude Code

hallard and others added 2 commits March 6, 2026 10:35
…ime is zero

In schedule_tx(), when set_next_channel() returns DUTYCYCLE_RESTRICTED
with backoff_time == 0 (which can occur due to sub-millisecond rounding
when remaining duty-cycle time is nearly expired), the original code did
not start the backoff timer and did not set _can_cancel_tx. However, the
caller (process_scheduling_state) still set tx_ongoing=true and
transitioned to DEVICE_STATE_SENDING.

With no timer scheduled to retry, on_backoff_timer_expiry() never fires,
handle_scheduling_failure() is never called, and reset_ongoing_tx() is
never reached. The MAC is permanently stuck with tx_ongoing=true, causing
all subsequent lorawan.send() calls to return LORAWAN_STATUS_WOULD_BLOCK
(-1001) forever. Additionally, stop_sending() cannot recover the state
because _can_cancel_tx is false, making clear_tx_pipe() return BUSY.

Fix: enforce a minimum backoff of 1ms so the timer always fires regardless
of how small the computed remaining time is.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Bug 1 - LoRaMac::disconnect() does not clear tx_ongoing:
All timers (backoff, RX windows, ACK timeout) are stopped in disconnect(),
which prevents the state machine from ever calling reset_ongoing_tx(). If a
TX was in-flight at disconnect time, tx_ongoing remains true. After
reconnect, _lw_session.active becomes true again but tx_ongoing is still
true, so every subsequent lorawan.send() returns LORAWAN_STATUS_WOULD_BLOCK
(-1001) permanently. Fix: call reset_ongoing_tx(true) at end of disconnect().

Bug 2 - QoS nb_trans retry leaves tx_ongoing stuck on re-send failure:
When the network server configures nb_trans > LORAWAN_DEFAULT_QOS,
post_process_tx_no_reception() queues a new state_controller(SCHEDULING)
call via _queue->call() and returns early, leaving tx_ongoing=true from the
first TX. If the queued scheduling fires but send_ongoing_tx() fails with a
direct error (e.g. LORAWAN_STATUS_NO_CHANNEL_FOUND), process_scheduling_state
silently ignores the failure because the _queue->call() return value is
discarded, tx_ongoing stays true, and there is no path to reset_ongoing_tx().
Fix: in process_scheduling_state(), detect the case where send_ongoing_tx()
failed while tx_ongoing was already true and explicitly clean up the state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
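The two cleanup paths described in the second commit can be illustrated with a stand-alone model. This is a sketch under stated assumptions: `ToySession` and its methods are hypothetical names, the boolean parameters only exist to switch between the old and fixed behaviour, and none of this is the actual diff.

```cpp
// Toy session model; names are illustrative, not the Mbed OS classes.
struct ToySession {
    bool active = false;
    bool tx_ongoing = false;
    bool timers_running = false;

    void connect() { active = true; }

    void start_tx()
    {
        tx_ongoing = true;
        timers_running = true;
    }

    // Stand-in for reset_ongoing_tx().
    void reset_ongoing_tx() { tx_ongoing = false; }

    // Bug 1: reset_on_disconnect == false models a disconnect() that
    // only stops the timers; true models the described fix.
    void disconnect(bool reset_on_disconnect)
    {
        timers_running = false;
        active = false;
        if (reset_on_disconnect) {
            reset_ongoing_tx();
        }
    }

    // Bug 2: a retry is scheduled while tx_ongoing is already true.
    // If the re-send fails outright, nothing else clears the flag,
    // so the fix clears it at the point of failure.
    void on_retry_result(bool send_ok, bool cleanup_on_failure)
    {
        if (!send_ok && tx_ongoing && cleanup_on_failure) {
            reset_ongoing_tx();
        }
    }
};

// True if tx_ongoing survives a disconnect/reconnect cycle.
bool blocked_after_reconnect(bool reset_on_disconnect)
{
    ToySession s;
    s.connect();
    s.start_tx();
    s.disconnect(reset_on_disconnect);
    s.connect();
    return s.tx_ongoing;
}

// True if tx_ongoing survives a failed retry.
bool blocked_after_failed_retry(bool cleanup_on_failure)
{
    ToySession s;
    s.connect();
    s.start_tx();
    s.on_retry_result(false, cleanup_on_failure);
    return s.tx_ongoing;
}
```

In both cases the flag stays set only when the explicit cleanup is missing, which matches the WOULD_BLOCK symptom described in the commit message.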