L1 to L4 design for Cross-Repository CI Relay #40

Description

@fffrog

Background

During the L1 implementation in #7847, we ran into some confusion about the Cross-Repository CI Relay, parts of which may not be clearly specified in the RFC. I am therefore opening this issue to discuss the detailed design of L1 through L4 and to guide the future L2-L4 implementations. I would also like to hear different opinions from the community.

L1

%%{init: {"theme": "base"}}%%
sequenceDiagram
    participant U as UpStream Repo
    participant W as webhook_handler
    participant R as Redis
    participant S as result_handler
    participant D as DownStream Repo


    U->>W: PR/Push event trigger
    W->>R: Get Allowlist
    W->>D: Passthrough Payload

L1 is the most basic event forwarding layer. Its core goal is to allow the upstream PyTorch repository to safely forward PR and push events to onboarded downstream repositories, without writing any status back to the upstream side.

The flow in the diagram is straightforward:

  1. An upstream PR event (opened/reopened/synchronize/closed) or push event triggers webhook_handler through the GitHub App.
  2. webhook_handler reads the Allowlist to determine which downstream repositories should receive the event. The Allowlist is fetched from a remote source and cached in Redis to reduce follow-up calls.
  3. The payload is then passed through directly to the downstream repositories.
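The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in #7847: `AllowlistCache`, `should_forward`, `handle_webhook`, and the `dispatch` callback are hypothetical names, the cache is a plain in-process object standing in for Redis, and the Allowlist is assumed to be a simple list of repository names.

```python
from typing import Callable, List, Optional

# PR actions forwarded by L1, per the flow above. "synchronize" is the
# action name GitHub uses in pull_request webhook payloads.
FORWARDED_PR_ACTIONS = {"opened", "reopened", "synchronize", "closed"}


class AllowlistCache:
    """Hypothetical stand-in for the Redis-backed Allowlist cache."""

    def __init__(self, fetch_remote: Callable[[], List[str]]):
        self._fetch_remote = fetch_remote
        self._cached: Optional[List[str]] = None

    def get(self) -> List[str]:
        # Only hit the remote source when nothing is cached yet.
        if self._cached is None:
            self._cached = list(self._fetch_remote())
        return self._cached


def should_forward(event_type: str, payload: dict) -> bool:
    """Return True for the upstream events that L1 relays."""
    if event_type == "push":
        return True
    if event_type == "pull_request":
        return payload.get("action") in FORWARDED_PR_ACTIONS
    return False


def handle_webhook(
    event_type: str,
    payload: dict,
    allowlist: AllowlistCache,
    dispatch: Callable[[str, str, dict], None],
) -> List[str]:
    """Forward the untouched payload to every allowlisted downstream repo."""
    if not should_forward(event_type, payload):
        return []
    targets = allowlist.get()
    for repo in targets:
        # Passthrough: the payload is not modified before dispatch.
        dispatch(repo, event_type, payload)
    return list(targets)
```

Passing `dispatch` in as a callback keeps the sketch independent of how the downstream repository is actually triggered, which this issue does not specify.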

For the concrete L1 implementation, see #7847.

L2

%%{init: {"theme": "base"}}%%
sequenceDiagram
    participant U as UpStream Repo
    participant W as webhook_handler
    participant R as Redis
    participant S as result_handler
    participant D as DownStream Repo
    participant H as HUD

    U->>W: PR/Push event trigger
    W->>R: Get Allowlist
    W->>D: Passthrough Payload

    rect rgb(240, 240, 240)
    Note over S, D: Creating In Progress in HUD
    D->>S: In progress call
    S->>R: Get Allowlist
    S->>H: Show in progress on HUD
    end
    rect rgb(240, 240, 240)
    Note over S, D: Updating Status in HUD
    D->>S: Completed workflow run call
    S->>R: Get Allowlist
    S->>H: Show completed on HUD
    end

L2 adds the ability to report results back to HUD on top of L1. The first half of the diagram is the same as L1: the upstream event enters webhook_handler, the Allowlist is read, and the downstream repository is triggered. The new part is the two gray sections in the second half:

  • When the downstream workflow run starts, it sends an In progress callback to result_handler.
    1. DownStream Repo actively sends a callback request to result_handler from the first job in the workflow.
    2. After authenticating the request, result_handler reads the Allowlist to verify whether the request from this DownStream Repo belongs to L2 or above.
    3. If the request is valid, the run information triggered by the workflow is written to HUD and marked as in progress.
  • When the workflow run finishes, it sends a Completed callback.
    1. DownStream Repo actively sends a callback request to result_handler from the last job in the workflow.
    2. After authenticating the request, result_handler reads the Allowlist to verify whether the request from this DownStream Repo belongs to L2 or above.
    3. If the request is valid, the run information triggered by the workflow is written to HUD and marked as completed.
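Both callbacks follow the same authenticate, check level, write-to-HUD sequence, which could be sketched as below. Everything here is an assumption made for illustration: the `Callback` fields, the Allowlist being a mapping from repository name to onboarded level, and the `authenticate` and `write_to_hud` callbacks (the issue does not say how requests are authenticated or how HUD is written).

```python
from dataclasses import dataclass
from typing import Callable, Dict

VALID_STATUSES = ("in_progress", "completed")


@dataclass
class Callback:
    """Hypothetical shape of a downstream callback request."""
    repo: str    # downstream repository sending the callback
    run_id: int  # downstream workflow run id
    status: str  # "in_progress" (first job) or "completed" (last job)
    token: str   # credential presented by the downstream repo


def handle_result(
    cb: Callback,
    allowlist: Dict[str, int],  # assumed shape: repo name -> level (1..4)
    authenticate: Callable[[Callback], bool],
    write_to_hud: Callable[[str, int, str], None],
) -> str:
    """Process one callback and report what happened as a short string."""
    # Step 2a: authenticate the request before trusting any field in it.
    if not authenticate(cb):
        return "unauthenticated"
    # Step 2b: only repos onboarded at L2 or above may report results.
    if allowlist.get(cb.repo, 0) < 2:
        return "not_allowed"
    if cb.status not in VALID_STATUSES:
        return "bad_status"
    # Step 3: record the run in HUD with the reported status.
    write_to_hud(cb.repo, cb.run_id, cb.status)
    return "ok"
```

Authentication runs before the Allowlist lookup so that an unauthenticated caller learns nothing about which repositories are onboarded.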

L3

%%{init: {"theme": "base"}}%%
sequenceDiagram
    participant U as UpStream Repo
    participant W as webhook_handler
    participant R as Redis
    participant RH as result_handler
    participant D as DownStream Repo
    participant H as HUD

    U->>W: PR/Push event trigger
    W->>R: Get Allowlist
    W->>D: Passthrough payload

    rect rgb(240, 240, 240)
        Note over R, D: Scenario 1: label add before workflow run create
        U->>W: PR label add
        W->>R: Cache PR label info
    end

    D->>RH: In progress workflow run call
    RH->>R: Get Allowlist<br>Cache workflow run info
    RH->>H: Show in progress workflow run on HUD

    rect rgb(240, 240, 240)
        Note over R, D: Scenario 1
        RH->>R: Find PR label info record
        RH->>U: Create PR in_progress check run
    end

    rect rgb(240, 240, 240)
        Note over R, D: Scenario 2: label add during workflow run execute
        U->>W: PR label add
        W->>R: Find workflow run info
        W->>U: Create PR in_progress check run
    end

    D->>RH: Completed workflow run call
    RH->>R: Get Allowlist<br>Update workflow run info
    RH->>H: Show completed workflow run on HUD

    rect rgb(240, 240, 240)
        Note over R, D: Scenario 1 & 2
        RH->>U: Update PR completed check run
    end

    rect rgb(240, 240, 240)
        Note over R, D: Scenario 3: label add after run complete
        U->>W: PR label add
        W->>R: Find workflow run info
        W->>U: Create PR completed check run
    end

L3 keeps the HUD display capability from L2 and adds on-demand upstream PR check runs. Consistent with the label_only design in the RFC, this layer does not attach downstream results to every PR by default. Instead, the status of the corresponding backend is shown as a non-blocking upstream check only after a label is explicitly added to the PR. The key question in L3 is which arrives first: the label event or the downstream workflow run status. Because either order is possible, both pieces of information are stored temporarily in Redis, and the check run is created or updated as soon as both are available.

L3 has the following three scenarios:

  • Scenario 1: the label arrives before the workflow run starts.
    1. webhook_handler first caches the label information in Redis.
    2. The downstream workflow run starts and calls back to result_handler. After finding the matching label record in the cache, result_handler immediately creates an in_progress check run on the upstream PR.
    3. After the workflow run completes and DownStream Repo sends the completed callback to result_handler, result_handler updates both the workflow run status in Redis and the check run status on the PR.
  • Scenario 2: the workflow run is already executing when the label arrives.
    1. When result_handler receives the in progress callback from DownStream Repo, it first caches the workflow run information.
    2. After the user adds a label to the PR, webhook_handler looks up the cached workflow run record and backfills an in_progress check run.
    3. After the workflow run completes and DownStream Repo sends the completed callback to result_handler, result_handler updates both the workflow run status in Redis and the check run status on the PR.
  • Scenario 3: the downstream workflow run has already completed before the user adds the label.
    1. In this case, webhook_handler directly creates a completed check run based on the workflow run result already stored in Redis, without re-triggering execution.
    2. If the record for that workflow run has already been removed from Redis, the check run will not be created.
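The three scenarios reduce to a rendezvous between two events on shared state: whichever side arrives second finds the other side's record and creates or updates the check run. The sketch below illustrates this with plain in-memory containers standing in for Redis. The class name, the use of a single PR key, and the choice to always record the label (so that a later completed callback knows a check run exists) are assumptions made for this example, not part of the design above.

```python
from typing import Dict, Optional, Set


class L3Relay:
    """In-memory illustration of the L3 label / workflow-run rendezvous."""

    def __init__(self) -> None:
        self.labels: Set[str] = set()         # PRs that carry the label
        self.runs: Dict[str, str] = {}        # PR -> last reported run status
        self.check_runs: Dict[str, str] = {}  # PR -> upstream check run status

    def on_label_added(self, pr: str) -> None:
        """webhook_handler side: a label was added to the upstream PR."""
        self.labels.add(pr)
        status: Optional[str] = self.runs.get(pr)
        if status is not None:
            # Scenario 2 (run in progress) or Scenario 3 (run completed):
            # backfill the check run from the cached run record.
            self.check_runs[pr] = status
        # Otherwise this is Scenario 1, or the run record has expired:
        # no check run is created now.

    def on_run_callback(self, pr: str, status: str) -> None:
        """result_handler side: downstream reported a run status."""
        self.runs[pr] = status
        if pr in self.labels:
            # Scenario 1 creates the check run here; Scenarios 1 and 2
            # both update it here on completion.
            self.check_runs[pr] = status
```

For example, calling `on_run_callback("pr", "completed")` and then `on_label_added("pr")` leaves `check_runs["pr"] == "completed"` without any re-execution, which matches Scenario 3.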

Note:
The Redis cache TTL is tentatively set to 3 hours to align with the workflow integration requirements, so Redis data will not grow indefinitely.
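To illustrate how the TTL interacts with Scenario 3, here is a small in-memory store with per-key expiry. It is only a model of the behavior: `TTLStore` and its injectable `clock` are hypothetical, and a real deployment would rely on Redis's own key expiry rather than code like this.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

TTL_SECONDS = 3 * 60 * 60  # the tentative 3-hour TTL from the note above


class TTLStore:
    """Minimal key-value store whose entries expire after a fixed TTL."""

    def __init__(
        self,
        ttl: float = TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested
        self._data: Dict[str, Tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Each write restarts the TTL for that key.
        self._data[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: drop the record, as if it had never been cached.
            del self._data[key]
            return None
        return value
```

A label added more than three hours after the last write to a run record would see `get` return `None`, so no check run would be created.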

L4

The L4 scenario is the same as L3 Scenario 1. The difference is that L3 requires a label to trigger, while L4 triggers by default without requiring a label.
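Read this way, the difference between L3 and L4 comes down to a single predicate. The function below is a hypothetical sketch of that rule, assuming levels are represented as integers 1 through 4.

```python
def should_report_check_run(level: int, has_label: bool) -> bool:
    """Decide whether a downstream result becomes an upstream check run."""
    if level >= 4:
        # L4: always report, no label required.
        return True
    if level == 3:
        # L3: report only once the label has been added to the PR.
        return has_label
    # L1 and L2 never write status back to the upstream PR.
    return False
```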

Check run

%%{init: {"theme": "base"}}%%
sequenceDiagram
    participant U as UpStream Repo
    participant W as webhook_handler
    participant R as Redis
    participant RH as result_handler
    participant D as DownStream Repo
    participant H as HUD

    U->>W: Re-run workflow run trigger
    W->>R: Get Allowlist
    W->>D: Re-run dispatch
    D->>RH: In progress workflow run call
    RH->>R: Get Allowlist
    RH->>U: Update in progress check run
    RH->>H: Update in progress workflow run in HUD
    D->>RH: Completed workflow run call
    RH->>R: Get Allowlist
    RH->>U: Update completed check run
    RH->>H: Update completed workflow run on HUD

This diagram focuses on the re-run scenario:

  1. After the upstream side triggers a workflow run re-run request from a check run, webhook_handler first reads the Allowlist and then dispatches the re-run request to the downstream side.
  2. After the downstream side re-runs the workflow run, it uses result_handler to synchronize the in progress and completed states back to the upstream check run and HUD, in almost the same way as in L2 and L3.

Note:
When a check run is created, the workflow run's run_id is stored in the payload's external_id. When a re-run is triggered, the corresponding workflow run can be found by looking up the external_id in the check run payload.
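The external_id round trip can be sketched as two small helpers. The function names are hypothetical and this is not taken from the implementation. The request fields (`name`, `head_sha`, `status`, `external_id`) follow GitHub's check runs API, and the lookup assumes the re-run arrives as a check_run webhook event whose payload carries the check run under the `check_run` key.

```python
from typing import Optional


def build_check_run_request(name: str, head_sha: str, run_id: int) -> dict:
    """Build the body for creating an in_progress check run upstream."""
    return {
        "name": name,
        "head_sha": head_sha,
        "status": "in_progress",
        # external_id is a string field, so the numeric run id is stringified.
        "external_id": str(run_id),
    }


def run_id_from_rerun_event(payload: dict) -> Optional[int]:
    """Recover the downstream run id from a check run re-run payload."""
    external_id = payload.get("check_run", {}).get("external_id")
    if not external_id:
        return None
    try:
        return int(external_id)
    except ValueError:
        # Not a check run created by the relay; nothing to re-run.
        return None
```

Returning `None` for a missing or non-numeric external_id lets webhook_handler ignore re-run requests for check runs it did not create.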

cc @fffrog @KarhouTam
