Implement initial L2 for CRCR by fffrog · Pull Request #43 · cosdt/test-infra

fffrog · 2026-04-13T14:43:15Z

Summary

Please refer to this comment for the overall implementation.

This PR implements the L2 level of the cross-repository CI relay described in [RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends. For the previous L1 implementation, please refer to this PR.

Together with the earlier L1 work, the implementation now covers the first two levels defined in the RFC. This PR adds:

L2: downstream repos can send their CI results to PyTorch and display them in PyTorch HUD.

Higher-level behaviors for L3 and L4 are intentionally left for follow-up work.

Architecture

The relay is split into two AWS Lambda functions:

webhook lambda function (Updated)
- receives GitHub webhook PR and push events from the upstream repo
- validates webhook signatures and authenticates with AWS Secrets Manager
- reads the downstream allowlist from the URL and stores it in Redis
- creates a JWT that is sent with the dispatch to downstream repos and validated later by the callback (result) lambda
- for opened/reopened/synchronize/closed actions, forwards repository_dispatch events to downstream repos

callback lambda function (Added)
- receives the downstream callback payload through a public Lambda function URL
- validates the callback payload using the JWT, the OIDC token, and content checks
- reads the downstream allowlist from the URL and stores it in Redis
- extracts CI result information from the payload and uploads it to PyTorch HUD
- records queue time and execution time for the later evolution to L3
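
For reviewers unfamiliar with the webhook-side check, the signature validation step can be sketched as follows. This is a minimal illustration using only the standard library, assuming the standard GitHub `X-Hub-Signature-256` header scheme; the function name `verify_github_signature` is hypothetical and is not taken from this PR's code.

```python
import hashlib
import hmac
from typing import Optional


def verify_github_signature(secret: str, body: bytes, signature_header: Optional[str]) -> bool:
    """Check an X-Hub-Signature-256 header against the raw request body.

    Hypothetical helper for illustration; the PR's lambda_function.py may differ.
    """
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    # GitHub signs the raw body with HMAC-SHA256 using the webhook secret.
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))
```

The comparison must run over the raw request bytes, before any JSON parsing, since re-serializing the payload can change the bytes that were signed.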

Changes

.github/
├── workflows/
│   └── _lambda-do-release-runners.yml     # Updates the Lambda release workflow to include cross-repo-ci-relay packaging/release
│
└── actions/
    └── cross-repo-ci-relay-callback/
        └── action.yml                      # Composite action used by downstream workflows to report status back to the relay/result endpoint

aws/lambda/cross_repo_ci_relay/
├── tests/                                 # Unit tests for allowlist/config/webhook/result/redis behavior
├── README.md                              # Project overview, local development, callback flow, and result-side validation steps
├── Makefile                               # Top-level local developer entrypoint for test / deploy / clean
├── local_server.py                        # FastAPI wrapper for local end-to-end testing of both webhook and result endpoints
├── requirements.txt                       # Python dependencies required by the relay Lambdas
│
├── utils/
│   ├── allowlist.py                       # Loads, parses, and queries the downstream allowlist by rollout level
│   ├── config.py                          # Shared runtime config loading and cached get_config() helper
│   ├── gh_helper.py                       # GitHub App, repository_dispatch, and GitHub file access helpers
│   ├── hud.py                             # HUD write helpers for downstream result reporting
│   ├── jwt_helper.py                      # Helpers for minting/verifying relay callback tokens
│   ├── redis_helper.py                    # Redis helpers for allowlist cache, OOT state, and timing data
│   └── misc.py                            # Shared TypedDict definitions and HTTPException
│
├── webhook/
│   ├── Makefile                           # Build/package/deploy commands for the webhook Lambda
│   ├── lambda_function.py                 # Webhook Lambda entrypoint: verifies GitHub webhook requests and routes events
│   └── event_handler.py                   # Handles PR/push events, resolves allowlist targets, and dispatches to downstream repos
│
└── callback/
    ├── Makefile                           # Build/package/deploy commands for the result Lambda
    ├── lambda_function.py                 # Result Lambda entrypoint: verifies callback token and GitHub OIDC token
    └── callback_handler.py                # Validates callback payloads, checks L2+ eligibility, stores state, and writes to HUD
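
To make the role of jwt_helper.py concrete, here is a rough sketch of what minting and verifying a relay callback token involves. It is not the PR's implementation: it assumes an HS256 token with `iat`/`exp` claims, builds the JWT by hand with the standard library, and the names `mint_token` and `verify_token` and the claim keys are hypothetical. The real helper may use a JWT library, a different algorithm, or different claims.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Any, Dict, Optional


def _b64url(data: bytes) -> str:
    # JWTs use unpadded URL-safe base64.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_token(secret: bytes, claims: Dict[str, Any], ttl_seconds: int,
               now: Optional[int] = None) -> str:
    """Mint an HS256 token (hypothetical helper, assumed claim layout)."""
    issued = int(time.time()) if now is None else now
    payload = dict(claims, iat=issued, exp=issued + ttl_seconds)
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":"), sort_keys=True).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(secret: bytes, token: str, now: Optional[int] = None) -> Dict[str, Any]:
    """Return the claims if signature and expiry are valid, else raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        payload = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except (ValueError, TypeError) as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(payload, dict):
        raise ValueError("malformed token")
    # Pin the algorithm so a token cannot downgrade the check.
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        raise ValueError("bad signature")
    current = int(time.time()) if now is None else now
    if current >= int(payload.get("exp", 0)):
        raise ValueError("token expired")
    return payload
```

In this sketch the webhook side would call `mint_token` when dispatching, and the callback side would call `verify_token` on the token echoed back by the downstream workflow.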

Usage

See README.md for more details.

Verification

We verified the following scenarios on our AWS Lambda deployment:

- Upstream PR create/reopen/synchronize and push events trigger the webhook lambda, which redispatches them to the downstream CI workflow in a different organization.
- The downstream workflow sends its callback payload through the added action to the result lambda, which extracts the CI result information and sends it to PyTorch HUD.

Terraform configuration

pytorch/ci-infra#415

Unit Tests

Unit Tests (Mock)

Security

Callback payload carries full upstream webhook data back to HUD: action.yml builds the callback body by mutating github.event.client_payload (which contains the entire original webhook payload: PR metadata, commits, author info) and adding status/conclusion/workflow_name/workflow_url on top. hud.py forwards this full blob verbatim to HUD with no relay-side filtering. HUD receives both the relay-verified verified_repo and an unvalidated body; if HUD trusts self-reported fields inside the body over verified_repo, a manipulated dispatch payload could tamper with HUD records.
Lambda callback URL is public and hardcoded: the endpoint is hardcoded in `action.yml` and exposed in a public action, making it trivially discoverable. OIDC verification blocks unauthorized HUD writes, but the endpoint has no rate limiting; request flooding can cause Lambda concurrency exhaustion or Redis connection saturation.
Only OIDC is used for verification: the callback lambda relies solely on GitHub OIDC token verification for authentication, without additional application-level secrets or signatures. If an attacker compromises a downstream repo's GitHub Actions permissions, they could forge authenticated requests to the callback endpoint. In addition, OIDC has its own limitations (e.g., token expiration, potential misconfigurations) that could lead to unauthorized access if not carefully managed.
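
As an illustration of the first concern, one possible relay-side mitigation would be to forward only an explicit set of result fields and always let the OIDC-verified repository win over anything in the body. This is a sketch of that idea, not what this PR does (the PR forwards the body as-is by design). The function name `build_hud_record` is hypothetical, and the field list assumes the four fields named above are the only ones HUD needs.

```python
from typing import Any, Dict, Mapping

# Fields the PR description says action.yml adds to the callback body.
ALLOWED_RESULT_FIELDS = ("status", "conclusion", "workflow_name", "workflow_url")


def build_hud_record(verified_repo: str, body: Mapping[str, Any]) -> Dict[str, Any]:
    """Build a filtered record for HUD (hypothetical helper for illustration)."""
    record: Dict[str, Any] = {}
    for field in ALLOWED_RESULT_FIELDS:
        value = body.get(field)
        # Keep only simple string values; drop nested or unexpected types.
        if isinstance(value, str):
            record[field] = value
    # Set last so a self-reported "verified_repo" in the body cannot override it.
    record["verified_repo"] = verified_repo
    return record
```

With this shape, a body that carries its own `verified_repo` or extra webhook data yields a record containing only the four result fields plus the relay-verified repository.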

HUD Interaction

Design Principle: Transparent Relay & Decoupling
The Relay Server acts as a lightweight data passthrough layer. It does not define or parse specific CI data formats; instead, it offloads data interpretation and validation to the HUD. This ensures complete decoupling between the relay infrastructure and business-specific data.
Security & Risk Mitigation
The relay uses OIDC authentication to guarantee the authenticity of the data source (Verified Repo). Its core responsibility is to ensure the data originates from the claimed repository, while security filtering and content compliance are enforced at the HUD level.

fffrog changed the title ~~initial L2 for crcr~~ initial L2 for CRCR Apr 13, 2026

can-gaa-hou changed the title ~~initial L2 for CRCR~~ Implement initial L2 for CRCR Apr 14, 2026

initial L2 for crcr

ecaf2d0

can-gaa-hou force-pushed the L2 branch from 7e21083 to ecaf2d0 Compare April 14, 2026 02:50

KarhouTam and others added 5 commits April 14, 2026 03:04

update

339d5ed

Refactor tests by removing redundant cases and consolidating assertions for clarity

2f53fff

Add blank lines for improved readability in test files

672cfb5

Handle ImportError for result_handler and event_handler in lambda functions

5322fb0

update (#44)

9b0fbdb

* update * update

fffrog wants to merge 6 commits into main from L2

3 participants
