Skip to content

Add Ray Cluster and Ray Job contributing guide for open source developers#986

Open
sallycr wants to merge 1 commit intomainfrom
sally/rayjob-contributing-guide
Open

Add Ray Cluster and Ray Job contributing guide for open source developers#986
sallycr wants to merge 1 commit intomainfrom
sally/rayjob-contributing-guide

Conversation

@sallycr
Copy link
Copy Markdown
Collaborator

@sallycr sallycr commented Mar 24, 2026

Summary

This PR adds a comprehensive contributing guide for open source developers working on Ray Cluster and Ray Job controller implementations in Michelangelo.

Motivation

External contributors and new team members need clear guidance on:

  • How the Ray controller architecture works
  • Where to make code changes for different scenarios
  • How to test Ray controller modifications
  • Common contribution workflows and patterns

This guide fills that gap by providing developer-focused documentation for the Ray codebase.

What's Included

Architecture & Navigation (Lines 1-106)

  • Codebase layout with file paths for proto, controllers, plugins, activities
  • FX dependency injection pattern explanation
  • States vs Conditions - critical distinction between proto enums (RayClusterState, RayJobState) and controller lifecycle conditions
  • Prerequisites one-liner for expected knowledge

Testing Foundations (Lines 107-161)

  • Shared test infrastructure patterns (~48 lines, concise)
  • Scheme registration, fake client setup, gomock patterns
  • mockClusterCache structure (shared by both controllers)
  • Table-driven test structure overview
  • Cross-references to controller-specific examples

Ray Cluster Controller (Lines 162-310)

  • Reconciliation flow with ASCII diagram
  • Key dependencies: scheduler queue, API handler, federated client, cluster cache
  • Configuration patterns (QPS/Burst via YAML)
  • Inline testing examples - mockSchedulerQueue, controller-specific patterns
  • Cross-references Testing Foundations patterns

Ray Job Controller (Lines 311-450)

  • Reconciliation flow with cluster dependency handling
  • Key differences table comparing cluster vs job controllers
  • Terminal state handling and immutability
  • Inline testing examples - two-object setup, cluster readiness checks
  • Cross-references Testing Foundations patterns

Starlark Plugin & Activities (Lines 451-532)

  • Plugin entry point and registration
  • 3 Starlark builtins: create_cluster, terminate_cluster, create_job
  • Activities table with 7 Cadence/Temporal activities
  • Sensor patterns for polling
  • Python layer: RayTask, task.star orchestration, RayDatasetIO
  • Plugin testing patterns (separate from controller tests)
  • TODO #559 link for tracking

KubeRay Integration (Lines 533-550)

  • REST client layer (brief, file-level pointers)
  • When to modify (adding CRD types)

Common Tasks (Lines 551-619)

  • 6 step-by-step code-change workflows:
    1. Adding new RayCluster configuration option
    2. Adding new RayJob state
    3. Modifying cluster provisioning flow
    4. Adding new Starlark function
    5. Modifying Python task resources
    6. Adding KubeRay CRD types
  • Each with concrete file paths and method names

Scope

This is a developer contributing guide, not an operator guide. It focuses on:

  • ✅ Code modifications and controller patterns
  • ✅ How to add features and fix bugs
  • ✅ Testing patterns and workflows

It does NOT cover:

  • ❌ Deployment procedures
  • ❌ Operational best practices
  • ❌ End-user cluster configuration

Review Process

All team reviews completed with feedback incorporated:

Product Manager Review

  • ✅ Contributor-focused scope verified
  • ✅ Accessible to external open source contributors
  • ✅ Practical Common Tasks workflows
  • ✅ Operator doc reference updated to "(coming soon)"
  • ✅ PDF link replaced with stable cross-reference

Tech-Writer Review (11 items)

  • ✅ Passive voice converted to active imperative
  • ✅ Cross-reference pointers added to inline testing sections
  • ✅ States vs Conditions orienting sentence added
  • ✅ Key Differences table includes requeue interval
  • ✅ Common Tasks numbers removed from headings
  • ✅ Python Layer task.star description expanded
  • ✅ Prerequisites one-liner added
  • ✅ Inline comments added to code examples
  • ✅ Testing Foundations kept concise (~48 lines)

Engineer Review (Technical Accuracy)

  • ✅ All proto enum values verified correct (8 RayClusterState, 6 RayJobState)
  • ✅ All 17 file paths verified to exist
  • ✅ Reconciler structs match source code
  • ✅ Method names in reconciliation flows verified
  • ✅ Test patterns verified in both test files
  • 1 error found and fixed: Condition references (job controller uses constants, not string literals)

Structure Decision: Testing Before Controllers

The guide uses a "Testing Foundations before Controllers" structure (Modified Option A):

  • Rationale: Contributors encounter shared mock patterns before seeing them applied
  • Hybrid approach: Concise shared patterns (~48 lines) + inline controller-specific examples
  • Team vote: 3-1 in favor (engineer, tech-writer, team-lead preferred; PM preferred after)
  • PM concern addressed: Kept Testing Foundations brief to avoid front-loading

Key Design Decisions

  1. States vs Conditions as early section - Most common source of confusion for contributors
  2. Starlark Plugin as full section - Critical workflow execution path, not just implementation detail
  3. Real code snippets - All examples from actual codebase with file references
  4. Contributor-scoped throughout - No operational content, focus on code changes
  5. Comparison table - Highlights key differences between cluster and job controllers

Files Modified

  • docs/contributing/ray-contributing-guide.md (new file, 620 lines)

Testing

  • All proto enum values verified against source
  • All file paths verified to exist
  • All method names verified in controllers
  • Code examples verified in test files
  • Cross-references verified (uniflow-plugin-guide.md, how-to-write-apis.md)
  • GitHub issue link verified ([TODO] Implement Ray starlark plugin #559)

Follow-up Work

This guide focuses on Ray controllers. Future guides should cover:

  • Operator guide for deploying/operating Ray clusters (referenced as "coming soon")
  • Ray user guide for pipeline developers using RayTask

Related

  • Analysis: Engineer's technical validation verified 99.8% accuracy (1 error in 620 lines)
  • Related guides: uniflow-plugin-guide.md, how-to-write-apis.md
  • Team: 4 members (architect, engineer, tech-writer, product-manager)
  • Iterations: 3 review rounds with all feedback incorporated

Add comprehensive contributing guide for open source developers working on
Ray Cluster and Ray Job controller implementations. This guide provides:

- Architecture overview with codebase navigation map
- States vs Conditions explanation (proto enums vs controller tracking)
- Testing foundations with shared mock patterns
- Ray Cluster Controller deep dive (reconciliation, lifecycle, patterns)
- Ray Job Controller deep dive (cluster dependency, state machine)
- Starlark Plugin & Activities layer integration
- KubeRay CRD integration patterns
- Common code contribution tasks with step-by-step workflows

The guide is contributor-focused (not operator-focused), covering code
modifications, controller patterns, and testing. Includes real code snippets
with file references, ASCII flow diagrams, comparison tables, and practical
examples.

Reviewed and validated by:
- Engineer (technical accuracy - 1 error found and fixed)
- Tech-Writer (documentation quality - 11 items applied)
- Product-Manager (open source alignment - approved)

Total: 620 lines with all review feedback incorporated.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant