[scheduler/docs] Add docs for the new scheduler module#2107

Closed
DiegoTavares wants to merge 8 commits into
AcademySoftwareFoundation:masterfrom
DiegoTavares:doc-scheduler

@DiegoTavares DiegoTavares commented Dec 12, 2025

Collaborator

Link the Issue(s) this Pull Request is related to.

Summarize your change.
[scheduler/docs] Add documentation for the new OpenCue Scheduler

Add new documentation:

  • Getting Started: docs/_docs/getting-started/deploying-scheduler.md
  • Reference: docs/_docs/reference/scheduler.md
  • News: docs/news/2025-12-12-distributed-scheduler-release.md

Update:

  • docs/README.md
  • rust/README.md

Related to PR:

This PR introduces a new module called "scheduler." This module is
responsible for the booking aspect of Cuebot and is designed to offload
this feature from the central module.

Rationale: Cuebot's booking logic depends on responding to each
HostReport with a new task that searches for layers to dispatch to the
reporting host. Consequently, each request generates a
[BookingQuery](https://github.com/AcademySoftwareFoundation/OpenCue/blob/master/cuebot/src/main/java/com/imageworks/spcue/dao/postgres/DispatchQuery.java),
which significantly impacts the database. As a result, scaling Cuebot is
limited by the need to optimize database capacity to handle complex
queries. This new module alleviates the booking workload from Cuebot.

Booking on the Scheduler is not triggered by host reports; instead, it
operates through an internal loop that searches for pending jobs and
seeks suitable matches from a cached view of the hosts in the database.
The scheduler organizes layers and hosts into clusters, with each
cluster representing a group of show and allocation combinations. This
structure allows multiple instances of the scheduler to share the load
without competing for work, which is a significant issue in Cuebot.

To enable Cuebot and the Scheduler to run concurrently without competing
for work, a new feature was added to Cuebot, as detailed in
AcademySoftwareFoundation#2087. This
feature adds an exclusion list of show and allocation combinations that
Cuebot should not book, and it can also halt booking for all shows
altogether.

---------

Signed-off-by: Diego Tavares <dtavares@imageworks.com>
@ramonfigueiredo

Collaborator

Please move docs/_docs/reference/scheduler.md to the Developer Guide (docs/_docs/developer-guide/scheduler.md), since it provides code examples and is a technical reference.

@ramonfigueiredo

Collaborator

You only ran the extract_nav_orders.py script, which extracts the nav_order values from all Markdown files in the _docs directory and writes them to nav_order_index.txt.

The following steps are still missing:

  1. Review and, if necessary, reorder the document indices in nav_order_index.txt.
  2. Run cd docs/ && python update_nav_order.py to apply the updated nav_order values back to the Markdown files based on `nav_order

For reference see commit: c856edd

[docs] Update nav order index

  1. python extract_nav_orders.py
  2. nav_order_index.txt
  3. python update_nav_order.py


- **Slack**: Join #opencue on [ASWF Slack](https://slack.aswf.io)
- **GitHub Issues**: [Report bugs or request features](https://github.com/AcademySoftwareFoundation/OpenCue/issues)
- **Discussions**: [Community Q&A](https://github.com/AcademySoftwareFoundation/OpenCue/discussions)

Collaborator

Suggested change
- **Discussions**: [Community Q&A](https://github.com/AcademySoftwareFoundation/OpenCue/discussions)

Collaborator

Comment on lines +128 to +129
- `queue.manual_tags_chunk_size`: How many manual tags per cluster (default: 10)
- `queue.hostname_tags_chunk_size`: How many hostname tags per cluster (default: 10)

Collaborator

Suggested change
- `queue.manual_tags_chunk_size`: How many manual tags per cluster (default: 10)
- `queue.hostname_tags_chunk_size`: How many hostname tags per cluster (default: 10)
- `queue.manual_tags_chunk_size`: How many manual tags per cluster (default: 100)
- `queue.hostname_tags_chunk_size`: How many hostname tags per cluster (default: 300)


## Code Structure


Collaborator

Double-check and update this tree. It does not reflect the current state of the branch DiegoTavares:doc-scheduler.

Or remove the Code Structure section

Comment on lines +1 to +191
---
layout: default
title: "December 12, 2025: Distributed Scheduler Release"
parent: News
nav_order: 0
---

# Distributed Scheduler Release

### A New Scalable Frame Dispatching Solution

#### December 12, 2025

---

We're excited to announce the release of the **Distributed Scheduler**, a new standalone Rust module that fundamentally reimagines how OpenCue handles frame dispatching at scale.

## The Challenge

Cuebot's traditional booking logic operates reactively: each host report triggers a booking query that searches for suitable layers to dispatch to the reporting host. This approach creates a significant database bottleneck where every host report generates a complex `BookingQuery`, and scaling the render farm becomes limited by the database's ability to handle these intensive queries. As farms grow larger, this database pressure becomes the primary constraint on system performance.

## The Solution: Distributed Scheduler

The new **Scheduler** module (`rust/crates/scheduler/`) is a complete architectural shift that offloads the booking workload from Cuebot. Instead of reacting to host reports, the scheduler operates through an **internal proactive loop** that continuously searches for pending jobs and intelligently matches them with a cached view of available hosts.

### Key Architectural Innovations

#### 1. Host Cache with In-Memory BTree Storage

The scheduler maintains a host caching system (`src/host_cache/`) that dramatically reduces database load:

- **Cached Host Statistics**: Host availability and resource information are fetched from the database and stored in memory, eliminating the need for repeated database queries during matching
- **BTree-Based Organization**: Hosts are organized in `BTreeMap` structures indexed by available cores and memory (`src/host_cache/cache.rs`), enabling efficient O(log n) lookups for resource-based matching
- **Expiration Strategy**: The cache automatically refreshes when stale, balancing freshness with performance
- **Checkout/Checkin Pattern**: Hosts are temporarily "checked out" during matching to prevent double-booking, then "checked in" when complete
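
As a rough illustration of the BTree-based index and the checkout/checkin pattern described above, here is a minimal Rust sketch. It is not the code in `src/host_cache/cache.rs`: the `Host` and `HostCache` types, their fields, and the `check_out`/`check_in` method names are assumptions made for this example.

```rust
use std::collections::{BTreeMap, HashSet};

/// Hypothetical host record; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Host {
    pub name: String,
    pub idle_cores: u32,
    pub idle_memory_mb: u64,
}

/// Hosts indexed by (idle cores, idle memory) so that a range scan can
/// skip every host with fewer cores than requested.
#[derive(Default)]
pub struct HostCache {
    by_resources: BTreeMap<(u32, u64), Vec<Host>>,
    checked_out: HashSet<String>,
}

impl HostCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Adds a host to the index; also used to return a host after matching.
    pub fn check_in(&mut self, host: Host) {
        self.checked_out.remove(&host.name);
        self.by_resources
            .entry((host.idle_cores, host.idle_memory_mb))
            .or_default()
            .push(host);
    }

    /// Removes and returns the first host, in (cores, memory) order, that
    /// satisfies the request. While checked out it cannot be booked again.
    pub fn check_out(&mut self, cores: u32, memory_mb: u64) -> Option<Host> {
        // Seek to the first key with enough cores (O(log n)), then scan
        // forward until the memory requirement is also met.
        let key = self
            .by_resources
            .range((cores, 0u64)..)
            .find(|((_, mem), _)| *mem >= memory_mb)
            .map(|(k, _)| *k)?;
        let bucket = self.by_resources.get_mut(&key)?;
        let host = bucket.pop()?;
        if bucket.is_empty() {
            self.by_resources.remove(&key);
        }
        self.checked_out.insert(host.name.clone());
        Some(host)
    }

    pub fn is_checked_out(&self, name: &str) -> bool {
        self.checked_out.contains(name)
    }
}
```

In this sketch a caller would `check_out` a host for a layer, attempt the dispatch, and then `check_in` the host with its updated idle resources. Cache expiration and refresh from the database are left out.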

#### 2. Intelligent Matching Algorithm

The matching service (`src/pipeline/matcher.rs`) implements a layer-to-host pairing system:

- **Resource-Aware Matching**: Automatically finds hosts with sufficient cores, memory, and GPU resources for each layer's requirements
- **Tag Filtering**: Validates allocation tags, manual tags, and hostname tags to ensure frames only run on appropriate hosts
- **Concurrency Control**: Uses semaphores to limit parallel matching operations and prevent resource contention
- **Metrics-Driven**: Tracks hosts attempted, wasted attempts, and candidates per layer for performance analysis
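
The resource and tag checks listed above could be sketched as follows. This is an illustration only, not the logic in `src/pipeline/matcher.rs`: the `LayerRequest` and `HostState` types are hypothetical, and the rule that a host needs at least one of the layer's tags is an assumption of this example. Semaphore-based concurrency control is not shown.

```rust
use std::collections::HashSet;

/// Hypothetical resource request for one layer.
pub struct LayerRequest {
    pub min_cores: u32,
    pub min_memory_mb: u64,
    pub min_gpus: u32,
    /// Assumption: a host qualifies if it carries at least one of these tags.
    pub tags: HashSet<String>,
}

/// Hypothetical view of a cached host.
pub struct HostState {
    pub idle_cores: u32,
    pub idle_memory_mb: u64,
    pub idle_gpus: u32,
    pub tags: HashSet<String>,
}

/// True when the host has enough idle resources and shares a tag with the layer.
pub fn host_matches(layer: &LayerRequest, host: &HostState) -> bool {
    let enough_resources = host.idle_cores >= layer.min_cores
        && host.idle_memory_mb >= layer.min_memory_mb
        && host.idle_gpus >= layer.min_gpus;
    let tag_ok = !layer.tags.is_disjoint(&host.tags);
    enough_resources && tag_ok
}

/// Returns the index of the first matching host together with the number of
/// hosts attempted, which is the kind of figure a metric could record.
pub fn find_candidate(layer: &LayerRequest, hosts: &[HostState]) -> (Option<usize>, usize) {
    let mut attempts = 0;
    for (index, host) in hosts.iter().enumerate() {
        attempts += 1;
        if host_matches(layer, host) {
            return (Some(index), attempts);
        }
    }
    (None, attempts)
}
```

Returning the attempt count alongside the match keeps the bookkeeping for "hosts attempted" and "wasted attempts" out of the matching predicate itself.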

#### 3. Cluster-Based Organization

One of the scheduler's most useful features is its cluster system (`src/cluster.rs`), which organizes work by **show + allocation combinations**:

- **Cluster Isolation**: Each cluster represents a unique show/facility/allocation grouping, allowing multiple scheduler instances to work independently without competing
- **Round-Robin Processing**: Clusters are processed in a round-robin fashion with intelligent backoff when work is exhausted
- **Sleep Mechanism**: Individual clusters can be put to sleep when no work is available, reducing wasted cycles
- **Scalability Foundation**: This architecture enables horizontal scaling—different scheduler instances can handle different clusters without conflicts
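
A minimal sketch of round-robin processing with a per-cluster sleep might look like the following. The `RoundRobin` and `ClusterSlot` types and their method names are hypothetical and are not taken from `src/cluster.rs`; the actual backoff policy may differ.

```rust
use std::time::{Duration, Instant};

/// Hypothetical bookkeeping for one cluster in the rotation.
pub struct ClusterSlot {
    pub name: String,
    /// While this instant is in the future, the cluster is skipped.
    pub asleep_until: Option<Instant>,
}

pub struct RoundRobin {
    slots: Vec<ClusterSlot>,
    cursor: usize,
}

impl RoundRobin {
    pub fn new(names: &[&str]) -> Self {
        let slots = names
            .iter()
            .map(|n| ClusterSlot { name: n.to_string(), asleep_until: None })
            .collect();
        Self { slots, cursor: 0 }
    }

    /// Returns the next awake cluster after the cursor, or `None` when
    /// every cluster is currently sleeping.
    pub fn next_awake(&mut self, now: Instant) -> Option<String> {
        for _ in 0..self.slots.len() {
            let idx = self.cursor;
            self.cursor = (self.cursor + 1) % self.slots.len();
            let slot = &self.slots[idx];
            let awake = slot.asleep_until.map_or(true, |until| now >= until);
            if awake {
                return Some(slot.name.clone());
            }
        }
        None
    }

    /// Called when a cluster had no pending work on its last visit.
    pub fn put_to_sleep(&mut self, name: &str, now: Instant, nap: Duration) {
        if let Some(slot) = self.slots.iter_mut().find(|s| s.name == name) {
            slot.asleep_until = Some(now + nap);
        }
    }
}
```

Passing `now` in as a parameter, rather than reading the clock inside the methods, keeps the rotation deterministic and easy to test.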

**Cluster Types**:
- **Allocation Clusters**: One per facility + show + allocation tag combination
- **Manual Tags**: Grouped into chunks (configurable size) per facility
- **Hostname Tags**: Grouped into chunks (configurable size) per facility
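
To make the chunking of manual tags concrete, here is a small Rust sketch. The `Cluster` enum and the `chunk_manual_tags` function are hypothetical names chosen for this example; only the idea of grouping tags into chunks of a configurable size per facility comes from the description above.

```rust
/// Hypothetical model of the three cluster types.
#[derive(Debug, PartialEq)]
pub enum Cluster {
    Allocation { facility: String, show: String, tag: String },
    ManualTags { facility: String, tags: Vec<String> },
    HostnameTags { facility: String, tags: Vec<String> },
}

/// Groups manual tags into clusters of at most `chunk_size` tags each.
/// A chunk size of zero is treated as one to avoid a panic in `chunks`.
pub fn chunk_manual_tags(facility: &str, tags: &[String], chunk_size: usize) -> Vec<Cluster> {
    tags.chunks(chunk_size.max(1))
        .map(|chunk| Cluster::ManualTags {
            facility: facility.to_string(),
            tags: chunk.to_vec(),
        })
        .collect()
}
```

With five tags and a chunk size of two, this yields three clusters, the last of which holds a single tag. Hostname tags could be grouped the same way.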

#### 4. Comprehensive Metrics

The scheduler exposes Prometheus metrics (`src/metrics/`) for deep observability:

- `scheduler_jobs_queried_total`: Total jobs fetched from database
- `scheduler_jobs_processed_total`: Total jobs successfully processed
- `scheduler_frames_dispatched_total`: Total frames dispatched to hosts
- `scheduler_candidates_per_layer`: Distribution of hosts needed per layer
- `scheduler_time_to_book_seconds`: Latency from frame creation to dispatch
- `scheduler_job_query_duration_seconds`: Database query performance
- `scheduler_no_candidate_iterations_total`: Failed matching attempts

Access metrics at `http://[scheduler-host]:9090/metrics`
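
For readers unfamiliar with what a `/metrics` endpoint returns, the sketch below renders one counter in the Prometheus text exposition format using only the standard library. The `Counter` type is hypothetical; the scheduler most likely relies on a metrics library rather than hand-written code like this.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical thread-safe counter that can render itself as Prometheus text.
pub struct Counter {
    name: &'static str,
    help: &'static str,
    value: AtomicU64,
}

impl Counter {
    pub const fn new(name: &'static str, help: &'static str) -> Self {
        Self { name, help, value: AtomicU64::new(0) }
    }

    /// Increments the counter by one.
    pub fn inc(&self) {
        self.value.fetch_add(1, Ordering::Relaxed);
    }

    /// Renders the HELP line, the TYPE line, and the current sample.
    pub fn render(&self) -> String {
        format!(
            "# HELP {0} {1}\n# TYPE {0} counter\n{0} {2}\n",
            self.name,
            self.help,
            self.value.load(Ordering::Relaxed)
        )
    }
}
```

A Prometheus server scraping the endpoint would read the final line of each rendered block as the current sample for that metric.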

## Coexistence with Cuebot

To enable the Scheduler and Cuebot to run concurrently without competing for work, new configuration options were added to Cuebot (PR #2087):

### Cuebot Exclusion Controls

In `opencue.properties`:

```properties
# Turn off booking for ALL allocations
dispatcher.turn_off_booking=false

# Exclude specific show:facility.allocation combinations
dispatcher.exclusion_list=show1:facility.alloc1,show2:facility.alloc2
```
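
The `show:facility.allocation` entry format shown above can be parsed with a few lines of Rust. This is only a sketch of the format: Cuebot itself is written in Java, and the `Exclusion` type and `parse_exclusion_list` function are hypothetical names for this example.

```rust
/// One parsed `show:facility.allocation` entry.
#[derive(Debug, PartialEq)]
pub struct Exclusion {
    pub show: String,
    pub facility: String,
    pub allocation: String,
}

/// Parses a comma-separated exclusion list. Empty entries are skipped and
/// malformed entries produce an error naming the offending entry.
pub fn parse_exclusion_list(raw: &str) -> Result<Vec<Exclusion>, String> {
    raw.split(',')
        .map(str::trim)
        .filter(|entry| !entry.is_empty())
        .map(|entry| -> Result<Exclusion, String> {
            let (show, rest) = entry
                .split_once(':')
                .ok_or_else(|| format!("missing ':' in '{}'", entry))?;
            let (facility, allocation) = rest
                .split_once('.')
                .ok_or_else(|| format!("missing '.' in '{}'", entry))?;
            Ok(Exclusion {
                show: show.to_string(),
                facility: facility.to_string(),
                allocation: allocation.to_string(),
            })
        })
        .collect()
}
```

Failing on a malformed entry, instead of silently skipping it, makes it harder for a typo to leave Cuebot and the Scheduler booking the same allocation.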

**Migration Strategy**:
1. Deploy the Scheduler with specific `--alloc_tags` and `--manual_tags`
2. Configure Cuebot's `dispatcher.exclusion_list` to skip those same tags
3. Monitor both systems to verify no overlap
4. Gradually migrate more clusters to the Scheduler
5. Eventually disable Cuebot booking entirely with `dispatcher.turn_off_booking=true`

## Running the Scheduler

### Prerequisites

```bash
# Install the protobuf compiler (a Rust toolchain, e.g. from rustup, is also required)
brew install protobuf # macOS
# or
sudo apt-get install protobuf-compiler # Ubuntu/Debian
```

### Build and Deploy

```bash
cd rust
cargo build --release -p scheduler

# Run scheduler for specific clusters
target/release/cue-scheduler \
--facility spi \
--alloc_tags=show1:tag1,show2:tag2 \
--manual_tags=manual_tag1,manual_tag2
```

### Configuration

The scheduler uses YAML configuration files with CLI overrides. Key settings:

- `facility`: Filter clusters to a specific facility
- `alloc_tags`: Comma-separated list of `show:tag` allocation combinations
- `manual_tags`: Comma-separated list of manual tags to process
- `queue.manual_tags_chunk_size`: How many manual tags per cluster (default: 10)
- `queue.hostname_tags_chunk_size`: How many hostname tags per cluster (default: 10)
- `queue.empty_job_cycles_before_quiting`: Exit after N idle rounds (optional)

A documented sample of the config file can be found at: `rust/config/scheduler.yaml`.

## Performance Benefits

Early testing shows significant improvements:

- **Database Load Reduction**: Fewer complex booking queries hitting the database
- **Improved Dispatch Latency**: Proactive matching reduces time-to-first-frame for new jobs
- **Horizontal Scalability**: Multiple scheduler instances can share the load by cluster
- **Better Resource Utilization**: In-memory host cache enables more sophisticated matching algorithms

## Current Limitations and Future Roadmap

### Current Version (v1.0)

- **Manual Cluster Distribution**: Operators must manually specify which clusters each scheduler instance handles via `--alloc_tags` and `--manual_tags`
- **Single Instance Recommended**: While multi-instance deployment is supported, cluster assignment is static and requires careful configuration

### Future Development

**Automatic Cluster Distribution** (Planned for 2026):
- Central control module for coordinating multiple scheduler instances
- Dynamic cluster assignment based on workload and scheduler availability
- Automatic scaling: spin up new scheduler instances as workload increases
- Self-healing: redistribute clusters when scheduler instances fail
- Load balancing: evenly distribute work across available schedulers

**Why This Matters**: The future control module will enable truly elastic scheduling—automatically scaling from a single scheduler instance during quiet periods to dozens of instances during crunch time, all without manual intervention.

## Migration Recommendation

**For v1.0**, we recommend running the Scheduler as a **single instance** to simplify deployment and avoid cluster assignment conflicts. The architecture fully supports distributed operation, but the automation layer for multi-instance coordination will arrive in a future release.

As you grow comfortable with the scheduler and your workload demands increase, you can:
1. Deploy additional instances with non-overlapping cluster assignments
2. Monitor performance and adjust cluster distribution manually
3. Prepare for the future control module that will automate this entirely

## Get Started

- **Documentation**: [Scheduler Architecture Guide](https://docs.opencue.io/docs/reference/scheduler/)
- **Source Code**: [`rust/crates/scheduler/`](https://github.com/AcademySoftwareFoundation/OpenCue/tree/new-scheduler/rust/crates/scheduler)
- **Configuration File**: [`config/scheduler.yaml`](https://github.com/AcademySoftwareFoundation/OpenCue/blob/new-scheduler/rust/config/scheduler.yaml)

## Community and Support

Have questions or feedback about the Distributed Scheduler?

- **Slack**: Join us in #opencue on [ASWF Slack](https://slack.aswf.io)
- **GitHub Discussions**: [OpenCue Discussions](https://github.com/AcademySoftwareFoundation/OpenCue/discussions)

---

The Distributed Scheduler represents a major step forward in OpenCue's evolution, enabling render farms to scale beyond previous limitations. We're excited to see how the community leverages this new architecture to build even larger and more efficient rendering pipelines.

Happy rendering!

---

[View the Release Notes](https://github.com/AcademySoftwareFoundation/OpenCue/releases) | [GitHub Repository](https://github.com/AcademySoftwareFoundation/OpenCue) | [Documentation](https://docs.opencue.io)

Collaborator

I recommend avoiding file paths and low-level technical details on the News page. The Developer Guide should be the primary place for technical and implementation-specific information. The News section should remain high-level and easy for non-technical readers to understand.

You can keep:

  • The new Scheduler module
  • Host reports generating a complex booking query
  • Hosts organized using BTreeMap data structures

I recommend removing:

  • References to internal source paths and files (e.g., host_cache, pipeline, cluster implementation details)

@ramonfigueiredo ramonfigueiredo self-requested a review December 13, 2025 01:57

@ramonfigueiredo ramonfigueiredo left a comment

Collaborator

Approved with minor changes

@DiegoTavares DiegoTavares changed the base branch from new-scheduler to master December 16, 2025 19:22
DiegoTavares and others added 5 commits December 16, 2025 13:20
…with CueWeb and REST Gateway (AcademySoftwareFoundation#2103)

**Link the Issue(s) this Pull Request is related to.**
- AcademySoftwareFoundation#2102

**Summarize your change.**

[sandbox/docs/cueweb/rest_gateway] Add full stack sandbox deployment
with CueWeb and REST Gateway
- Add deploy_opencue_full.sh script for one-command full stack
deployment
- Add docker-compose.full.yml with all services (db, flyway, cuebot,
rqd, rest-gateway, cueweb)
- Update cueweb/Dockerfile to use ARG for NEXT_PUBLIC_* variables
(build-time override)
- Add CueWeb reference documentation (docs/_docs/reference/cueweb.md)
- Add REST Gateway quick start guide
(docs/_docs/quick-starts/quick-start-rest-gateway.md)
- Update sandbox-testing.md with full stack deployment instructions and
desktop client tools
- Update sandbox/README.md with full stack deployment section

Services deployed by full stack:
- PostgreSQL database (port 5432)
- Flyway migrations
- Cuebot server (port 8443)
- RQD render daemon (port 8444)
- REST Gateway (port 8448)
- CueWeb UI (port 3000)
…on#2109)

Bumps [next](https://github.com/vercel/next.js) from 14.2.32 to 14.2.35.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/vercel/next.js/releases">next's
releases</a>.</em></p>
<blockquote>
<h2>v14.2.35</h2>
<p>Please see the <a
href="https://nextjs.org/blog/security-update-2025-12-11">Next.js
Security Update</a> for information about this security patch.</p>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/vercel/next.js/commit/7b940d9ce96faddb9f92ff40f5e35c34ace04eb2"><code>7b940d9</code></a>
v14.2.35</li>
<li><a
href="https://github.com/vercel/next.js/commit/7c1be85a2eb9bd704140ea0dca7a6fdf93e854a7"><code>7c1be85</code></a>
Backport <a
href="https://redirect.github.com/facebook/react/issues/35351">facebook/react#35351</a>
for 14.2.34 (<a
href="https://redirect.github.com/vercel/next.js/issues/87095">#87095</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/f3073688ce18878a674fdb9954da68e9d626a930"><code>f307368</code></a>
v14.2.34</li>
<li><a
href="https://github.com/vercel/next.js/commit/8e43882798208066d8fb4c44f9d4362bb4900a1b"><code>8e43882</code></a>
Update React Version (<a
href="https://redirect.github.com/vercel/next.js/issues/36">#36</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/385e8c286c21db9a15f4ec7bb68c8860caa08e3d"><code>385e8c2</code></a>
Backport Next.js changes to v14.2.34 (<a
href="https://redirect.github.com/vercel/next.js/issues/29">#29</a>)</li>
<li><a
href="https://github.com/vercel/next.js/commit/7a2cf51e785225c9dd94969dff80f75b41001708"><code>7a2cf51</code></a>
update version script</li>
<li><a
href="https://github.com/vercel/next.js/commit/778e7bf1211106a4a98298be219e29a28f05df10"><code>778e7bf</code></a>
lock swc binaries</li>
<li><a
href="https://github.com/vercel/next.js/commit/5a97b408c2d8668bed1642d382fc1d78ed3731cc"><code>5a97b40</code></a>
v14.2.33</li>
<li><a
href="https://github.com/vercel/next.js/commit/cb8882437c44f6d8c11f0c09ee4192afc3014a32"><code>cb88824</code></a>
backport(v14): omit searchParam data from FlightRouterState before
transport ...</li>
<li>See full diff in <a
href="https://github.com/vercel/next.js/compare/v14.2.32...v14.2.35">compare
view</a></li>
</ul>
</details>
<br />



Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Ramon Figueiredo <rfigueiredo@imageworks.com>
…ealth checks (AcademySoftwareFoundation#2110)

**Link the Issue(s) this Pull Request is related to.**
- AcademySoftwareFoundation#2111

**Summarize your change.**
[sandbox/rest_gateway] Fix Docker build compatibility for ARM64 and
health checks

- Switch REST Gateway base image from Rocky Linux 9 to
golang:1.24-bookworm (fixes DNF module YAML parsing errors on ARM64)
- Use debian:bookworm-slim for REST Gateway runtime image
- Add curl to REST Gateway image for health checks
- Fix CueWeb health check to use wget (Alpine doesn't have curl)
…ack, enhance metrics, dashboards, and documentation (AcademySoftwareFoundation#2086)

**Link the Issue(s) this Pull Request is related to.**
- AcademySoftwareFoundation#2085

**Summarize your change.**

[cuebot/pycue/proto/rust/sandbox/docs] Add event-driven monitoring stack
for OpenCue

Implement event-driven monitoring infrastructure enabling real-time and
historical analysis of render farm activity. Adds a Kafka +
Elasticsearch pipeline for collecting job, layer, frame, host, and proc
lifecycle events, with Prometheus and Grafana integration for live
dashboards and operational visibility.

Proto & Event Model:
- Define monitoring.proto with job/layer/frame/host/proc lifecycle
events
- Use proto composition pattern - embed Job, Layer, Frame, Host messages
- Exclude HostReportEvent from pipeline (too high frequency for
Kafka/ES)

Cuebot Event Publishing:
- Add KafkaEventPublisher for async event publishing to Kafka topics
- Add KafkaAdminClient for topic creation with configurable
partitions/retention
- Add MonitoringEventBuilder as Spring-managed bean for event
construction
- Hook publishing into FrameCompleteHandler, HostReportHandler,
DispatchSupportService, JobManagerSupport, DependManagerService
- Publish pickup time tracking events (FRAME_STARTED, FRAME_DISPATCHED)
- Add isFrameDispatchable() to DependDao for dependency checking

Prometheus Metrics:
- cue_frames_completed_total (with show, shot, state labels)
- cue_jobs_completed_total (with show, shot, state labels)
- cue_job_core_seconds histogram
- cue_layer_max_runtime_seconds histogram
- cue_layer_max_memory_bytes histogram

Rust monitoring-indexer Service:
- Add rust/crates/monitoring-indexer: standalone Kafka-to-Elasticsearch
indexer
- Async Kafka consumer with configurable batch processing
- Elasticsearch bulk indexing with date-based indices and field mappings
- Parallel event processing using rayon for CPU-bound operations
- Index templates for all event types (job, layer, frame, host, proc)
- Graceful handling of UnknownTopicOrPartition during startup

gRPC & PyCue:
- Add MonitoringInterface gRPC service
- Implement pycue monitoring wrapper with historical data API methods

Infrastructure (docker-compose.monitoring-full.yml):
- Zookeeper, Kafka, Kafka UI
- Elasticsearch, Kibana
- Prometheus (with cuebot scrape config)
- Grafana (with provisioned dashboard)
- monitoring-indexer service

Grafana Dashboard:
- Frame completion rates by state (DEAD/red, SUCCEEDED/green,
WAITING/yellow)
- Job completion by show
- Frame runtime and memory distribution
- Job core seconds distribution
- Pickup time metrics (FRAME_STARTED/FRAME_DISPATCHED)
- Layer max runtime/memory panels

Documentation:
- Architecture, concepts, and pipeline explanation
- Deployment and Quick Start guides
- User and Developer guides
- API Reference and tutorials
- Elasticsearch query reference guide

Utilities:
- sandbox/monitor_events.py: Example Kafka consumer
- sandbox/load_test_jobs.py: Test data generator with CLI args

Configuration (opt-in, disabled by default):
- monitoring.kafka.enabled, monitoring.kafka.bootstrap.servers
- monitoring.kafka.topic.partitions, .replication.factor, .retention.ms
- monitoring.elasticsearch.enabled, monitoring.elasticsearch.host

---------

Signed-off-by: Ramon Figueiredo <ramon.fgrd@gmail.com>
@DiegoTavares

Collaborator Author

A new PR was created to avoid conflicts, since the branch this PR was based on has been migrated to master.

2 participants