Releases: lablup/backend.ai

26.4.3

16 Apr 23:03
2c8419b

Features

  • Add scope resolver on EntityRefGQL so RBAC role scope connections can resolve the scope target (e.g., project, domain) directly in GraphQL. (#11107)
  • Add session_v2(id) GraphQL single-node query with RBAC-enforced single-session reads across GraphQL, REST v2, SDK, and CLI. (#11124)
  • Add image_id FK to kernels and image_ids to sessions for UUID-based image references (#11125)
  • Add sglang runtime variant presets (#11129)
  • Add admin API to refresh revisions for all active deployments, rebuilding each revision through DeploymentController so preset, deployment-config, and model_definition are re-resolved. Partial success is reported per deployment. (#11134)
  • Add missing filter and order fields to deployment and revision search APIs, and fix DeploymentOrders.updated_at crash. (#11154)
  • Expose deploying_revision on the ModelDeployment GraphQL node so clients can observe the revision currently being rolled out alongside current_revision. (#11156)
  • Support partial model_definition input from preset, vfolder model-definition.yaml, and request override via the new ModelDefinitionDraft type; the merged draft is resolved into the strict ModelDefinition only at the persistence boundary. (#11167)
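
The draft-then-resolve flow described in the last item can be pictured with a minimal sketch. This is not the Backend.AI implementation: `Draft`, `Strict`, `merge_drafts`, and `resolve` are hypothetical names, the two fields are chosen only for illustration, and the source order (preset, then model-definition.yaml, then request) is an assumption based on the phrase "request override".

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Draft:
    """Hypothetical partial definition: every field may be missing."""
    model_path: Optional[str] = None
    port: Optional[int] = None


@dataclass(frozen=True)
class Strict:
    """Hypothetical strict definition: every field is required."""
    model_path: str
    port: int


def merge_drafts(*drafts: Draft) -> Draft:
    """Merge drafts left to right; later drafts override earlier ones."""
    merged = {}
    for draft in drafts:
        for f in fields(draft):
            value = getattr(draft, f.name)
            if value is not None:
                merged[f.name] = value
    return Draft(**merged)


def resolve(draft: Draft) -> Strict:
    """Validate only once, after all sources have been merged."""
    missing = [f.name for f in fields(draft) if getattr(draft, f.name) is None]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return Strict(model_path=draft.model_path, port=draft.port)


# Assumed priority: preset < model-definition.yaml < request.
preset = Draft(port=8000)
from_yaml = Draft(model_path="/models/base")
request = Draft(port=9000)
final = resolve(merge_drafts(preset, from_yaml, request))
```

Keeping each source partial means no single source has to be complete; only the merged result is held to the strict schema.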

Improvements

  • Add lazy resolver fields for FK references across GQL Node types and deprecate legacy Graphene stub resolvers with v2 dataloader-based alternatives. (#11120)
  • Make DeploymentController the single authority for revision creation and activation, ensuring consistent preset application, RBAC, deployment strategy, and concurrency guards across all API paths (v2 and legacy). (#11126)
  • Unify legacy and v2 deployment creation through DeploymentController, removing the current_revision-direct-assignment bug that bypassed the DEPLOYING strategy lifecycle on initial deploy and dropping the now-redundant CHECK_PENDING lifecycle stage. (#11167)

Fixes

  • Add retry logic to etcd_put_json in TUI installer to handle etcd not being ready after halfstack startup (#10905)
  • Add RBAC validation to v2 vfolder GET endpoint (#11062)
  • Grant project_admin_page:read / domain_admin_page:read permissions to the auto-generated admin role when a new project or domain is created. (#11074)
  • Add --wait flag to halfstack docker compose up to ensure etcd passes its healthcheck before the installer proceeds to configuration, preventing a gRPC race condition during configure_manager() (#11081)
  • Fix prometheus query preset fixture failing with "Unconsumed column names: category" by using category_name alias with FixtureReferenceSpec (#11086)
  • Fix missing enum filter handling across deployment, session, kernel, vfolder, audit-log, login-session, and login-history domains, and standardize all enum filters to support equals/in/not_equals/not_in operators consistently. (#11092)
  • Fix auto-scaling rule last_triggered_at returning fake timestamps instead of null, and add NullableDateTimeFilter for filtering nullable datetime columns (#11102)
  • Add independent cookie_secure config under [security] to set the Secure flag on session cookies, decoupled from ssl_enabled for reverse proxy SSL termination environments. (#11105)
  • Increase default health check initial delay from 5 minutes to 30 minutes for all runtime variant generators to prevent premature failures during large model loading. (#11108)
  • Automatically sync RBAC project-member role bindings when users are added to or removed from a project via modifyGroup or modifyUser. (#11116)
  • Fix project creation to also create the member system role and backfill member roles for existing projects that were missing one. (#11118)
  • Fix GQL serialization error for AgentStatus enum in agentsV2 query. (#11127)
  • Correct category_id type to UUID in QueryDefinitionGQL and add GQL query/mutation support for prometheus query preset categories. (#11130)
  • Fix alembic merge migration declaring non-head ancestors as parents, which caused alembic upgrade head to fail. (#11131)
  • Fix inflated total_count in the admin_roles GraphQL query caused by an unused LEFT JOIN on ObjectPermissionRow. (#11132)
  • Allow auto scaling rule updates to clear nullable fields (min_threshold, max_threshold, min_replicas, max_replicas, prometheus_query_preset_id) by sending an explicit null, while keeping omitted fields unchanged. (#11137)
  • Validate that scope_id is a valid UUID for USER/PROJECT scope types in RBAC adapter, preventing email addresses from being stored as scope_id (#11138)
  • Add RBAC validation to v2 session GET endpoint using SingleEntityActionProcessor (#11143)
  • Fix incorrect vLLM default values in runtime variant preset fixture to match upstream defaults (#11144)
  • Populate deployment_revisions.model_definition on both legacy endpoint creation (POST /func/services) and modify flows by running them through the unified revision merge pipeline so all sources — deployment-config.yaml, revision preset, model-definition.yaml, and request — flow through RevisionDraft. On modify, the current revision is used as the lowest-priority base so untouched fields are preserved while yaml/preset refreshes remain authoritative. (#11145)
  • Fix ./bai admin deployment revision refresh failing with a TypeError on every deployment after the revision-merge pipeline refactor. (#11148)
  • Replace the non-existent name filter on deployment revisions with a revision_number filter and ordering across the DTO, GraphQL, REST, and CLI layers (./bai deployment revision search --name-contains is replaced by --revision-number). (#11150)
  • Rename the ModelDeployment.createdUser/createdUserV2 GraphQL fields to a single creator field. (#11152)
  • Backfill missing role-to-scope mappings in association_scopes_entities for migration-created SYSTEM roles so that GraphQL scope resolution no longer returns null (#11159)
  • Fix Prometheus range query 502 errors by accepting timezone-aware datetimes or Unix timestamps in CLI execute inputs (#11163)
  • Populate revision-level fields on the legacy GQL endpoint response during the initial DEPLOYING phase by falling back to deploying_revision when current_revision is unset, expose resource_slots on the v2 revision response, and stop hard-coding cluster mode / size / runtime variant in the model-card and vfolder deploy adapters so the revision preset's values are no longer silently overridden. (#11167)
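
The distinction between an explicit null and an omitted field (#11137) is commonly handled with a sentinel value. The following is a minimal sketch of that idea, not the project's code: `UNSET`, `Rule`, `RuleUpdate`, and `apply_update` are hypothetical names, and only two of the nullable fields listed above are modeled.

```python
from dataclasses import dataclass, replace
from typing import Optional, Union


class _Unset:
    """Marker type meaning 'the client did not send this field'."""

    def __repr__(self) -> str:
        return "UNSET"


UNSET = _Unset()


@dataclass(frozen=True)
class Rule:
    min_threshold: Optional[float]
    max_threshold: Optional[float]


@dataclass(frozen=True)
class RuleUpdate:
    # Hypothetical update input: UNSET means omitted, None means "clear it".
    min_threshold: Union[float, None, _Unset] = UNSET
    max_threshold: Union[float, None, _Unset] = UNSET


def apply_update(rule: Rule, update: RuleUpdate) -> Rule:
    """Return a copy of the rule with only the explicitly sent fields changed."""
    changes = {
        name: value
        for name, value in vars(update).items()
        if value is not UNSET
    }
    return replace(rule, **changes)


rule = Rule(min_threshold=0.2, max_threshold=0.8)
# max_threshold is cleared; min_threshold is left untouched.
updated = apply_update(rule, RuleUpdate(max_threshold=None))
```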

Documentation Updates

  • Add data migration testing guideline to alembic CLAUDE.md. (#10936)

Miscellaneous

  • Add a convenience script (scripts/refresh-graphql-gateway.sh) to regenerate the GraphQL schema, copy it to the project root, and optionally restart the Apollo Router gateway in one step. (#11091)

Full Changelog

Check out the full changelog until this release (26.4.3).

Full Commit Logs

Check out the full commit logs between release (26.4.2) and (26.4.3).

26.4.2

14 Apr 18:43
ed7fa62

Features

  • Add optional activate flag to add_model_revision API (#10468)
  • Add TCP appproxy worker installation support in dev installer (#10650)
  • Add the login_client_types table, model, data dataclass, and repository so administrators can register and manage login client types at runtime. (#10822)
  • Add owner_id (delegated user UUID) to EnqueueSessionInput for delegated session ownership when enqueuing v2 sessions. (#10845)
  • Expose the login_client_types entity via the Strawberry GraphQL schema: loginClientType(id) single query, loginClientTypes Connection query with filter/order/pagination, and createLoginClientType / updateLoginClientType / deleteLoginClientType mutations. (#10876)
  • Add ./bai CLI v2 commands for login_client_types: ./bai login-client-type list/get (any authenticated user) and ./bai admin login-client-type create/update/delete (super admin only). (#10878)
  • Add --otel-endpoint and --metric-access-cidr options to TUI installer, configure announce-addr for manager/agent/storage-proxy, and add [otel] blocks to app-proxy halfstack configs (#10880)
  • Add vLLM runtime variant preset fixtures with automatic runtime_variant_name FK resolution in fixture populate (#10889)
  • Add the login_client_type service layer, v2 DTOs, and an admin-only search path (LoginClientTypeAdminRepository / LoginClientTypeAdminService / LoginClientTypeAdminProcessors) with filtering, ordering, and pagination support via BatchQuerier. (#10923)
  • Add REST v2 CRUD endpoints for the login_client_types entity at /v2/login-client-types/, including a /v2/login-client-types/search endpoint with filtering, ordering, and pagination support. (#10924)
  • Replace the hard-coded LoginClientType enum with a foreign-key reference to the login_client_types table in login sessions, allowing administrators to manage client types dynamically. (#10925)
  • Add Client SDK v2 domain client and CLI v2 commands for the login_client_types entity: ./bai login-client-type get, ./bai admin login-client-type search/create/update/delete. (#10942)
  • Add PROMETHEUS auto-scaling metric source that queries Prometheus directly via query presets, with bidirectional scaling support (scale-out/in thresholds in a single rule). (#10993)
  • Add user_id filter to login session admin search and admin_unblock_user API to clear failed-login rate limit blocks (#11011)
  • Add creator_id column to vfolders and wire VFolder ownership GQL resolvers (user, project, creator) to DataLoaders for proper entity resolution. (#11018)
  • Add deployment-scoped Prometheus query presets with category system, description, rank, and vLLM example fixtures (#11072)
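
Bidirectional scaling in a single rule (#10993) can be sketched as one decision function. Everything here is an assumption made for illustration: the field names are borrowed from the 26.4.3 notes, the semantics (scale out above `max_threshold`, scale in below `min_threshold`), the non-null replica bounds, and the `step` parameter are hypothetical, and the real rule evaluation may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScalingRule:
    min_threshold: Optional[float]  # assumed: scale in below this value
    max_threshold: Optional[float]  # assumed: scale out above this value
    min_replicas: int
    max_replicas: int
    step: int = 1  # hypothetical: replicas added or removed per decision


def desired_replicas(rule: ScalingRule, current: int, metric: float) -> int:
    """Return the replica count after evaluating one metric sample."""
    target = current
    if rule.max_threshold is not None and metric > rule.max_threshold:
        target = current + rule.step
    elif rule.min_threshold is not None and metric < rule.min_threshold:
        target = current - rule.step
    # Clamp to the allowed replica range.
    return max(rule.min_replicas, min(rule.max_replicas, target))


rule = ScalingRule(min_threshold=0.2, max_threshold=0.8, min_replicas=1, max_replicas=4)
scaled_out = desired_replicas(rule, current=2, metric=0.9)
scaled_in = desired_replicas(rule, current=2, metric=0.1)
```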

Improvements

  • Delete login session rows on termination and record full session lifecycle events in login history (#11013)
  • Add explicit LabelMatcher to Prometheus query presets to support regex matching operators (#11025)
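
For context on the regex matching operators mentioned in #11025: PromQL defines four label-matching operators (`=`, `!=`, `=~`, `!~`). The sketch below renders a selector from a list of matchers. `Matcher` and `render_selector` are hypothetical stand-ins rather than the project's `LabelMatcher`, and the metric and label names are made up.

```python
from dataclasses import dataclass

# The four label-matching operators defined by PromQL.
OPERATORS = {"=", "!=", "=~", "!~"}


@dataclass(frozen=True)
class Matcher:
    """Hypothetical label matcher: label name, operator, value."""
    name: str
    op: str
    value: str

    def render(self) -> str:
        if self.op not in OPERATORS:
            raise ValueError(f"unsupported operator: {self.op}")
        # Escape backslashes and double quotes for a double-quoted PromQL string.
        escaped = self.value.replace("\\", "\\\\").replace('"', '\\"')
        return f'{self.name}{self.op}"{escaped}"'


def render_selector(metric: str, matchers: list) -> str:
    """Build `metric{label<op>"value",...}` from the given matchers."""
    inner = ",".join(m.render() for m in matchers)
    return f"{metric}{{{inner}}}"


selector = render_selector(
    "vllm_requests",  # made-up metric name
    [Matcher("model", "=~", "llama-.*"), Matcher("status", "!=", "error")],
)
```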

Fixes

  • Rename TooManyConcurrentLoginSessions error type from too-many-concurrent-logins to active-login-session-exists to match actual error semantics (#5691)
  • Fix imagify API handler that incorrectly parsed POST body as query parameters by switching from QueryParam to BodyParam (#5694)
  • Return HTTP 409 (Conflict) instead of 429 (Too Many Requests) for TooManyConcurrentLoginSessions error (#10992)
  • Re-read model definition from vfolder when legacy modify_endpoint creates a new revision, so on-disk file changes are reflected. Also trigger CHECK_REPLICA lifecycle on revision-level field changes to notify the deployment controller. (#10994)
  • Fix OIDC AUTHORIZE hook to read sToken from hook params before falling back to cookies, enabling token-login flow via JSON body. (#11002)
  • Fix GET /stream/session/{name}/execute 500 error by sharing a single PrivateContext between the stream handler and its lifecycle hook, so stream_execute_handlers is initialized on the instance the handler reads at request time. (#11003)
  • Fix per-container CUDA metric collection failing due to missing container.show() call in gather_container_measures (#11006)
  • Fix double /func/ prefix in session-mode GQL path causing HTTP 404 (#11007)
  • Fix 500 Internal Server Error when creating a session with an invalid or non-member project group by replacing plain ValueError with proper BackendAIError subclasses in query_userinfo(). (#11012)
  • Fix RBAC action validators silently bypassing permission denials; legacy processor paths now observe denials via log and metric instead of raising. (#11014)
  • Fix TERMINATED transition hook blocking session termination when model-definition.yaml is missing from storage for custom-runtime inference sessions. (#11019)
  • Fix endpoint destroy failing with UniqueViolationError on ix_endpoints_unique_name_when_not_destroyed by narrowing the partial unique index predicate to exclude DESTROYING/DESTROYED states. (#11020)
  • Make client_type_id optional in AuthorizeRequest so clients that do not specify a login client type (e.g., WebUI) can still authenticate, and add the missing migration for the login_client_type_id column on the login_sessions table. (#11022)
  • Fix GQL user adapter to handle not_equals and not_in operations in status and role filter conversion, which were previously silently ignored. (#11024)
  • Fix route health initial_delay calculation to use running_at instead of route creation time, preventing premature session termination for custom runtime variants with long model loading times. (#11029)
  • Add missing server_default to images.last_used_at column so that new image rows without an explicit last_used_at value no longer violate the NOT NULL constraint. (#11031)
  • Fix endpoint status to reflect route health check results instead of only lifecycle status (#11033)
  • Set Secure flag on session cookie when SSL is enabled. (#11035)
  • Fix Pydantic validation error when using orderBy in deployment-related GraphQL queries (autoScalingRules, deployments, replicas, accessTokens) (#11037)
  • Fix Prometheus metrics silently missing on Linux by separating the multiprocess setup module to prevent import-time ValueClass misfire. (#11038)
  • Fix orphan login_sessions rows after WebUI logout when authenticated via the keypair (sToken) login flow. (#11042)
  • Fix TypeError in TOTP hook during stoken login by using attribute access on the user Row object. (#11064)
  • Handle null HostConfig.DeviceRequests from Docker API in CUDA container measures to prevent TypeError. (#11070)
  • Bypass RBAC permission checks for superadmin users in all action validators so superadmin operations (e.g. project creation) no longer fail with NotEnoughPermission. (#11071)
  • Add RBAC validation to deployment get/update/destroy, fix keypair resource policy lookup by wrong column, and move resource-group CLI commands to admin scope (#11076)
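
The filter fix in #11024 concerns the `not_equals` and `not_in` operations being ignored during conversion. A minimal sketch of a filter that honors every operator it is given follows; `EnumFilter`, `Status`, and the `in_` field name are hypothetical, and the rule that all set operators must hold at once is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Generic, Optional, Sequence, TypeVar

E = TypeVar("E", bound=Enum)


class Status(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    DELETED = "deleted"


@dataclass(frozen=True)
class EnumFilter(Generic[E]):
    """Hypothetical filter; every operator that is set must hold."""
    equals: Optional[E] = None
    in_: Optional[Sequence[E]] = None
    not_equals: Optional[E] = None
    not_in: Optional[Sequence[E]] = None

    def matches(self, value: E) -> bool:
        if self.equals is not None and value != self.equals:
            return False
        if self.in_ is not None and value not in self.in_:
            return False
        # Skipping the two negative operators is the kind of bug described above.
        if self.not_equals is not None and value == self.not_equals:
            return False
        if self.not_in is not None and value in self.not_in:
            return False
        return True


statuses = [Status.ACTIVE, Status.INACTIVE, Status.DELETED]
visible = [s for s in statuses if EnumFilter(not_in=[Status.DELETED]).matches(s)]
```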

Test Updates

  • Add component test verifying that exceeding max_concurrent_logins returns HTTP 409 Conflict (#10997)

Full Changelog

Check out the full changelog until this release (26.4.2).

Full Commit Logs

Check out the full commit logs between release (26.4.1) and (26.4.2).

26.4.1

10 Apr 13:27
5250cb4

Fixes

  • Fix Pydantic validation error when creating ModelCard with null framework, label, or accessLevel fields via GraphQL (#10921)
  • Fix model service creation failing with Pydantic validation error when using fractional cuda.shares resource values (e.g., 2.5) (#10929)
  • Fix backfill migration referencing dropped permission_groups table; use denormalized permissions schema instead. (#10933)
  • Fix migration failure on BinarySize-suffixed resource_slots values (e.g. "32g", "4m"). (#10934)
  • Fix superadmin unable to see other users' vfolders via vfolder_nodes GQL query due to empty ADMIN_PERMISSIONS (#10939)
  • Skip event deserialization in event dispatcher when no consumer or subscriber is registered, preventing ModuleNotFoundError in appproxy coordinator (#10941)
  • Fix legacy GQL endpoint resolvers crashing when routings is empty by using is not None check instead of truthiness check, and add missing load_routes in load_all. (#10948)
  • Fix CLI v2 RuntimeError: no running event loop crash on aiohttp >= 3.13 by deferring CookieJar creation to an async context (#10954)
  • Fix IndexError in health check handlers caused by incompatible web.Request annotation in _wrap_api_handler; now use RequestCtx parameter type. (#10958)
  • Fix ModelDefinition.merge() corrupting start_command via index-based list merging by replacing deep_merge() with Pydantic-aware field-by-field merge functions (#10959)
  • Normalize None routings to empty list in endpoint to_data() and from_dto() to fix NoneType iteration crashes (#10965)
  • Fix GQL my_client_ip returning the hive-gateway proxy IP by forwarding the X-Forwarded-For header from hive-gateway to manager subgraph requests (note: allowed_client_ip configurations that whitelisted the hive-gateway IP as a workaround should be reviewed, as the manager will now see the real client IP via GQL) (#10966)
  • Expose AND/OR/NOT composition on the ModelCardV2Filter GraphQL input so composed filter queries no longer fail with Field "AND" is not defined (#10970)
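
The `start_command` corruption in #10959 is typical of index-based list merging. The sketch below reproduces that class of bug with a naive recursive merge and contrasts it with a merge that replaces lists as a whole. Both functions are illustrative: the actual `deep_merge()` may behave differently in detail, and the real fix uses Pydantic-aware field-by-field merge functions rather than plain dictionaries.

```python
from typing import Any


def deep_merge(base: Any, override: Any) -> Any:
    """Naive recursive merge that also merges lists position by position."""
    if isinstance(base, dict) and isinstance(override, dict):
        merged = dict(base)
        for key, value in override.items():
            merged[key] = deep_merge(base[key], value) if key in base else value
        return merged
    if isinstance(base, list) and isinstance(override, list):
        head = [deep_merge(b, o) for b, o in zip(base, override)]
        longer = base if len(base) > len(override) else override
        # Leftover items of the longer list survive, which mixes two commands.
        return head + longer[len(head):]
    return override


def merge_replacing_lists(base: dict, override: dict) -> dict:
    """Field-by-field merge in which a list is replaced as a whole."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            merged[key] = merge_replacing_lists(base[key], value)
        else:
            merged[key] = value
    return merged


base = {"start_command": ["python", "-m", "server", "--port", "8000"]}
override = {"start_command": ["./run.sh", "--fast"]}

corrupted = deep_merge(base, override)["start_command"]
# corrupted == ["./run.sh", "--fast", "server", "--port", "8000"]
replaced = merge_replacing_lists(base, override)["start_command"]
# replaced == ["./run.sh", "--fast"]
```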

Full Changelog

Check out the full changelog until this release (26.4.1).

Full Commit Logs

Check out the full commit logs between release (26.4.0) and (26.4.1).

26.4.0

09 Apr 07:13
d0705ac

Features

v2 API, SDK & CLI

Delivered REST v2 endpoints for all 26 API domains, migrated GraphQL to Strawberry-backed Pydantic types with PydanticNodeMixin and domain Adapters, and added the v2 client SDK and CLI with entity-command structure covering admin CRUD, user self-service, and raw GraphQL operations.

  • Add DataLoader for batched role assignment queries by user ID and my_roles field on UserV2 to prevent N+1 queries. (#9552)
  • Migrate AuditLog GraphQL API to Strawberry with cursor-based pagination and filtering support (#10065)
  • Add Strawberry GraphQL node type for ContainerRegistry to support RBAC entity resolution (#10093)
  • Add activeResourceOverview GraphQL field to Domain and Project types, exposing currently occupied resource slots and active session count. (#10095)
  • Add AND, OR, NOT logical operators to GraphQL filter types for complex boolean filter expressions. (#10250)
  • Migrate GraphQL layer to Pydantic-backed types by introducing PydanticNodeMixin, domain Adapters, and @strawberry.experimental.pydantic.input across all GQL domains. (#10299)
  • Add update_deployment_policy GQL mutation (#10300)
  • Add execute_bulk_purger_partial() function to support partial failure handling for bulk delete operations with savepoint-based transaction isolation (#10332)
  • Add UUID-based single-entity User CRUD (create/update/delete/purge) to the GraphQL v2 API, resolving six previously stubbed mutations. (#10403)
  • Add my_keypairs GraphQL query to list the current user's keypairs with filter, orderBy, and cursor/offset pagination support. (#10404)
  • Add options field to PurgeUserV2Input to control purge behavior (migrate shared vfolders, delegate endpoint ownership). (#10498)
  • Add REST v2 API endpoints for all 26 domains under the /v2/ prefix, reusing existing v2 DTO adapters shared with GraphQL. (#10499)
  • Add v2 client SDK and CLI with [admin] {entity} [{sub-entity}] {operation} command structure, ~/.backend.ai/ config system, and ./bai shortcut for all 26 domains. (#10504)
  • Add admin CRUD mutations to v2 API for Domain, Project, ContainerRegistry, and Image entities with full stack coverage (Adapter, REST v2, SDK v2, CLI v2, GQL) (#10516)
  • Add ./bai gql CLI command and SDK client for sending raw GraphQL queries, supporting both legacy and Strawberry schemas. (#10539)
  • Add VFolder adapter with admin search implementation including filter, order, and pagination support (#10569)
  • Add v2 session REST API with enqueue, search (admin/my/project-scoped), get, terminate (batch), start/shutdown-service, logs, and update endpoints (#10599)
  • Define VFolder Strawberry GQL node and nested field group types for the Graphene-to-Strawberry migration. (#10603)
  • Add VFolder filter and order-by Strawberry GQL types for v2 queries with AND/OR/NOT logical operators (#10604)
  • Add missing update, execute commands and admin CLI module for v2 prometheus-query-preset (#10606)
  • Add v2 export REST API, client SDK, and CLI commands for CSV export operations (#10609)
  • Add REST v2 and GraphQL endpoints for unassigning users from a project, with failure information for non-existent or unassigned user IDs. (#10632)
  • Add REST v2 endpoint for assigning users to projects with RBAC enforcement (#10633)
  • Add resource policy v2 API with Strawberry GQL, REST v2, SDK, and CLI for keypair/user/project resource policies, replacing JSON fields with typed structures. (#10634)
  • Add Strawberry GraphQL resolvers for container registry v2 with search, create, update, delete operations and full filter/orderBy/pagination support. (#10635)
  • Add resource group allow/disallow API for bidirectional domain and project association management with atomic add/remove in a single request (#10636)
  • Add resource preset v2 CRUD API with shared BinarySizeInput/BinarySizeInfo types for byte-size fields (#10637)
  • Add scope-based resource allocation v2 APIs with effective assignable computation and preset availability check (#10638)
  • Add keypair admin CRUD v2 API (search, get, create, update, delete) across GQL, REST, SDK, and CLI (#10640)
  • Add search_vfolders operation with repository, service, and processor layers (#10641)
  • Add search_user_vfolders operation with repository, service, and processor layers (#10642)
  • Add cloneable filter to VFolder my_search query pipeline (#10674)
  • Wire myVfolders GraphQL query resolver and register VFolderAdapter for end-to-end user vfolder search (#10677)
  • Add required role_id parameter to the assign-users-to-project API so that users receive a project role upon assignment. (#10688)
  • Add WebSocket transport (graphql-transport-ws protocol) for Strawberry GraphQL subscriptions, enabling the Hive Gateway to forward subscriptions to the manager. (#10739)
  • Add user.id (UUID) filter to ProjectUserFilter for /v2/projects/search (#10793)
  • Add project-scoped role search API (GQL, REST v2, SDK, CLI) to discover roles available within a project. (#10794)
  • Add my_storage_host_permissions query, deployVFolder mutation, admin SSH keypair management, storage_host filter for model card search, and in/not_in/i_in/i_not_in operators for StringFilter. (#10887)
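
Savepoint-based isolation for partial failures, as mentioned for `execute_bulk_purger_partial()` (#10332), can be illustrated with the standard-library `sqlite3` module. This is only a sketch of the general technique: `purge_partial`, the `items` table, and the trigger that forces one delete to fail are all hypothetical, and the project itself does not use SQLite for this.

```python
import sqlite3


def purge_partial(conn: sqlite3.Connection, ids: list) -> tuple:
    """Delete rows one by one; a failed delete does not undo the others."""
    deleted = []
    failed = []
    conn.execute("BEGIN")
    for row_id in ids:
        conn.execute("SAVEPOINT purge_one")
        try:
            conn.execute("DELETE FROM items WHERE id = ?", (row_id,))
            deleted.append(row_id)
        except sqlite3.DatabaseError:
            # Undo only this row's work, then continue with the next row.
            conn.execute("ROLLBACK TO SAVEPOINT purge_one")
            failed.append(row_id)
        finally:
            conn.execute("RELEASE SAVEPOINT purge_one")
    conn.execute("COMMIT")
    return deleted, failed


# isolation_level=None lets the sketch issue BEGIN/COMMIT itself.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, locked INTEGER NOT NULL)")
conn.execute(
    "CREATE TRIGGER keep_locked BEFORE DELETE ON items WHEN OLD.locked = 1 "
    "BEGIN SELECT RAISE(ABORT, 'row is locked'); END"
)
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, 0), (2, 1), (3, 0)])

deleted, failed = purge_partial(conn, [1, 2, 3])
remaining = [row[0] for row in conn.execute("SELECT id FROM items")]
```

Row 2 is locked, so its delete fails and is rolled back to the savepoint, while rows 1 and 3 are still removed in the same transaction.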

Pydantic DTO v2 Models

Defined comprehensive Pydantic v2 DTO types across all 26 API domains, establishing the typed Input/Node/Payload naming convention with SENTINEL pattern for nullable-clearable update fields and full unit test coverage.

  • Add Pydantic DTO v2 model structure for RBAC Role domain with Input, Node, and Payload types, establishing conventions for future domain DTOs. (#10253)
  • Add Pydantic DTO v2 models for auth and acl domains under common/dto/manager/v2/.
    The auth module includes 9 Input models and 10 Payload models with nested sub-models (AuthCredentialInfo, TwoFactorInfo, RoleInfo, SSHKeypairInfo, PasswordChangeInfo).
    The acl module includes GetPermissionsPayload and VFolderHostPermission re-export. (#10254)
  • Add Pydantic DTO v2 models for manager API domains: config (Dotfile, BootstrapScript), etcd (ConfigKey, ResourceMetadata), system (SystemVersion), infra (ScalingGroup, ResourcePreset, Usage, Watcher, ContainerRegistry), and operations (ErrorLog, ManagerStatus, Announcement, SchedulerOps, SessionEvents). (#10255)
  • Add Pydantic DTO v2 models for scaling_group, resource_group, resource_slot, and resource_policy domains under common/dto/manager/v2/, including request (Input), response (Node/Payload) models with SENTINEL pattern, and comprehensive unit tests for all four domains. (#10256)
  • Add Pydantic DTO v2 models for event_stream, streaming, and export domains under src/ai/backend/common/dto/manager/v2/, with comprehensive unit tests for all three packages. (#10257)
  • Add Pydantic DTO v2 models for session, compute_session, and agent domains under src/ai/backend/common/dto/manager/v2/, following the Input/Node/Payload naming convention with nested sub-models for semantic field grouping. (#10258)
  • Add Pydantic DTO v2 models for user, domain, and group entities under common/dto/manager/v2/, with nested sub-models, SENTINEL pattern for nullable-clearable update fields, and comprehensive unit tests. (#10259)
  • Add Pydantic DTO v2 models for image, scheduling_history, and auto_scaling_rule domains with full unit test coverage. (#10260)
  • Add Pydantic DTO v2 models for vfolder, object_storage, quota_scope, and storage domains under ai.backend.common.dto.manager.v2, with comprehensive unit tests for each domain. ([#10261](https://github.com/lablu...

26.4.0rc1

08 Apr 13:10
68d1033

Pre-release

Features

  • Add shell auto-completion support for Backend.AI CLI (#7021)

  • Add DataLoader for batched role assignment queries by user ID and my_roles field on UserV2 to prevent N+1 queries. (#9552)

  • Add RBAC validator infrastructure to Session actions following BEP-1048 patterns (#9624)

  • Migrate Session entities to RBAC database with entity-type permissions and AUTO scope associations (#9636)

  • Add CLI commands for prometheus query definition admin CRUD and execution (#9641)

  • Support cloning vfolders to a different quota scope by adding target_quota_scope_id parameter to the clone API. (#9741)

  • Add per-container metric collection support for CUDA devices (#9787)

  • Update ATOM plugin definition to conform to the Rebellions CDI architecture (#9788)

  • Implement Rolling Update deployment strategy (#9997)

  • Apply RBAC Creator pattern to ArtifactRevision for consistent entity creation and access control (#10021)

  • Apply RBAC validator for App config actions (#10028)

  • Apply RBAC validators to project (group) action processors for proper permission enforcement (#10029)

  • Apply RBAC validator for Model Artifact Registry actions (#10032)

  • Apply RBAC permission validators to model deployment service actions (#10033)

  • Apply RBAC validator for Keypair actions to enforce permission checks on create, get, update, delete, and purge operations (#10051)

  • Apply RBAC validator for User actions following the established pattern from Group, VFolder, and Session services (#10055)

  • Apply RBAC validators to Image service actions for proper authorization checks (#10059)

  • Migrate AuditLog GraphQL API to Strawberry with cursor-based pagination and filtering support (#10065)

  • Add self-service keypair issue/revoke/switch GraphQL mutations (#10066)

  • Add self-service IP allowlist mutation with lockout prevention (#10067)

  • Seed built-in container utilization metric query presets (gauge, rate, diff) previously hardcoded in ContainerUtilizationMetricService as configurable DB fixtures and Alembic data migration (#10090)

  • Add Strawberry GraphQL node type for ContainerRegistry to support RBAC entity resolution (#10093)

  • Add activeResourceOverview GraphQL field to Domain and Project types, exposing currently occupied resource slots and active session count. (#10095)

  • Add AND, OR, NOT logical operators to GraphQL filter types for complex boolean filter expressions. (#10250)

  • Add Pydantic DTO v2 model structure for RBAC Role domain with Input, Node, and Payload types, establishing conventions for future domain DTOs. (#10253)

  • Add Pydantic DTO v2 models for auth and acl domains under common/dto/manager/v2/.
    The auth module includes 9 Input models and 10 Payload models with nested sub-models (AuthCredentialInfo, TwoFactorInfo, RoleInfo, SSHKeypairInfo, PasswordChangeInfo).
    The acl module includes GetPermissionsPayload and VFolderHostPermission re-export. (#10254)

  • Add Pydantic DTO v2 models for manager API domains: config (Dotfile, BootstrapScript), etcd (ConfigKey, ResourceMetadata), system (SystemVersion), infra (ScalingGroup, ResourcePreset, Usage, Watcher, ContainerRegistry), and operations (ErrorLog, ManagerStatus, Announcement, SchedulerOps, SessionEvents). (#10255)

  • Add Pydantic DTO v2 models for scaling_group, resource_group, resource_slot, and resource_policy domains under common/dto/manager/v2/, including request (Input), response (Node/Payload) models with SENTINEL pattern, and comprehensive unit tests for all four domains. (#10256)

  • Add Pydantic DTO v2 models for event_stream, streaming, and export domains under src/ai/backend/common/dto/manager/v2/, with comprehensive unit tests for all three packages. (#10257)

  • Add Pydantic DTO v2 models for session, compute_session, and agent domains under src/ai/backend/common/dto/manager/v2/, following the Input/Node/Payload naming convention with nested sub-models for semantic field grouping. (#10258)

  • Add Pydantic DTO v2 models for user, domain, and group entities under common/dto/manager/v2/, with nested sub-models, SENTINEL pattern for nullable-clearable update fields, and comprehensive unit tests. (#10259)

  • Add Pydantic DTO v2 models for image, scheduling_history, and auto_scaling_rule domains with full unit test coverage. (#10260)

  • Add Pydantic DTO v2 models for vfolder, object_storage, quota_scope, and storage domains under ai.backend.common.dto.manager.v2, with comprehensive unit tests for each domain. (#10261)

  • Add Pydantic DTO v2 models for artifact, artifact_registry, and container_registry domains under common/dto/manager/v2/, including typed Input, Node, and Payload models with full unit test coverage. (#10262)

  • Add Pydantic DTO v2 models (types.py, request.py, response.py, __init__.py) for deployment, model_serving, and service_catalog manager API domains, with comprehensive unit tests. (#10263)

  • Add Pydantic DTO v2 models for notification, error_log, fair_share, and prometheus_query_preset domains under src/ai/backend/common/dto/manager/v2/, with comprehensive unit tests for each domain. (#10264)

  • Use deploying-revision image for new route session creation (#10271)

  • Integrate pyinfra deployment framework from backend.ai-installer into the unified install package, enabling production deployment via PyInfra alongside existing Docker-based development setup.

    Key additions:

    • PyInfra framework (runner, configs, os_packages) with enterprise config schemas (enabled=False in OSS)
    • OSS deploy scripts (os, halfstack, cores, monitor) - 318 files, 82K+ lines
    • TUI PACKAGE mode now offers choice: Release Package (existing) or Production Deployment (PyInfra)
    • Horizontal card layout with keyboard navigation for deployment type selection (#10275)

  • Consolidate deploying handlers and remove unused sub-steps (#10276)

  • Migrate GraphQL layer to Pydantic-backed types by introducing PydanticNodeMixin, domain Adapters, and @strawberry.experimental.pydantic.input across all GQL domains. (#10299)

  • Add update_deployment_policy GQL mutation (#10300)

  • Add internal health endpoint (/health) to the manager's internal app, and simplify the public health handler to a plain liveness probe. (#10308)

  • Add update_my_keypair GQL mutation to allow users to toggle their keypair's active state (is_active) (#10309)

  • Support resolving session entities in RBAC entity and permission scope queries (#10320)

  • Add execute_bulk_purger_partial() function to support partial failure handling for bulk delete operations with savepoint-based transaction isolation (#10332)

  • Add PROJECT_ADMIN_PAGE and DOMAIN_ADMIN_PAGE guarded RBAC entities for admin page access control. (#10334)

  • Add repository-layer support for filtering role assignments by permissions via PermissionConditions and exists_permission_combined (#10397)

  • Add UUID-based single-entity User CRUD (create/update/delete/purge) to the GraphQL v2 API, resolving six previously stubbed mutations. (#10403)

  • Add my_keypairs GraphQL query to list the current user's keypairs with filter, orderBy, and cursor/offset pagination support. (#10404)

  • Remove rollback_on_failure from DB schema, API, and related code (#10410)

  • Add last_used_at real column to images table, replacing the computed subquery with a direct DB column updated o...

26.3.3

24 Mar 09:00
b78e21c

Fixes

  • Fix ON CONFLICT column mismatch in vfolder invitation RBAC remigration causing InvalidColumnReferenceError during alembic upgrade. (#10471)

Full Changelog

Check out the full changelog until this release (26.3.3).

Full Commit Logs

Check out the full commit logs between release (26.3.2) and (26.3.3).

26.3.2

24 Mar 06:18
1e2c0e4

Fixes

  • Add safe Prometheus metric wrappers to prevent mmap error propagation into business logic (#10395)
  • Remove duplicate debug field in the webserver's config.toml.j2 template (#10423)
  • Add missing OpenTelemetrySpec initialization in the manager, enabling trace and log export to the OTEL Collector. (#10439)

Full Changelog

Check out the full changelog until this release (26.3.2).

Full Commit Logs

Check out the full commit logs between release (26.3.1) and (26.3.2).

26.3.1

24 Mar 01:27
3b75351

Features

  • Add AND, OR, NOT logical operators to GraphQL filter types for complex boolean filter expressions. (#10250)
  • Add internal health endpoint (/health) to the manager's internal app, and simplify the public health handler to a plain liveness probe. (#10308)
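
As a rough illustration of what composable AND/OR/NOT filters (#10250) allow, the sketch below evaluates a nested filter against a single record. The dict-based filter shape and the `matches` helper are assumptions made for this example only; the actual GraphQL filter input types may be structured differently.

```python
from typing import Any, Mapping


def matches(record: Mapping[str, Any], flt: Mapping[str, Any]) -> bool:
    """Evaluate a nested boolean filter against one record.

    Hypothetical filter shape: ordinary keys are equality checks, while the
    keys "AND" and "OR" hold lists of nested filters and "NOT" holds one
    nested filter. All top-level keys must hold for the record to match.
    """
    for key, value in flt.items():
        if key == "AND":
            ok = all(matches(record, sub) for sub in value)
        elif key == "OR":
            ok = any(matches(record, sub) for sub in value)
        elif key == "NOT":
            ok = not matches(record, value)
        else:
            ok = record.get(key) == value
        if not ok:
            return False
    return True


session = {"status": "RUNNING", "type": "batch"}
running_or_pending_but_not_interactive = {
    "OR": [{"status": "RUNNING"}, {"status": "PENDING"}],
    "NOT": {"type": "interactive"},
}
print(matches(session, running_or_pending_but_not_interactive))
```

An empty filter matches every record in this sketch, which mirrors the usual convention that omitting a filter applies no restriction.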

Improvements

  • Add TimeoutSeconds annotated type to centralize and simplify session timeout validation in request DTOs. (#10267)

Fixes

  • Fix global container registry RBAC migration to map to project scopes instead of domain scopes (#10082)
  • Fix resource preset check returning incorrect occupancy when scaling groups have no active sessions (#10268)
  • Fix session dependency GraphQL dataloaders returning empty results due to incorrect key mapping and missing eager loading (#10280)
  • Restore db and config_provider access for webapp plugins (OpenID, TOTP) after DI refactoring by injecting them into the root app context (#10292)
  • Add otp field to AuthorizeRequest and AuthorizeAction for TOTP two-factor authentication compatibility. (#10305)
  • Exclude unmeasurable metrics from utilization idle check instead of treating stat collection failures as 0% usage (#10316)
  • Restore etcd and valkey_stat access for webapp plugins (Cloud) after DI refactoring by injecting them into the root app context (#10318)
  • Route authenticated TOTP endpoints through web_handler instead of anonymous handler (#10345)
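
The idle-check fix in this list (#10316) can be pictured with a small sketch: a metric whose collection failed is skipped instead of being counted as 0% usage. The `is_idle` function, the use of `None` to mark an unmeasurable metric, and the choice to treat "nothing measurable" as not idle are all assumptions for illustration, not the manager's actual implementation.

```python
from typing import Mapping, Optional


def is_idle(
    utilization: Mapping[str, Optional[float]],
    thresholds: Mapping[str, float],
) -> bool:
    """Return True only if every measurable metric is below its threshold.

    A value of None stands for a metric that could not be measured
    (hypothetical convention). Such metrics are excluded from the check
    rather than being treated as 0% usage.
    """
    measured = {
        name: value
        for name, value in utilization.items()
        if value is not None and name in thresholds
    }
    if not measured:
        # Assumption: with no usable data, do not declare the session idle.
        return False
    return all(value < thresholds[name] for name, value in measured.items())


limits = {"cpu_util": 5.0, "cuda_util": 5.0}
# cuda_util failed to collect; only cpu_util is judged.
print(is_idle({"cpu_util": 40.0, "cuda_util": None}, limits))
```

Treating the failed `cuda_util` reading as 0% would leave the verdict resting on fabricated data, which is the behaviour the fix describes removing.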

Full Changelog

Check out the full changelog until this release (26.3.1).

Full Commit Logs

Check out the full commit logs between release (26.3.0) and (26.3.1).

25.11.4

20 Mar 03:45
9ca0c0c

Features

  • Execute Resource Usage Recalculation periodically (#5646)

Improvements

  • Add per-plugin timeout (120s) to gather_container_measures calls so a single hung plugin does not block stat collection from all other plugins (#9781)
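
A minimal sketch of the per-plugin timeout idea, assuming each plugin exposes an async collection callable: every call is wrapped in its own `asyncio.wait_for`, so a hung plugin times out alone while the others still return results. The `gather_with_timeout` helper and the plugin mapping are hypothetical names for this example; only the 120-second default comes from the release note.

```python
import asyncio
from typing import Awaitable, Callable, Mapping, Optional


async def gather_with_timeout(
    plugins: Mapping[str, Callable[[], Awaitable[list]]],
    timeout: float = 120.0,
) -> dict[str, list]:
    """Collect measures from all plugins, dropping any that time out."""

    async def run_one(name: str, collect: Callable[[], Awaitable[list]]):
        try:
            # Each plugin gets its own deadline instead of sharing one.
            measures: Optional[list] = await asyncio.wait_for(collect(), timeout)
        except asyncio.TimeoutError:
            measures = None  # a hung plugin is skipped, not fatal
        return name, measures

    results = await asyncio.gather(
        *(run_one(name, collect) for name, collect in plugins.items())
    )
    return {name: measures for name, measures in results if measures is not None}


async def fast_plugin() -> list:
    return ["cpu_used=1.5"]


async def hung_plugin() -> list:
    await asyncio.sleep(10)  # simulates a plugin that never returns in time
    return ["never reached"]


collected = asyncio.run(
    gather_with_timeout({"cpu": fast_plugin, "accel": hung_plugin}, timeout=0.05)
)
print(collected)
```

Because the plugins run concurrently, the total wait in this sketch is bounded by a single timeout rather than by the sum of all plugin timeouts.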

Fixes

  • Fix wrong value type of Valkey client address (#5649)
  • Remove the unnecessary asyncio.Lock from StatContext as self-concurrency is already prevented by TimerDelayPolicy.CANCEL and each collect method operates on independent data structures (#9256)
  • Fix container net_rx/net_tx stats reading host namespace counters due to unchecked setns() return value (#9681)
  • Pre-validate namespace path before netstat_ns() to prevent thread pool exhaustion from hung threads on stale network namespaces (#9782)
  • Exclude unmeasurable metrics from utilization idle check instead of treating stat collection failures as 0% usage (#10316)

Full Changelog

Check out the full changelog until this release (25.11.4).

Full Commit Logs

Check out the full commit logs between release (25.11.3) and (25.11.4).

26.3.0

16 Mar 02:05
37b5045

Features

Client SDK v2

Delivered a complete rewrite of the Backend.AI client library as SDK v2, providing a typed async HTTP client with injectable authentication, domain-specific client classes covering every API surface area, WebSocket/SSE streaming support, and Pydantic-typed request/response DTOs across all domains.

  • Add SDK v2 base architecture with async HTTP client, injectable auth, typed exceptions, and domain client stubs (#8903)
  • Add 204 No Content response support and typed_request_no_content() method to SDK v2 base client (#8936)
  • Add WebSocket and SSE connection support to SDK v2 BackendAIClient (#8937)
  • Add binary and multipart request support (upload/download) to SDK v2 base client and restore deferred session file operations (#8952)
  • Add anonymous client to SDK v2 for unauthenticated endpoints (#9478)
  • Add extra_headers support to BackendAIAnonymousClient, allowing per-request header injection (e.g. X-Forwarded-For, X-Forwarded-Host, X-Forwarded-Proto) when proxying requests to the manager (#9594)
  • Register a persistent BackendAIClientRegistry on the webserver and use it for the update-password-no-auth API handler (#9595)
  • Implement SDK v2 AuthClient with async methods for all auth domain REST endpoints (#8913)
  • Add SDK v2 client for Config and Infrastructure domains with full async REST API coverage (#8914)
  • Add SDK v2 client for Model Serving domain with all 14 API methods (list, search, create, delete, scale, sync, routing, token, errors, runtimes) (#8915)
  • Implement SDK v2 SessionClient with typed async methods for all session management endpoints (#8916)
  • Add SDK v2 domain clients for container registry and storage endpoints (#8918)
  • Add SDK v2 VFolderClient with full virtual folder management endpoint coverage including CRUD, file operations, sharing/invitations, and admin operations (#8919)
  • Add SDK v2 client for Template and Operations domains with full CRUD support (#8912)
  • Add SDK v2 StreamingClient for WebSocket (terminal, execute, proxy) and SSE (session events, background task events) operations (#8939)
  • Add SDK v2 NotificationClient for notification channel and rule management (#8940)
  • Add FairShareClient for SDK v2 covering all fair-share scheduling endpoints including domain/project/user fair shares, usage buckets, weight management, and resource group spec operations (#8941)
  • Add SDK v2 clients for scheduling_history, agent, and compute_session domains (#8942)
  • Add RBACClient for RBAC domain in SDK v2, covering role management, role assignment, scope management, and entity management (#8944)
  • Add SDK v2 DeploymentClient covering deployment CRUD, revision management, and route operations with typed request/response DTOs (#8945)
  • Add ExportClient to SDK v2 covering CSV data export and report endpoints (#8963)
  • Add group/project CRUD and member management methods to SDK v2 GroupClient (#8964)
  • Add ArtifactClient for artifact lifecycle management in SDK v2, covering import, update, cleanup, approval/rejection, and query endpoints (#8968)
  • Add ResourcePolicyClient to SDK v2 for managing keypair, project, and user resource policies with full CRUD and search operations (#8969)
  • Add ArtifactRegistryClient to SDK v2 for registry scanning and federation operations (#8970)
  • Expand AgentClient and ComputeSessionClient in SDK v2 with detail retrieval and resource stats methods (#8971)
  • Add dedicated DTO package for ContainerRegistryClient in SDK v2 (#8983)
  • Add ACLClient with DTO for permission queries in SDK v2 (#8984)
  • Add ErrorLogClient domain client in SDK v2 for error log management (append, list, mark-cleared) (#8985)
  • Add ScalingGroupClient with DTO for scaling group queries in SDK v2 (#8986)
  • Add ObjectStorageClient to SDK v2 for presigned URL generation and bucket management (#8987)
  • Add EventStreamClient for SSE-based session and background task event streaming in SDK v2 (#8988)
  • Add Domain CRUD client to Client SDK v2 with create, get, search, update, delete, and purge operations (#9059)
  • Add System version info client (SystemClient.get_versions()) to Client SDK v2 (#9060)
  • Add QuotaScope management to Client SDK v2 with get, search, set_quota, and unset_quota operations (#9061)
  • Add Image management client to Client SDK v2 with search, get, rescan, alias, dealias, forget, and purge operations (#9062)
  • Add User admin CRUD client (create, get, search, update, delete, purge) to Client SDK v2 (#9063)
  • Add ServiceAutoScalingRule CRUD client to Client SDK v2 (#9064)

REST API Handler Migration

Migrated all 39 API handler modules from module-level functions to a unified Handler class pattern with constructor dependency injection, Pydantic DTOs, and APIResponse wrappers via RouteRegistry, establishing a consistent foundation for the v2 REST API layer.

  • Add RouteRegistry class with per-route middleware support, enabling unified route registration as a foundation for handler migration (#9407)
  • Convert auth handlers to the new ApiHandler pattern with typed parameters and automatic request extraction via RouteRegistry (#9410)
  • Integrate automatic api_handler wrapping into RouteRegistry.add(), so all registered handlers get typed parameter extraction and APIResponse conversion without explicit decorators (#9414)
  • Migrate resource handlers to constructor DI pattern with ResourceHandler class and register_routes() (#9453)
  • Migrate session handlers to the Handler class pattern with constructor DI, Pydantic DTOs, and APIResponse, keeping create_app() as a backward-compatible shim (#9455)
  • Migrate stream and events handlers to Handler class pattern with constructor DI and Pydantic DTOs (#9444)
  • Migrate userconfig, domainconfig, and groupconfig handlers to the Handler class pattern with Pydantic DTOs and APIResponse (#9445)
  • Migrate cluster_template and session_template handlers to Handler class pattern with Pydantic DTOs and APIResponse (#9446)
  • Migrate service (model serving) handlers from module-level functions to the new ApiHandler class pattern with constructor DI, Pydantic DTOs, and APIResponse (#9447)
  • Migrate scaling_group, acl, logs, and ratelimit handlers to ApiHandler class pattern with Pydantic DTOs and APIResponse (#9448)
  • Migrate admin, spec, and group handlers to Handler class pattern with constructor DI, Pydantic DTOs, and APIResponse (#9449)
  • Migrate resource and manager handlers to Handler class pattern with Pydantic DTOs and APIResponse (#9450)
  • Migrate etcd and container_registry handlers to Handler class pattern with Pydantic DTOs, APIResponse, and RouteRegistry (#9456)
  • Migrate session and deployment handlers to constructor DI pattern with backward-compatible create_app() shims (#9457)
  • Migrate RBAC handlers to constructor dependency injection pattern with register_routes() entry point, keeping create_app() as a backward-compatible shim (#9458)
  • Migrate vfolder handlers from module-level functions to VFolderHandler class with constructor DI, Pydantic DTOs, and APIResponse (#9459)
  • Migrate artifact and artifact_registry handlers ...