Releases: lablup/backend.ai
26.4.3
Features
- Add `scope` resolver on `EntityRefGQL` so RBAC role scope connections can resolve the scope target (e.g., project, domain) directly in GraphQL. (#11107)
- Add `session_v2(id)` GraphQL single-node query with RBAC-enforced single-session reads across GraphQL, REST v2, SDK, and CLI. (#11124)
- Add image_id FK to kernels and image_ids to sessions for UUID-based image references (#11125)
- Add sglang runtime variant presets (#11129)
- Add admin API to refresh revisions for all active deployments, rebuilding each revision through `DeploymentController` so preset, deployment-config, and model_definition are re-resolved. Partial success is reported per deployment. (#11134)
- Add missing filter and order fields to deployment and revision search APIs, and fix `DeploymentOrders.updated_at` crash. (#11154)
- Expose `deploying_revision` on the `ModelDeployment` GraphQL node so clients can observe the revision currently being rolled out alongside `current_revision`. (#11156)
- Support partial `model_definition` input from preset, vfolder `model-definition.yaml`, and request override via the new `ModelDefinitionDraft` type; the merged draft is resolved into the strict `ModelDefinition` only at the persistence boundary. (#11167)
Improvements
- Add lazy resolver fields for FK references across GQL Node types and deprecate legacy Graphene stub resolvers with v2 dataloader-based alternatives. (#11120)
- Make DeploymentController the single authority for revision creation and activation, ensuring consistent preset application, RBAC, deployment strategy, and concurrency guards across all API paths (v2 and legacy). (#11126)
- Unify legacy and v2 deployment creation through `DeploymentController`, removing the `current_revision`-direct-assignment bug that bypassed the DEPLOYING strategy lifecycle on initial deploy and dropping the now-redundant `CHECK_PENDING` lifecycle stage. (#11167)
Fixes
- Add retry logic to `etcd_put_json` in TUI installer to handle etcd not being ready after halfstack startup (#10905)
- Add RBAC validation to v2 vfolder GET endpoint (#11062)
- Grant `project_admin_page:read` / `domain_admin_page:read` permissions to the auto-generated admin role when a new project or domain is created. (#11074)
- Add `--wait` flag to halfstack `docker compose up` to ensure etcd passes its healthcheck before the installer proceeds to configuration, preventing a gRPC race condition during `configure_manager()` (#11081)
- Fix prometheus query preset fixture failing with "Unconsumed column names: category" by using category_name alias with FixtureReferenceSpec (#11086)
- Fix missing enum filter handling across deployment, session, kernel, vfolder, audit-log, login-session, and login-history domains, and standardize all enum filters to support equals/in/not_equals/not_in operators consistently. (#11092)
- Fix auto-scaling rule `last_triggered_at` returning fake timestamps instead of null, and add `NullableDateTimeFilter` for filtering nullable datetime columns (#11102)
- Add independent `cookie_secure` config under `[security]` to set the Secure flag on session cookies, decoupled from `ssl_enabled` for reverse proxy SSL termination environments. (#11105)
- Increase default health check initial delay from 5 minutes to 30 minutes for all runtime variant generators to prevent premature failures during large model loading. (#11108)
- Automatically sync RBAC project-member role bindings when users are added to or removed from a project via `modifyGroup` or `modifyUser`. (#11116)
- Fix project creation to also create the member system role and backfill member roles for existing projects that were missing one. (#11118)
- Fix GQL serialization error for AgentStatus enum in agentsV2 query. (#11127)
- Correct category_id type to UUID in QueryDefinitionGQL and add GQL query/mutation support for prometheus query preset categories. (#11130)
- Fix alembic merge migration declaring non-head ancestors as parents, which caused `alembic upgrade head` to fail. (#11131)
- Fix inflated `total_count` in the `admin_roles` GraphQL query caused by an unused LEFT JOIN on `ObjectPermissionRow`. (#11132)
- Allow auto scaling rule updates to clear nullable fields (`min_threshold`, `max_threshold`, `min_replicas`, `max_replicas`, `prometheus_query_preset_id`) by sending an explicit `null`, while keeping omitted fields unchanged. (#11137)
- Validate that scope_id is a valid UUID for USER/PROJECT scope types in RBAC adapter, preventing email addresses from being stored as scope_id (#11138)
- Add RBAC validation to v2 session GET endpoint using SingleEntityActionProcessor (#11143)
- Fix incorrect vLLM default values in runtime variant preset fixture to match upstream defaults (#11144)
- Populate `deployment_revisions.model_definition` on both legacy endpoint creation (`POST /func/services`) and modify flows by running them through the unified revision merge pipeline so all sources (`deployment-config.yaml`, revision preset, `model-definition.yaml`, and request) flow through `RevisionDraft`. On modify, the current revision is used as the lowest-priority base so untouched fields are preserved while yaml/preset refreshes remain authoritative. (#11145)
- Fix `./bai admin deployment revision refresh` failing with a `TypeError` on every deployment after the revision-merge pipeline refactor. (#11148)
- Replace the non-existent `name` filter on deployment revisions with a `revision_number` filter and ordering across the DTO, GraphQL, REST, and CLI layers (`./bai deployment revision search --name-contains` is replaced by `--revision-number`). (#11150)
- Rename the `ModelDeployment.createdUser` / `createdUserV2` GraphQL fields to a single `creator` field. (#11152)
- Backfill missing role-to-scope mappings in `association_scopes_entities` for migration-created SYSTEM roles so that GraphQL scope resolution no longer returns null (#11159)
- Fix Prometheus range query 502 errors by accepting timezone-aware datetimes or Unix timestamps in CLI execute inputs (#11163)
- Populate revision-level fields on the legacy GQL endpoint response during the initial DEPLOYING phase by falling back to `deploying_revision` when `current_revision` is unset, expose `resource_slots` on the v2 revision response, and stop hard-coding cluster mode / size / runtime variant in the model-card and vfolder `deploy` adapters so the revision preset's values are no longer silently overridden. (#11167)
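
The auto scaling rule fix (#11137) depends on telling an omitted field apart from a field explicitly set to `null`. The following is a minimal sketch of that distinction, not the Backend.AI implementation: the `UNSET` sentinel, the `RuleUpdate` class, and the `apply_update` helper are hypothetical names chosen for illustration, and only the field names come from the changelog entry.

```python
from dataclasses import dataclass, fields
from typing import Any


class _Unset:
    """Marker type meaning "this field was not present in the request"."""

    def __repr__(self) -> str:
        return "UNSET"


UNSET: Any = _Unset()  # hypothetical sentinel, not Backend.AI's own


@dataclass
class RuleUpdate:
    # Field names mirror the changelog entry; the class itself is illustrative.
    min_threshold: Any = UNSET
    max_threshold: Any = UNSET
    min_replicas: Any = UNSET


def apply_update(current: dict[str, Any], update: RuleUpdate) -> dict[str, Any]:
    """Return a copy of `current` with only the explicitly sent fields changed."""
    result = dict(current)
    for f in fields(update):
        value = getattr(update, f.name)
        if value is UNSET:
            continue  # omitted in the request: keep the stored value
        result[f.name] = value  # explicit value, including None, overwrites
    return result


stored = {"min_threshold": 0.2, "max_threshold": 0.8, "min_replicas": 1}
# Only max_threshold is sent, and it is sent as an explicit null.
updated = apply_update(stored, RuleUpdate(max_threshold=None))
```

With a plain `Optional` default of `None`, the update above would be indistinguishable from a request that sent nothing, which is why a separate sentinel is assumed here.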
Documentation Updates
- Add data migration testing guideline to alembic CLAUDE.md. (#10936)
Miscellaneous
- Add a convenience script (`scripts/refresh-graphql-gateway.sh`) to regenerate the GraphQL schema, copy it to the project root, and optionally restart the Apollo Router gateway in one step. (#11091)
Full Changelog
Check out the full changelog until this release (26.4.3).
Full Commit Logs
Check out the full commit logs between release (26.4.2) and (26.4.3).
26.4.2
Features
- Add optional `activate` flag to `add_model_revision` API (#10468)
- Add TCP appproxy worker installation support in dev installer (#10650)
- Add the `login_client_types` table, model, data dataclass, and repository so administrators can register and manage login client types at runtime. (#10822)
- Add `owner_id` (delegated user UUID) to `EnqueueSessionInput` for delegated session ownership when enqueuing v2 sessions. (#10845)
- Expose the `login_client_types` entity via the Strawberry GraphQL schema: `loginClientType(id)` single query, `loginClientTypes` Connection query with filter/order/pagination, and `createLoginClientType` / `updateLoginClientType` / `deleteLoginClientType` mutations. (#10876)
- Add `./bai` CLI v2 commands for `login_client_types`: `./bai login-client-type list/get` (any authenticated user) and `./bai admin login-client-type create/update/delete` (super admin only). (#10878)
- Add `--otel-endpoint` and `--metric-access-cidr` options to TUI installer, configure announce-addr for manager/agent/storage-proxy, and add `[otel]` blocks to app-proxy halfstack configs (#10880)
- Add vLLM runtime variant preset fixtures with automatic runtime_variant_name FK resolution in fixture populate (#10889)
- Add the `login_client_type` service layer, v2 DTOs, and an admin-only search path (`LoginClientTypeAdminRepository` / `LoginClientTypeAdminService` / `LoginClientTypeAdminProcessors`) with filtering, ordering, and pagination support via `BatchQuerier`. (#10923)
- Add REST v2 CRUD endpoints for the `login_client_types` entity at `/v2/login-client-types/`, including a `/v2/login-client-types/search` endpoint with filtering, ordering, and pagination support. (#10924)
- Replace the hard-coded `LoginClientType` enum with a foreign-key reference to the `login_client_types` table in login sessions, allowing administrators to manage client types dynamically. (#10925)
- Add Client SDK v2 domain client and CLI v2 commands for the `login_client_types` entity: `./bai login-client-type get`, `./bai admin login-client-type search/create/update/delete`. (#10942)
- Add PROMETHEUS auto-scaling metric source that queries Prometheus directly via query presets, with bidirectional scaling support (scale-out/in thresholds in a single rule). (#10993)
- Add `user_id` filter to login session admin search and `admin_unblock_user` API to clear failed-login rate limit blocks (#11011)
- Add `creator_id` column to vfolders and wire VFolder ownership GQL resolvers (user, project, creator) to DataLoaders for proper entity resolution. (#11018)
- Add deployment-scoped Prometheus query presets with category system, description, rank, and vLLM example fixtures (#11072)
Improvements
- Delete login session rows on termination and record full session lifecycle events in login history (#11013)
- Add explicit LabelMatcher to Prometheus query presets to support regex matching operators (#11025)
Fixes
- Rename `TooManyConcurrentLoginSessions` error type from `too-many-concurrent-logins` to `active-login-session-exists` to match actual error semantics (#5691)
- Fix imagify API handler that incorrectly parsed POST body as query parameters by switching from QueryParam to BodyParam (#5694)
- Return HTTP 409 (Conflict) instead of 429 (Too Many Requests) for `TooManyConcurrentLoginSessions` error (#10992)
- Re-read model definition from vfolder when legacy `modify_endpoint` creates a new revision, so on-disk file changes are reflected. Also trigger `CHECK_REPLICA` lifecycle on revision-level field changes to notify the deployment controller. (#10994)
- Fix OIDC AUTHORIZE hook to read sToken from hook params before falling back to cookies, enabling token-login flow via JSON body. (#11002)
- Fix `GET /stream/session/{name}/execute` 500 error by sharing a single `PrivateContext` between the stream handler and its lifecycle hook, so `stream_execute_handlers` is initialized on the instance the handler reads at request time. (#11003)
- Fix per-container CUDA metric collection failing due to missing `container.show()` call in `gather_container_measures` (#11006)
- Fix double /func/ prefix in session-mode GQL path causing HTTP 404 (#11007)
- Fix 500 Internal Server Error when creating a session with an invalid or non-member project group by replacing plain `ValueError` with proper `BackendAIError` subclasses in `query_userinfo()`. (#11012)
- Fix RBAC action validators silently bypassing permission denials; legacy processor paths now observe denials via log and metric instead of raising. (#11014)
- Fix TERMINATED transition hook blocking session termination when model-definition.yaml is missing from storage for custom-runtime inference sessions. (#11019)
- Fix endpoint destroy failing with `UniqueViolationError` on `ix_endpoints_unique_name_when_not_destroyed` by narrowing the partial unique index predicate to exclude DESTROYING/DESTROYED states. (#11020)
- Make `client_type_id` optional in `AuthorizeRequest` so clients that do not specify a login client type (e.g., WebUI) can still authenticate, and add the missing migration for the `login_client_type_id` column on the `login_sessions` table. (#11022)
- Fix GQL user adapter to handle `not_equals` and `not_in` operations in status and role filter conversion, which were previously silently ignored. (#11024)
- Fix route health initial_delay calculation to use running_at instead of route creation time, preventing premature session termination for custom runtime variants with long model loading times. (#11029)
- Add missing `server_default` to `images.last_used_at` column so that new image rows without an explicit `last_used_at` value no longer violate the NOT NULL constraint. (#11031)
- Fix endpoint status to reflect route health check results instead of only lifecycle status (#11033)
- Set Secure flag on session cookie when SSL is enabled. (#11035)
- Fix Pydantic validation error when using `orderBy` in deployment-related GraphQL queries (`autoScalingRules`, `deployments`, `replicas`, `accessTokens`) (#11037)
- Fix Prometheus metrics silently missing on Linux by separating the multiprocess setup module to prevent import-time `ValueClass` misfire. (#11038)
- Fix orphan `login_sessions` rows after WebUI logout when authenticated via the keypair (sToken) login flow. (#11042)
- Fix `TypeError` in TOTP hook during stoken login by using attribute access on the user Row object. (#11064)
- Handle null `HostConfig.DeviceRequests` from Docker API in CUDA container measures to prevent `TypeError`. (#11070)
- Bypass RBAC permission checks for superadmin users in all action validators so superadmin operations (e.g. project creation) no longer fail with `NotEnoughPermission`. (#11071)
- Add RBAC validation to deployment get/update/destroy, fix keypair resource policy lookup by wrong column, and move resource-group CLI commands to admin scope (#11076)
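
To illustrate why the reference timestamp matters for the route health `initial_delay` fix (#11029), here is a small sketch under stated assumptions: `within_initial_delay` is a hypothetical helper, the timestamps are made up, and treating a not-yet-running route as still inside its grace period is an assumption rather than documented behaviour.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def within_initial_delay(
    start: Optional[datetime], initial_delay: timedelta, now: datetime
) -> bool:
    """Return True while health-check failures should still be ignored."""
    if start is None:
        # Assumption: a route that has not started running is still in its grace period.
        return True
    return now < start + initial_delay


created_at = datetime(2026, 1, 1, 10, 0, tzinfo=timezone.utc)
running_at = created_at + timedelta(minutes=20)  # e.g. slow image pull and scheduling
now = created_at + timedelta(minutes=35)
delay = timedelta(minutes=30)

by_creation = within_initial_delay(created_at, delay, now)  # grace period already over
by_running = within_initial_delay(running_at, delay, now)   # still inside grace period
```

Measured from creation time, the 30-minute window has already expired at the 35-minute mark, even though the container has only been running for 15 minutes. Measured from `running_at`, the route still has 15 minutes left to finish loading its model.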
Test Updates
- Add component test verifying that exceeding `max_concurrent_logins` returns HTTP 409 Conflict (#10997)
Full Changelog
Check out the full changelog until this release (26.4.2).
Full Commit Logs
Check out the full commit logs between release (26.4.1) and (26.4.2).
26.4.1
Fixes
- Fix Pydantic validation error when creating ModelCard with null framework, label, or accessLevel fields via GraphQL (#10921)
- Fix model service creation failing with Pydantic validation error when using fractional `cuda.shares` resource values (e.g., 2.5) (#10929)
- Fix backfill migration referencing dropped `permission_groups` table; use denormalized permissions schema instead. (#10933)
- Fix migration failure on BinarySize-suffixed resource_slots values (e.g. "32g", "4m"). (#10934)
- Fix superadmin unable to see other users' vfolders via vfolder_nodes GQL query due to empty ADMIN_PERMISSIONS (#10939)
- Skip event deserialization in event dispatcher when no consumer or subscriber is registered, preventing `ModuleNotFoundError` in appproxy coordinator (#10941)
- Fix legacy GQL endpoint resolvers crashing when routings is empty by using `is not None` check instead of truthiness check, and add missing `load_routes` in `load_all`. (#10948)
- Fix CLI v2 `RuntimeError: no running event loop` crash on aiohttp >= 3.13 by deferring `CookieJar` creation to an async context (#10954)
- Fix IndexError in health check handlers caused by incompatible `web.Request` annotation in `_wrap_api_handler`; now use `RequestCtx` parameter type. (#10958)
- Fix `ModelDefinition.merge()` corrupting `start_command` via index-based list merging by replacing `deep_merge()` with Pydantic-aware field-by-field merge functions (#10959)
- Normalize `None` routings to empty list in endpoint `to_data()` and `from_dto()` to fix NoneType iteration crashes (#10965)
- Fix GQL `my_client_ip` returning the hive-gateway proxy IP by forwarding the `X-Forwarded-For` header from hive-gateway to manager subgraph requests (note: `allowed_client_ip` configurations that whitelisted the hive-gateway IP as a workaround should be reviewed, as the manager will now see the real client IP via GQL) (#10966)
- Expose `AND` / `OR` / `NOT` composition on the `ModelCardV2Filter` GraphQL input so composed filter queries no longer fail with `Field "AND" is not defined` (#10970)
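
The `start_command` corruption fixed in #10959 is typical of merging lists element by element. The sketch below reproduces that class of bug with two hypothetical helpers, `index_merge` and `field_merge`; neither is the actual `deep_merge()` or the Pydantic-aware replacement from the codebase, and the command lines are invented for illustration.

```python
from typing import Any


def index_merge(base: Any, override: Any) -> Any:
    """Naive deep merge that merges lists position by position (the failure mode)."""
    if isinstance(base, dict) and isinstance(override, dict):
        merged = dict(base)
        for key, value in override.items():
            merged[key] = index_merge(base[key], value) if key in base else value
        return merged
    if isinstance(base, list) and isinstance(override, list):
        merged_list = list(base)
        for i, value in enumerate(override):
            if i < len(merged_list):
                merged_list[i] = index_merge(merged_list[i], value)
            else:
                merged_list.append(value)
        return merged_list
    return override


def field_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Field-by-field merge that treats a list such as a command line as one value."""
    merged = dict(base)
    merged.update(override)
    return merged


base = {"start_command": ["python", "-m", "server", "--port", "8000"]}
override = {"start_command": ["vllm", "serve"]}

broken = index_merge(base, override)
fixed = field_merge(base, override)
```

Here `broken["start_command"]` keeps the tail of the base command (`server --port 8000`) after the two overridden elements, while `fixed["start_command"]` is exactly the override.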
Full Changelog
Check out the full changelog until this release (26.4.1).
Full Commit Logs
Check out the full commit logs between release (26.4.0) and (26.4.1).
26.4.0
Features
v2 API, SDK & CLI
Delivered REST v2 endpoints for all 26 API domains, migrated GraphQL to Strawberry-backed Pydantic types with PydanticNodeMixin and domain Adapters, and added the v2 client SDK and CLI with entity-command structure covering admin CRUD, user self-service, and raw GraphQL operations.
- Add DataLoader for batched role assignment queries by user ID and `my_roles` field on UserV2 to prevent N+1 queries. (#9552)
- Migrate AuditLog GraphQL API to Strawberry with cursor-based pagination and filtering support (#10065)
- Add Strawberry GraphQL node type for ContainerRegistry to support RBAC entity resolution (#10093)
- Add `activeResourceOverview` GraphQL field to `Domain` and `Project` types, exposing currently occupied resource slots and active session count. (#10095)
- Add AND, OR, NOT logical operators to GraphQL filter types for complex boolean filter expressions. (#10250)
- Migrate GraphQL layer to Pydantic-backed types by introducing PydanticNodeMixin, domain Adapters, and @strawberry.experimental.pydantic.input across all GQL domains. (#10299)
- Add `update_deployment_policy` GQL mutation (#10300)
- Add `execute_bulk_purger_partial()` function to support partial failure handling for bulk delete operations with savepoint-based transaction isolation (#10332)
- Add UUID-based single-entity User CRUD (create/update/delete/purge) to the GraphQL v2 API, resolving six previously stubbed mutations. (#10403)
- Add `my_keypairs` GraphQL query to list the current user's keypairs with filter, orderBy, and cursor/offset pagination support. (#10404)
- Add `options` field to `PurgeUserV2Input` to control purge behavior (migrate shared vfolders, delegate endpoint ownership). (#10498)
- Add REST v2 API endpoints for all 26 domains under the `/v2/` prefix, reusing existing v2 DTO adapters shared with GraphQL. (#10499)
- Add v2 client SDK and CLI with `[admin] {entity} [{sub-entity}] {operation}` command structure, `~/.backend.ai/` config system, and `./bai` shortcut for all 26 domains. (#10504)
- Add admin CRUD mutations to v2 API for Domain, Project, ContainerRegistry, and Image entities with full stack coverage (Adapter, REST v2, SDK v2, CLI v2, GQL) (#10516)
- Add `./bai gql` CLI command and SDK client for sending raw GraphQL queries, supporting both legacy and Strawberry schemas. (#10539)
- Add VFolder adapter with admin search implementation including filter, order, and pagination support (#10569)
- Add v2 session REST API with enqueue, search (admin/my/project-scoped), get, terminate (batch), start/shutdown-service, logs, and update endpoints (#10599)
- Define VFolder Strawberry GQL node and nested field group types for the Graphene-to-Strawberry migration. (#10603)
- Add VFolder filter and order-by Strawberry GQL types for v2 queries with AND/OR/NOT logical operators (#10604)
- Add missing update, execute commands and admin CLI module for v2 prometheus-query-preset (#10606)
- Add v2 export REST API, client SDK, and CLI commands for CSV export operations (#10609)
- Add REST v2 and GraphQL endpoints for unassigning users from a project, with failure information for non-existent or unassigned user IDs. (#10632)
- Add REST v2 endpoint for assigning users to projects with RBAC enforcement (#10633)
- Add resource policy v2 API with Strawberry GQL, REST v2, SDK, and CLI for keypair/user/project resource policies, replacing JSON fields with typed structures. (#10634)
- Add Strawberry GraphQL resolvers for container registry v2 with search, create, update, delete operations and full filter/orderBy/pagination support. (#10635)
- Add resource group allow/disallow API for bidirectional domain and project association management with atomic add/remove in a single request (#10636)
- Add resource preset v2 CRUD API with shared BinarySizeInput/BinarySizeInfo types for byte-size fields (#10637)
- Add scope-based resource allocation v2 APIs with effective assignable computation and preset availability check (#10638)
- Add keypair admin CRUD v2 API (search, get, create, update, delete) across GQL, REST, SDK, and CLI (#10640)
- Add search_vfolders operation with repository, service, and processor layers (#10641)
- Add search_user_vfolders operation with repository, service, and processor layers (#10642)
- Add cloneable filter to VFolder my_search query pipeline (#10674)
- Wire `myVfolders` GraphQL query resolver and register `VFolderAdapter` for end-to-end user vfolder search (#10677)
- Add required `role_id` parameter to the assign-users-to-project API so that users receive a project role upon assignment. (#10688)
- Add WebSocket transport (`graphql-transport-ws` protocol) for Strawberry GraphQL subscriptions, enabling the Hive Gateway to forward subscriptions to the manager. (#10739)
- Add user.id (UUID) filter to ProjectUserFilter for /v2/projects/search (#10793)
- Add project-scoped role search API (GQL, REST v2, SDK, CLI) to discover roles available within a project. (#10794)
- Add `my_storage_host_permissions` query, `deployVFolder` mutation, admin SSH keypair management, `storage_host` filter for model card search, and `in` / `not_in` / `i_in` / `i_not_in` operators for StringFilter. (#10887)
Pydantic DTO v2 Models
Defined comprehensive Pydantic v2 DTO types across all 26 API domains, establishing the typed Input/Node/Payload naming convention with SENTINEL pattern for nullable-clearable update fields and full unit test coverage.
- Add Pydantic DTO v2 model structure for RBAC Role domain with Input, Node, and Payload types, establishing conventions for future domain DTOs. (#10253)
- Add Pydantic DTO v2 models for `auth` and `acl` domains under `common/dto/manager/v2/`.
  The `auth` module includes 9 Input models and 10 Payload models with nested sub-models (AuthCredentialInfo, TwoFactorInfo, RoleInfo, SSHKeypairInfo, PasswordChangeInfo).
  The `acl` module includes GetPermissionsPayload and VFolderHostPermission re-export. (#10254)
- Add Pydantic DTO v2 models for manager API domains: config (Dotfile, BootstrapScript), etcd (ConfigKey, ResourceMetadata), system (SystemVersion), infra (ScalingGroup, ResourcePreset, Usage, Watcher, ContainerRegistry), and operations (ErrorLog, ManagerStatus, Announcement, SchedulerOps, SessionEvents). (#10255)
- Add Pydantic DTO v2 models for `scaling_group`, `resource_group`, `resource_slot`, and `resource_policy` domains under `common/dto/manager/v2/`, including request (Input), response (Node/Payload) models with SENTINEL pattern, and comprehensive unit tests for all four domains. (#10256)
- Add Pydantic DTO v2 models for `event_stream`, `streaming`, and `export` domains under `src/ai/backend/common/dto/manager/v2/`, with comprehensive unit tests for all three packages. (#10257)
- Add Pydantic DTO v2 models for `session`, `compute_session`, and `agent` domains under `src/ai/backend/common/dto/manager/v2/`, following the Input/Node/Payload naming convention with nested sub-models for semantic field grouping. (#10258)
- Add Pydantic DTO v2 models for user, domain, and group entities under `common/dto/manager/v2/`, with nested sub-models, SENTINEL pattern for nullable-clearable update fields, and comprehensive unit tests. (#10259)
- Add Pydantic DTO v2 models for `image`, `scheduling_history`, and `auto_scaling_rule` domains with full unit test coverage. (#10260)
- Add Pydantic DTO v2 models for vfolder, object_storage, quota_scope, and storage domains under `ai.backend.common.dto.manager.v2`, with comprehensive unit tests for each domain. ([#10261](https://github.com/lablu...
26.4.0rc1
Features
- Add shell auto-completion support for Backend.AI CLI (#7021)
- Add DataLoader for batched role assignment queries by user ID and `my_roles` field on UserV2 to prevent N+1 queries. (#9552)
- Add RBAC validator infrastructure to Session actions following BEP-1048 patterns (#9624)
- Migrate Session entities to RBAC database with entity-type permissions and AUTO scope associations (#9636)
- Add CLI commands for prometheus query definition admin CRUD and execution (#9641)
- Support cloning vfolders to a different quota scope by adding `target_quota_scope_id` parameter to the clone API. (#9741)
- Add per-container metric collection support for CUDA devices (#9787)
- Update ATOM plugin definition to conform to the Rebellions CDI architecture (#9788)
- Implement Rolling Update deployment strategy (#9997)
- Apply RBAC Creator pattern to ArtifactRevision for consistent entity creation and access control (#10021)
- Apply RBAC validator for App config actions (#10028)
- Apply RBAC validators to project (group) action processors for proper permission enforcement (#10029)
- Apply RBAC validator for Model Artifact Registry actions (#10032)
- Apply RBAC permission validators to model deployment service actions (#10033)
- Apply RBAC validator for Keypair actions to enforce permission checks on create, get, update, delete, and purge operations (#10051)
- Apply RBAC validator for User actions following the established pattern from Group, VFolder, and Session services (#10055)
- Apply RBAC validators to Image service actions for proper authorization checks (#10059)
- Migrate AuditLog GraphQL API to Strawberry with cursor-based pagination and filtering support (#10065)
- Add self-service keypair issue/revoke/switch GraphQL mutations (#10066)
- Add self-service IP allowlist mutation with lockout prevention (#10067)
- Seed built-in container utilization metric query presets (gauge, rate, diff) previously hardcoded in `ContainerUtilizationMetricService` as configurable DB fixtures and Alembic data migration (#10090)
- Add Strawberry GraphQL node type for ContainerRegistry to support RBAC entity resolution (#10093)
- Add `activeResourceOverview` GraphQL field to `Domain` and `Project` types, exposing currently occupied resource slots and active session count. (#10095)
- Add AND, OR, NOT logical operators to GraphQL filter types for complex boolean filter expressions. (#10250)
- Add Pydantic DTO v2 model structure for RBAC Role domain with Input, Node, and Payload types, establishing conventions for future domain DTOs. (#10253)
- Add Pydantic DTO v2 models for `auth` and `acl` domains under `common/dto/manager/v2/`.
  The `auth` module includes 9 Input models and 10 Payload models with nested sub-models (AuthCredentialInfo, TwoFactorInfo, RoleInfo, SSHKeypairInfo, PasswordChangeInfo).
  The `acl` module includes GetPermissionsPayload and VFolderHostPermission re-export. (#10254)
- Add Pydantic DTO v2 models for manager API domains: config (Dotfile, BootstrapScript), etcd (ConfigKey, ResourceMetadata), system (SystemVersion), infra (ScalingGroup, ResourcePreset, Usage, Watcher, ContainerRegistry), and operations (ErrorLog, ManagerStatus, Announcement, SchedulerOps, SessionEvents). (#10255)
- Add Pydantic DTO v2 models for `scaling_group`, `resource_group`, `resource_slot`, and `resource_policy` domains under `common/dto/manager/v2/`, including request (Input), response (Node/Payload) models with SENTINEL pattern, and comprehensive unit tests for all four domains. (#10256)
- Add Pydantic DTO v2 models for `event_stream`, `streaming`, and `export` domains under `src/ai/backend/common/dto/manager/v2/`, with comprehensive unit tests for all three packages. (#10257)
- Add Pydantic DTO v2 models for `session`, `compute_session`, and `agent` domains under `src/ai/backend/common/dto/manager/v2/`, following the Input/Node/Payload naming convention with nested sub-models for semantic field grouping. (#10258)
- Add Pydantic DTO v2 models for user, domain, and group entities under `common/dto/manager/v2/`, with nested sub-models, SENTINEL pattern for nullable-clearable update fields, and comprehensive unit tests. (#10259)
- Add Pydantic DTO v2 models for `image`, `scheduling_history`, and `auto_scaling_rule` domains with full unit test coverage. (#10260)
- Add Pydantic DTO v2 models for vfolder, object_storage, quota_scope, and storage domains under `ai.backend.common.dto.manager.v2`, with comprehensive unit tests for each domain. (#10261)
- Add Pydantic DTO v2 models for `artifact`, `artifact_registry`, and `container_registry` domains under `common/dto/manager/v2/`, including typed Input, Node, and Payload models with full unit test coverage. (#10262)
- Add Pydantic DTO v2 models (`types.py`, `request.py`, `response.py`, `__init__.py`) for `deployment`, `model_serving`, and `service_catalog` manager API domains, with comprehensive unit tests. (#10263)
- Add Pydantic DTO v2 models for `notification`, `error_log`, `fair_share`, and `prometheus_query_preset` domains under `src/ai/backend/common/dto/manager/v2/`, with comprehensive unit tests for each domain. (#10264)
- Use deploying-revision image for new route session creation (#10271)
- Integrate pyinfra deployment framework from backend.ai-installer into the unified install package, enabling production deployment via PyInfra alongside existing Docker-based development setup.
  Key additions:
  - PyInfra framework (runner, configs, os_packages) with enterprise config schemas (enabled=False in OSS)
  - OSS deploy scripts (os, halfstack, cores, monitor) - 318 files, 82K+ lines
  - TUI PACKAGE mode now offers choice: Release Package (existing) or Production Deployment (PyInfra)
  - Horizontal card layout with keyboard navigation for deployment type selection (#10275)
- Consolidate deploying handlers and remove unused sub-steps (#10276)
- Migrate GraphQL layer to Pydantic-backed types by introducing PydanticNodeMixin, domain Adapters, and @strawberry.experimental.pydantic.input across all GQL domains. (#10299)
- Add `update_deployment_policy` GQL mutation (#10300)
- Add internal health endpoint (`/health`) to the manager's internal app, and simplify the public health handler to a plain liveness probe. (#10308)
- Add `update_my_keypair` GQL mutation to allow users to toggle their keypair's active state (`is_active`) (#10309)
- Support resolving session entities in RBAC entity and permission scope queries (#10320)
- Add `execute_bulk_purger_partial()` function to support partial failure handling for bulk delete operations with savepoint-based transaction isolation (#10332)
- Add `PROJECT_ADMIN_PAGE` and `DOMAIN_ADMIN_PAGE` guarded RBAC entities for admin page access control. (#10334)
- Add repository-layer support for filtering role assignments by permissions via PermissionConditions and exists_permission_combined (#10397)
- Add UUID-based single-entity User CRUD (create/update/delete/purge) to the GraphQL v2 API, resolving six previously stubbed mutations. (#10403)
- Add `my_keypairs` GraphQL query to list the current user's keypairs with filter, orderBy, and cursor/offset pagination support. (#10404)
- Remove `rollback_on_failure` from DB schema, API, and related code (#10410)
- Add `last_used_at` real column to images table, replacing the computed subquery with a direct DB column updated o...
26.3.3
Fixes
- Fix ON CONFLICT column mismatch in vfolder invitation RBAC remigration causing InvalidColumnReferenceError during alembic upgrade. (#10471)
Full Changelog
Check out the full changelog until this release (26.3.3).
Full Commit Logs
Check out the full commit logs between release (26.3.2) and (26.3.3).
26.3.2
Fixes
- Add safe Prometheus metric wrappers to prevent mmap error propagation into business logic (#10395)
- Remove duplicate `debug` field in the webserver's `config.toml.j2` template (#10423)
- Add missing OpenTelemetrySpec initialization in the manager, enabling trace and log export to the OTEL Collector. (#10439)
Full Changelog
Check out the full changelog until this release (26.3.2).
Full Commit Logs
Check out the full commit logs between release (26.3.1) and (26.3.2).
26.3.1
Features
- Add AND, OR, NOT logical operators to GraphQL filter types for complex boolean filter expressions. (#10250)
- Add internal health endpoint (`/health`) to the manager's internal app, and simplify the public health handler to a plain liveness probe. (#10308)
Improvements
- Add `TimeoutSeconds` annotated type to centralize and simplify session timeout validation in request DTOs. (#10267)
Fixes
- Fix global container registry RBAC migration to map to project scopes instead of domain scopes (#10082)
- Fix resource preset check returning incorrect occupancy when scaling groups have no active sessions (#10268)
- Fix session dependency GraphQL dataloaders returning empty results due to incorrect key mapping and missing eager loading (#10280)
- Restore `db` and `config_provider` access for webapp plugins (OpenID, TOTP) after DI refactoring by injecting them into the root app context (#10292)
- Add otp field to AuthorizeRequest and AuthorizeAction for TOTP two-factor authentication compatibility. (#10305)
- Exclude unmeasurable metrics from utilization idle check instead of treating stat collection failures as 0% usage (#10316)
- Restore `etcd` and `valkey_stat` access for webapp plugins (Cloud) after DI refactoring by injecting them into the root app context (#10318)
- Route authenticated TOTP endpoints through web_handler instead of anonymous handler (#10345)
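The idle-check fix in #10316 comes down to distinguishing "not measured" from "0% used". The sketch below shows that distinction under stated assumptions: `is_idle`, the metric names, and the choice to return `None` when nothing could be measured are hypothetical and are not taken from the manager's code.

```python
from typing import Mapping, Optional


def is_idle(
    utilization: Mapping[str, Optional[float]],
    thresholds: Mapping[str, float],
) -> Optional[bool]:
    """Return True/False when at least one metric was measured, else None.

    A value of None means the stat collector could not measure the metric.
    """
    measured = {
        name: value
        for name, value in utilization.items()
        if value is not None and name in thresholds
    }
    if not measured:
        return None  # nothing measurable: give no verdict instead of "idle"
    # Idle only if every measured metric is below its threshold.
    return all(value < thresholds[name] for name, value in measured.items())


THRESHOLDS = {"cpu_util": 10.0, "cuda_util": 10.0}
```

Treating a failed measurement as `0.0` would make `{"cpu_util": None, "cuda_util": None}` look fully idle; in this sketch the same input yields no verdict, and a partially measured session is judged only on the metrics that were actually collected.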
Full Changelog
Check out the full changelog until this release (26.3.1).
Full Commit Logs
Check out the full commit logs between release (26.3.0) and (26.3.1).
25.11.4
Features
- Execute Resource Usage Recalculation periodically (#5646)
Improvements
- Add per-plugin timeout (120s) to `gather_container_measures` calls so a single hung plugin does not block stat collection from all other plugins (#9781)
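A per-plugin timeout of this kind can be sketched with `asyncio.wait_for`. The helper below is an illustration only: `gather_with_timeout`, the plugin callables, and the decision to return an empty list for a timed-out plugin are assumptions, not the agent's actual implementation.

```python
import asyncio
from typing import Awaitable, Callable, Mapping

PLUGIN_TIMEOUT = 120.0  # seconds, the value named in the release note


async def gather_with_timeout(
    plugins: Mapping[str, Callable[[], Awaitable[list]]],
    timeout: float = PLUGIN_TIMEOUT,
) -> dict[str, list]:
    """Collect from every plugin concurrently; a hung plugin yields an empty result."""

    async def collect(name: str, fn: Callable[[], Awaitable[list]]) -> tuple[str, list]:
        try:
            return name, await asyncio.wait_for(fn(), timeout)
        except asyncio.TimeoutError:
            return name, []  # skip the hung plugin, keep the others

    pairs = await asyncio.gather(*(collect(n, f) for n, f in plugins.items()))
    return dict(pairs)


async def fast() -> list:
    return [1, 2]


async def hung() -> list:
    await asyncio.sleep(3600)  # simulates a plugin that never returns in time
    return []


def demo() -> dict[str, list]:
    # A short timeout keeps the demonstration quick.
    return asyncio.run(gather_with_timeout({"cpu": fast, "gpu": hung}, timeout=0.05))
```

Because each plugin call is wrapped individually, the timeout of one plugin cancels only that plugin's coroutine and the results of the others are still returned.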
Fixes
- Fix wrong value type of Valkey client address (#5649)
- Remove the unnecessary `asyncio.Lock` from `StatContext` as self-concurrency is already prevented by `TimerDelayPolicy.CANCEL` and each collect method operates on independent data structures (#9256)
- Fix container net_rx/net_tx stats reading host namespace counters due to unchecked setns() return value (#9681)
- Pre-validate namespace path before `netstat_ns()` to prevent thread pool exhaustion from hung threads on stale network namespaces (#9782)
- Exclude unmeasurable metrics from utilization idle check instead of treating stat collection failures as 0% usage (#10316)
Full Changelog
Check out the full changelog until this release (25.11.4).
Full Commit Logs
Check out the full commit logs between release (25.11.3) and (25.11.4).
26.3.0
Features
Client SDK v2
Delivered a complete rewrite of the Backend.AI client library as SDK v2, providing a typed async HTTP client with injectable authentication, domain-specific client classes covering every API surface area, WebSocket/SSE streaming support, and Pydantic-typed request/response DTOs across all domains.
- Add SDK v2 base architecture with async HTTP client, injectable auth, typed exceptions, and domain client stubs (#8903)
- Add 204 No Content response support and `typed_request_no_content()` method to SDK v2 base client (#8936)
- Add WebSocket and SSE connection support to SDK v2 `BackendAIClient` (#8937)
- Add binary and multipart request support (upload/download) to SDK v2 base client and restore deferred session file operations (#8952)
- Add anonymous client to SDK v2 for unauthenticated endpoints (#9478)
- Add `extra_headers` support to `BackendAIAnonymousClient`, allowing per-request header injection (e.g. `X-Forwarded-For`, `X-Forwarded-Host`, `X-Forwarded-Proto`) when proxying requests to the manager (#9594)
- Register a persistent `BackendAIClientRegistry` on the webserver and use it for the `update-password-no-auth` API handler (#9595)
- Implement SDK v2 `AuthClient` with async methods for all auth domain REST endpoints (#8913)
- Add SDK v2 client for Config and Infrastructure domains with full async REST API coverage (#8914)
- Add SDK v2 client for Model Serving domain with all 14 API methods (list, search, create, delete, scale, sync, routing, token, errors, runtimes) (#8915)
- Implement SDK v2 `SessionClient` with typed async methods for all session management endpoints (#8916)
- Add SDK v2 domain clients for container registry and storage endpoints (#8918)
- Add SDK v2 `VFolderClient` with full virtual folder management endpoint coverage including CRUD, file operations, sharing/invitations, and admin operations (#8919)
- Add SDK v2 client for Template and Operations domains with full CRUD support (#8912)
- Add SDK v2 `StreamingClient` for WebSocket (terminal, execute, proxy) and SSE (session events, background task events) operations (#8939)
- Add SDK v2 `NotificationClient` for notification channel and rule management (#8940)
- Add `FairShareClient` for SDK v2 covering all fair-share scheduling endpoints including domain/project/user fair shares, usage buckets, weight management, and resource group spec operations (#8941)
- Add SDK v2 clients for scheduling_history, agent, and compute_session domains (#8942)
- Add `RBACClient` for RBAC domain in SDK v2, covering role management, role assignment, scope management, and entity management (#8944)
- Add SDK v2 `DeploymentClient` covering deployment CRUD, revision management, and route operations with typed request/response DTOs (#8945)
- Add `ExportClient` to SDK v2 covering CSV data export and report endpoints (#8963)
- Add group/project CRUD and member management methods to SDK v2 `GroupClient` (#8964)
- Add `ArtifactClient` for artifact lifecycle management in SDK v2, covering import, update, cleanup, approval/rejection, and query endpoints (#8968)
- Add `ResourcePolicyClient` to SDK v2 for managing keypair, project, and user resource policies with full CRUD and search operations (#8969)
- Add `ArtifactRegistryClient` to SDK v2 for registry scanning and federation operations (#8970)
- Expand `AgentClient` and `ComputeSessionClient` in SDK v2 with detail retrieval and resource stats methods (#8971)
- Add dedicated DTO package for `ContainerRegistryClient` in SDK v2 (#8983)
- Add `ACLClient` with DTO for permission queries in SDK v2 (#8984)
- Add `ErrorLogClient` domain client in SDK v2 for error log management (append, list, mark-cleared) (#8985)
- Add `ScalingGroupClient` with DTO for scaling group queries in SDK v2 (#8986)
- Add `ObjectStorageClient` to SDK v2 for presigned URL generation and bucket management (#8987)
- Add `EventStreamClient` for SSE-based session and background task event streaming in SDK v2 (#8988)
- Add Domain CRUD client to Client SDK v2 with create, get, search, update, delete, and purge operations (#9059)
- Add System version info client (`SystemClient.get_versions()`) to Client SDK v2 (#9060)
- Add `QuotaScope` management to Client SDK v2 with get, search, set_quota, and unset_quota operations (#9061)
- Add Image management client to Client SDK v2 with search, get, rescan, alias, dealias, forget, and purge operations (#9062)
- Add User admin CRUD client (create, get, search, update, delete, purge) to Client SDK v2 (#9063)
- Add `ServiceAutoScalingRule` CRUD client to Client SDK v2 (#9064)
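The phrase "injectable authentication" in the SDK v2 summary above can be illustrated with a small sketch. Nothing here is the SDK's real API: `AuthStrategy`, `StaticTokenAuth`, `NoAuth`, and `SketchClient` are hypothetical names, the bearer-token header is only a placeholder scheme, and the client builds a request description instead of sending anything over the network.

```python
import asyncio
from typing import Protocol


class AuthStrategy(Protocol):
    """Hypothetical interface: anything that can produce auth headers."""

    def sign(self, method: str, path: str) -> dict[str, str]: ...


class StaticTokenAuth:
    """Placeholder scheme for the example; not Backend.AI's signing method."""

    def __init__(self, token: str) -> None:
        self._token = token

    def sign(self, method: str, path: str) -> dict[str, str]:
        return {"Authorization": f"Bearer {self._token}"}


class NoAuth:
    """Strategy for unauthenticated endpoints."""

    def sign(self, method: str, path: str) -> dict[str, str]:
        return {}


class SketchClient:
    """Hypothetical client: describes the request rather than performing it."""

    def __init__(self, endpoint: str, auth: AuthStrategy) -> None:
        self._endpoint = endpoint.rstrip("/")
        self._auth = auth  # injected, so callers can swap strategies

    async def request(self, method: str, path: str) -> dict:
        await asyncio.sleep(0)  # placeholder for real network I/O
        return {
            "method": method,
            "url": self._endpoint + path,
            "headers": self._auth.sign(method, path),
        }
```

The design point is that the client never decides how to authenticate; an authenticated and an anonymous client differ only in the strategy object passed to the constructor.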
REST API Handler Migration
Migrated all 39 API handler modules from module-level functions to a unified Handler class pattern with constructor dependency injection, Pydantic DTOs, and APIResponse wrappers via RouteRegistry, establishing a consistent foundation for the v2 REST API layer.
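As a rough picture of the pattern described above, the following sketch shows a handler class that receives its dependencies through the constructor and registers its own routes. All names are hypothetical: `ToyRouteRegistry` is a toy stand-in and not the real `RouteRegistry`, `PingResponse` is a dataclass standing in for a Pydantic DTO, and `GreetingService` is an invented dependency.

```python
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class PingResponse:
    """Stand-in for a Pydantic response DTO."""

    message: str


class GreetingService:
    """Hypothetical dependency that a handler needs."""

    def greet(self, name: str) -> str:
        return f"hello, {name}"


class ToyRouteRegistry:
    """Toy registry: maps (method, path) to a handler coroutine function."""

    def __init__(self) -> None:
        self.routes: dict[tuple[str, str], Callable[..., Awaitable[object]]] = {}

    def add(self, method: str, path: str, handler: Callable[..., Awaitable[object]]) -> None:
        self.routes[(method, path)] = handler


class PingHandler:
    def __init__(self, service: GreetingService) -> None:
        self._service = service  # injected once, not looked up per request

    async def ping(self, name: str) -> PingResponse:
        return PingResponse(message=self._service.greet(name))

    def register_routes(self, registry: ToyRouteRegistry) -> None:
        registry.add("GET", "/ping", self.ping)
```

Compared with module-level handler functions, the dependencies are explicit in the constructor signature, which makes it straightforward to substitute a fake service in tests.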
- Add `RouteRegistry` class with per-route middleware support, enabling unified route registration as a foundation for handler migration (#9407)
- Convert auth handlers to the new `ApiHandler` pattern with typed parameters and automatic request extraction via `RouteRegistry` (#9410)
- Integrate automatic `api_handler` wrapping into `RouteRegistry.add()`, so all registered handlers get typed parameter extraction and `APIResponse` conversion without explicit decorators (#9414)
- Migrate resource handlers to constructor DI pattern with `ResourceHandler` class and `register_routes()` (#9453)
- Migrate session handlers to the Handler class pattern with constructor DI, Pydantic DTOs, and `APIResponse`, keeping `create_app()` as a backward-compatible shim (#9455)
- Migrate stream and events handlers to Handler class pattern with constructor DI and Pydantic DTOs (#9444)
- Migrate userconfig, domainconfig, and groupconfig handlers to the Handler class pattern with Pydantic DTOs and `APIResponse` (#9445)
- Migrate cluster_template and session_template handlers to Handler class pattern with Pydantic DTOs and `APIResponse` (#9446)
- Migrate service (model serving) handlers from module-level functions to the new `ApiHandler` class pattern with constructor DI, Pydantic DTOs, and `APIResponse` (#9447)
- Migrate scaling_group, acl, logs, and ratelimit handlers to `ApiHandler` class pattern with Pydantic DTOs and `APIResponse` (#9448)
- Migrate admin, spec, and group handlers to Handler class pattern with constructor DI, Pydantic DTOs, and `APIResponse` (#9449)
- Migrate resource and manager handlers to Handler class pattern with Pydantic DTOs and `APIResponse` (#9450)
- Migrate etcd and container_registry handlers to Handler class pattern with Pydantic DTOs, `APIResponse`, and `RouteRegistry` (#9456)
- Migrate session and deployment handlers to constructor DI pattern with backward-compatible `create_app()` shims (#9457)
- Migrate RBAC handlers to constructor dependency injection pattern with `register_routes()` entry point, keeping `create_app()` as a backward-compatible shim (#9458)
- Migrate vfolder handlers from module-level functions to `VFolderHandler` class with constructor DI, Pydantic DTOs, and `APIResponse` (#9459)
- Migrate artifact and artifact_registry handlers ...