
feat: bbolt persistent pod device configs#115

Open
rbtr wants to merge 4 commits into kubernetes-sigs:main from rbtr:feat/bbolt-persistence

Conversation

@rbtr
Contributor

@rbtr rbtr commented Mar 27, 2026

Replace the in-memory-only store with a bbolt-backed persistent store, using a layered write-through cache architecture: the in-memory PodConfigStore remains the source of truth for all reads, and an optional Checkpointer interface serves as a write-through persistence backend. This follows the kubelet DRA checkpoint pattern (pkg/kubelet/cm/dra/state).

  • Define a Checkpointer interface (GetOrCreate, Store, DeletePod, Close) as the persistence contract. Add a boltCheckpointer implementation.
  • Modify PodConfigStore to accept an optional Checkpointer. Writes persist to the checkpointer before updating memory; if persistence fails, memory is unchanged. Deletes proceed with memory removal regardless — stale checkpoint entries are harmless and cleaned up by Synchronize on the next startup.
  • Reads (GetPodConfig, GetDeviceConfig) are served from memory in the NRI hot path.
  • Ephemeral state (LastNRIActivity) stays entirely in-memory in the PodConfigStore and never reaches the persistence layer.
  • Wire persistence into the driver lifecycle: WithDBPath option configures the bolt store in Start(), and Close() shuts down the database.
  • Configure the database path as /var/run/dranet/dranet.db and add a corresponding hostPath volume (DirectoryOrCreate) in install.yaml. Set --db-path to empty to disable persistence.

Fixes: #89

@k8s-ci-robot k8s-ci-robot requested a review from aojea March 27, 2026 22:22
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Mar 27, 2026
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 27, 2026
@k8s-ci-robot
Contributor

Hi @rbtr. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Mar 27, 2026
@aojea
Contributor

aojea commented Mar 28, 2026

/ok-to-test

I wonder if it would be better organized if we define a new Store or Storage model that implements the interface, so we can use any technology and keep the code better organized.

There is also the need to synchronize the information: during a dranet restart the state may have changed, so we need to be able to guarantee that the cluster, node, and db state are in sync.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 28, 2026
@rbtr
Contributor Author

rbtr commented Mar 30, 2026

I wonder if it would be better organized if we define a new Store or Storage model that implements the interface, so we can use any technology and keep the code better organized.

I think, besides the names, this is already set up in this direction. It's behind an interface; new buckets can be created under the same bolt and once there is a need for expanded storage interfaces it can be refactored and expanded accordingly. I can explore that refactor now if you would prefer but I think it's mostly rearranging/renaming and not a functional change here.

There is also the need to synchronize the information: during a dranet restart the state may have changed, so we need to be able to guarantee that the cluster, node, and db state are in sync.

Added an init state synchronization; lmk what you think.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 31, 2026
@gauravkghildiyal
Member

/assign

@gauravkghildiyal gauravkghildiyal left a comment


Thanks for the patience @rbtr! This is looking really cool

Haven't reviewed the tests yet, but I thought I'd share this first. I was thinking about the PodConfigStore setup: instead of the current "either-or" approach, where we choose between an in-memory or a Bolt implementation at startup, what if we have one in-memory store that handles all the logic and just uses a persistence backend (Bolt being an implementation of that persistence-backend interface)?

Something like: driver -> PodConfigStore -> BoltStore

There are a couple of reasons why I'm thinking of structuring things that way:

  1. GetPodConfig is in the hot path of all NRI hooks. This function is a bottleneck for RunPodSandbox, CreateContainer, and StopPodSandbox, which are currently invoked for each and every pod on the node, regardless of whether it uses DRA or not. In the current BoltPodConfigStore implementation, we are forced to fetch from the DB and perform data unmarshalling for each of these operations. Having a local cache (like PodConfigStore) would allow us to serve these requests from RAM instantly.

  2. As things are implemented right now, we are forced to implement ephemeral state like NRIActivity into the entire BoltPodConfigStore. This mixing of concerns smells like it needs some better structuring. If we were to have PodConfigStore consume a persistent storage like BoltStore, we could avoid implementing this in the persistence layer entirely and keep the ephemeral state strictly in the top-level store.

Also, it looks like Kubelet does something similar for DRA in claiminfo.go. Basically keeping the state in RAM and using the disk as a backup so it survives restarts.

In the proposed implementation, I imagine:

  • Most of the reads in this case will be served directly from memory.
  • The interface between PodConfigStore -> BoltStore will be quite minimal

if err != nil {
	return err
}
data, err := json.Marshal(config)

I'll recommend us to have another bucket (or a key-value pair if that serves better) to group the deviceConfigs, rather than storing it directly within the podUID bucket.

So instead of pod_configs => <POD_UID> => <ListOfDevices>, we'd have pod_configs => <POD_UID> => device_configs => <ListOfDevices>.

Rationale being that it's very likely we'll need a pod-level config as well, which is not specific to a device. (Think of something like NRIActivity, which is why the existing PodConfigStore is also structured in an equivalent manner.)
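Under that suggestion, the on-disk bucket layout might look like the following sketch (the pod-level key is hypothetical, standing in for whatever non-device config is added later):

```
pod_configs                          (top-level bucket)
└── <POD_UID>                        (per-pod bucket)
    ├── device_configs               (nested bucket)
    │   ├── <deviceName> → JSON(DeviceConfig)
    │   └── ...
    └── <hypothetical pod-level key> → JSON(pod-level config)
```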

rbtr added 3 commits April 7, 2026 14:26
The DRANET driver maintains a podConfigStore that tracks per-pod network
and device configurations. This state is populated during the DRA
NodePrepareResource phase and consumed by NRI hooks (RunPodSandbox,
CreateContainer) to inject devices and apply network config. Because the
store was purely in-memory, a daemon restart between these two phases
caused the NRI hooks to silently skip device injection — pods would run
without their intended network configuration.

Replace the in-memory-only store with a bbolt-backed persistent store:

- Extract a podConfigStorer interface from the existing PodConfigStore,
  with methods for SetDeviceConfig, GetDeviceConfig, GetPodConfig,
  DeletePod, DeleteClaim, UpdateLastNRIActivity, and
  GetPodNRIActivities.

- Add BoltPodConfigStore, a new implementation backed by bbolt that
  persists DeviceConfig entries as JSON in a "pod_configs" bucket keyed
  by "podUID/deviceName". LastNRIActivity remains ephemeral in memory
  since it is only used for graceful shutdown coordination.

- Add json struct tags to DeviceConfig, RDMAConfig, and LinuxDevice to
  enable serialization. The apis.NetworkConfig types already had them.

- Wire persistence into the driver lifecycle: WithDBPath option selects
  the bolt store in Start(), and Stop() closes the database.

- Configure the database path as /var/run/dranet/pod_configs.db and add
  a corresponding hostPath volume (DirectoryOrCreate) in install.yaml.

- The in-memory PodConfigStore remains available and is used by default
  in tests and when no DB path is configured.

Fixes: kubernetes-sigs#89

Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
@rbtr rbtr force-pushed the feat/bbolt-persistence branch from 8e31ffe to 2f93b2d Compare April 7, 2026 16:00
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 7, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rbtr
Once this PR has been reviewed and has the lgtm label, please ask for approval from gauravkghildiyal. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Refactor the pod config persistence to use a layered architecture
following the kubelet DRA checkpoint pattern (pkg/kubelet/cm/dra/state):

- Define a Checkpointer interface as the minimal persistence contract
  (GetOrCreate, Store, DeletePod, Close). The in-memory PodConfigStore
  is the source of truth; the Checkpointer is a write-through backend.

- Refactor BoltPodConfigStore into an unexported boltCheckpointer that
  implements only the Checkpointer interface.

Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
@rbtr rbtr force-pushed the feat/bbolt-persistence branch from 2f93b2d to 92be948 Compare April 7, 2026 16:34
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 7, 2026
@rbtr
Contributor Author

rbtr commented Apr 7, 2026

@gauravkghildiyal thanks for this feedback. I agree with your suggestions here and have refactored the proposal into more of a "write-through cache" pattern. It's basically modeled on that kubelet precedent; even though bolt is probably fast enough to serve reads directly, I think it's a better architecture this way. LMK what you think.



Development

Successfully merging this pull request may close these issues.

[Reliability] In-memory state loss during DRANET restarts leads to pod configuration bypass
