Skip to content

Ref blockmgr4#1861

Draft
Kang-Meng wants to merge 3 commits into
jd-opensource:mainfrom
Kang-Meng:ref_blockmgr4
Draft

Ref blockmgr4#1861
Kang-Meng wants to merge 3 commits into
jd-opensource:mainfrom
Kang-Meng:ref_blockmgr4

Conversation

@Kang-Meng

Copy link
Copy Markdown
Collaborator

Description

Related Issues

Change Type

  • Bug fix
  • New feature
  • Performance improvement
  • Refactor
  • Documentation
  • Test
  • Build or CI

Pull Request Checklist

Thank you for contributing to xLLM. Before requesting review, please make sure the following items are complete.

PR Title and Commit Messages

  • The PR title and each commit message follow the xLLM commit format: <type>: <subject>.

Allowed types: feat, bugfix, docs, test, refactor, chore, style, revert, perf, model, build, release.
The subject should use clear English, start with a verb, include at least 4 words, and end with ..

Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit or an equivalent command.
  • I have installed the hooks with pre-commit install.
  • I have run pre-commit run --all-files and fixed any reported issues.

If you are unsure how to set up pre-commit, see the pre-commit documentation.

Self Review

  • I have self-reviewed the code according to .agents/skills/code-review/references/custom-code-style.md, especially code written or assisted by AI.
  • I have rebased this PR onto the latest main branch.

Build and Test Coverage

  • Tests have been added or updated as needed.
  • CUDA: python setup.py build test has passed on a CUDA machine.
  • NPU: python setup.py build test has passed on an NPU machine.
  • MLU: python setup.py build test has passed on an MLU machine.

Reviewer Notes

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request re-enables and refactors the hierarchical KV cache transfer and block manager pool, which were temporarily disabled during a block-manager refactor. It introduces platform-independent abstractions like BatchMemcpy and LayerSynchronizer (with NPU implementations), updates various KVCacheImpl subclasses to support host cache allocation, and integrates these components back into WorkerImpl and LLMEngine. The code review feedback focuses on enforcing repository style guides, including reserving vector capacity, using fixed-width integers instead of plain int, annotating constant arguments with parameter names, avoiding relative include paths, avoiding auto for primitive types, and adding defensive bounds checks for tensor indexing.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +30 to +41
std::vector<HierarchyKVCacheTransfer::LayerBatchRange> ranges;
if (num_layers <= 0) {
return ranges;
}

uint32_t layers_per_batch =
requested_batches == 0
? static_cast<uint32_t>(num_layers)
: static_cast<uint32_t>(num_layers) / requested_batches;
layers_per_batch = std::max<uint32_t>(layers_per_batch, 1);

constexpr uint64_t MBUF_SIZE = 128 * 1024 * 1024;
constexpr uint32_t BATCH_COPY_MAX_SIZE = 4096;
constexpr uint32_t TIMEOUT_S = 60; // second
constexpr uint32_t TIMEOUT_MS = 60000; // millisecond
for (int64_t begin = 0; begin < num_layers; begin += layers_per_batch) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

According to the repository style guide, we should always call reserve() before filling a std::vector when the size is known or can be estimated. In this case, the size of ranges can be estimated as (num_layers + layers_per_batch - 1) / layers_per_batch.

  std::vector<HierarchyKVCacheTransfer::LayerBatchRange> ranges;
  if (num_layers <= 0) {
    return ranges;
  }

  uint32_t layers_per_batch =
      requested_batches == 0
          ? static_cast<uint32_t>(num_layers)
          : static_cast<uint32_t>(num_layers) / requested_batches;
  layers_per_batch = std::max<uint32_t>(layers_per_batch, 1);

  ranges.reserve((static_cast<size_t>(num_layers) + layers_per_batch - 1) / layers_per_batch);
  for (int64_t begin = 0; begin < num_layers; begin += layers_per_batch) {
References
  1. Always reserve() before filling a std::vector when the size is known or can be estimated. (link)

Comment on lines +141 to +142
for (int i = 0; i < load_threadpool_->size() + offload_threadpool_->size();
++i) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

According to the repository style guide, we should use fixed-width integers (int32_t, int64_t) or standard unsigned types like size_t instead of plain int. Since size() returns size_t, using size_t for the loop counter avoids signed/unsigned comparison warnings.

Suggested change
for (int i = 0; i < load_threadpool_->size() + offload_threadpool_->size();
++i) {
for (size_t i = 0; i < load_threadpool_->size() + offload_threadpool_->size();
++i) {
References
  1. Use fixed-width integers (int32_t, int64_t) instead of plain int, unless the API you are calling explicitly requires int. (link)

Comment on lines +194 to +198
.enable_xtensor(false)
.enable_raw_device_allocator(false)
.host_blocks_factor(options_.host_blocks_factor());
#if defined(USE_NPU)
aclrtHostUnregister(page_aligned_data_);
host_opts.enable_kv_cache_huge_page_allocator(false);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

According to the repository style guide, constant arguments must be annotated with a comment indicating the parameter name when calling functions or constructors.

Suggested change
.enable_xtensor(false)
.enable_raw_device_allocator(false)
.host_blocks_factor(options_.host_blocks_factor());
#if defined(USE_NPU)
aclrtHostUnregister(page_aligned_data_);
host_opts.enable_kv_cache_huge_page_allocator(false);
.enable_xtensor(/*enable_xtensor=*/false)
.enable_raw_device_allocator(/*enable_raw_device_allocator=*/false)
.host_blocks_factor(options_.host_blocks_factor());
#if defined(USE_NPU)
host_opts.enable_kv_cache_huge_page_allocator(/*enable_kv_cache_huge_page_allocator=*/false);
#endif
References
  1. Annotate constant arguments with a comment indicating the parameter name when calling functions or constructors. (link)

Comment on lines +274 to +278
CHECK_LT(host_block_id, host_tensor.size(0))
<< "host block id out of range.";
torch::Tensor device_block = device_tensor[device_block_id];
torch::Tensor host_block_layer =
host_tensor[host_block_id][static_cast<int64_t>(layer_slot)];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

For defensive programming and correctness, we should check that device_block_id is non-negative and within the bounds of device_tensor.size(0) before accessing it, similar to how host_block_id is checked.

        CHECK_LT(host_block_id, host_tensor.size(0))
            << "host block id out of range.";
        CHECK_GE(device_block_id, 0) << "device block id must be non-negative.";
        CHECK_LT(device_block_id, device_tensor.size(0))
            << "device block id out of range.";
        torch::Tensor device_block = device_tensor[device_block_id];
        torch::Tensor host_block_layer =
            host_tensor[host_block_id][static_cast<int64_t>(layer_slot)];
References
  1. Ensure appropriate bounds and validity checks exist before accessing array/tensor elements.

#include <algorithm>

#include "block_manager_impl.h"
#include "composite_block_manager.h"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

According to the repository style guide, relative paths in #include are not allowed. Always use project-root-relative paths.

Suggested change
#include "composite_block_manager.h"
#include "core/framework/block/composite_block_manager.h"
References
  1. No relative paths in #include. Always use project-root-relative paths. (link)

Comment on lines +39 to +40
auto ret = aclrtRecordEvent(*impl_.get_event(layer_index),
stream->get_stream()->stream());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

According to the repository style guide, we should not use auto for simple/primitive types. aclError is a simple/primitive type (typedef of uint32_t or enum), so it should be explicitly typed.

Suggested change
auto ret = aclrtRecordEvent(*impl_.get_event(layer_index),
stream->get_stream()->stream());
const aclError ret = aclrtRecordEvent(*impl_.get_event(layer_index),
stream->get_stream()->stream());
References
  1. Do not use auto for simple/primitive types. (link)

CHECK(base_ptr != MAP_FAILED)
<< "Failed to mmap host memory, size=" << total_bytes;
if (mlock(base_ptr, total_bytes) != 0) {
const int err = errno;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

According to the repository style guide, we should use fixed-width integers (int32_t, int64_t) instead of plain int.

Suggested change
const int err = errno;
const int32_t err = errno;
References
  1. Use fixed-width integers (int32_t, int64_t) instead of plain int, unless the API you are calling explicitly requires int. (link)

count,
attrs,
attrs_indexes,
1,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

According to the repository style guide, constant arguments must be annotated with a comment indicating the parameter name when calling functions or constructors.

Suggested change
1,
/*attrsCount=*/1,
References
  1. Annotate constant arguments with a comment indicating the parameter name when calling functions or constructors. (link)

.host_blocks_factor(options_.host_blocks_factor())
.layers_wise_copy_batchs(options_.layers_wise_copy_batchs())
.enable_mla(options_.enable_mla())
.enable_kvcache_store(false);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

According to the repository style guide, constant arguments must be annotated with a comment indicating the parameter name when calling functions or constructors.

Suggested change
.enable_kvcache_store(false);
.enable_kvcache_store(/*enable_kvcache_store=*/false);
References
  1. Annotate constant arguments with a comment indicating the parameter name when calling functions or constructors. (link)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant