Ref blockmgr4#1861
Code Review
This pull request re-enables and refactors the hierarchical KV cache transfer and block manager pool, which were temporarily disabled during a block-manager refactor. It introduces platform-independent abstractions like BatchMemcpy and LayerSynchronizer (with NPU implementations), updates various KVCacheImpl subclasses to support host cache allocation, and integrates these components back into WorkerImpl and LLMEngine. The code review feedback focuses on enforcing repository style guides, including reserving vector capacity, using fixed-width integers instead of plain int, annotating constant arguments with parameter names, avoiding relative include paths, avoiding auto for primitive types, and adding defensive bounds checks for tensor indexing.
    std::vector<HierarchyKVCacheTransfer::LayerBatchRange> ranges;
    if (num_layers <= 0) {
      return ranges;
    }

    uint32_t layers_per_batch =
        requested_batches == 0
            ? static_cast<uint32_t>(num_layers)
            : static_cast<uint32_t>(num_layers) / requested_batches;
    layers_per_batch = std::max<uint32_t>(layers_per_batch, 1);

    constexpr uint64_t MBUF_SIZE = 128 * 1024 * 1024;
    constexpr uint32_t BATCH_COPY_MAX_SIZE = 4096;
    constexpr uint32_t TIMEOUT_S = 60;      // second
    constexpr uint32_t TIMEOUT_MS = 60000;  // millisecond
    for (int64_t begin = 0; begin < num_layers; begin += layers_per_batch) {
According to the repository style guide, we should always call reserve() before filling a std::vector when the size is known or can be estimated. In this case, the size of ranges can be estimated as (num_layers + layers_per_batch - 1) / layers_per_batch.
    std::vector<HierarchyKVCacheTransfer::LayerBatchRange> ranges;
    if (num_layers <= 0) {
      return ranges;
    }
    uint32_t layers_per_batch =
        requested_batches == 0
            ? static_cast<uint32_t>(num_layers)
            : static_cast<uint32_t>(num_layers) / requested_batches;
    layers_per_batch = std::max<uint32_t>(layers_per_batch, 1);
    ranges.reserve((static_cast<size_t>(num_layers) + layers_per_batch - 1) / layers_per_batch);
    for (int64_t begin = 0; begin < num_layers; begin += layers_per_batch) {

References
- Always reserve() before filling a std::vector when the size is known or can be estimated.
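The reserve-then-fill pattern from this suggestion can be sketched in isolation. This is a minimal illustration, not the PR's implementation: `LayerBatchRange` here is a hypothetical stand-in for `HierarchyKVCacheTransfer::LayerBatchRange`, `split_layers` is an invented name, and the sketch uses `int64_t` throughout instead of the `uint32_t` casts in the diff.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for HierarchyKVCacheTransfer::LayerBatchRange.
struct LayerBatchRange {
  int64_t begin;  // first layer in the batch (inclusive)
  int64_t end;    // one past the last layer in the batch (exclusive)
};

// Splits [0, num_layers) into consecutive ranges of layers_per_batch layers.
// requested_batches == 0 means "one batch covering every layer".
std::vector<LayerBatchRange> split_layers(int64_t num_layers,
                                          uint32_t requested_batches) {
  std::vector<LayerBatchRange> ranges;
  if (num_layers <= 0) {
    return ranges;
  }

  int64_t layers_per_batch =
      requested_batches == 0 ? num_layers : num_layers / requested_batches;
  layers_per_batch = std::max<int64_t>(layers_per_batch, 1);

  // Ceiling division gives the exact number of ranges the loop will push,
  // so the vector allocates once instead of growing repeatedly.
  ranges.reserve(static_cast<size_t>(
      (num_layers + layers_per_batch - 1) / layers_per_batch));

  for (int64_t begin = 0; begin < num_layers; begin += layers_per_batch) {
    const int64_t end = std::min(begin + layers_per_batch, num_layers);
    ranges.push_back(LayerBatchRange{begin, end});
  }
  return ranges;
}
```

Because the per-batch size comes from floor division, the number of ranges can exceed the requested batch count: in this sketch, 10 layers with 4 requested batches gives 2 layers per batch and therefore 5 ranges, which is why the reserve size is computed from `layers_per_batch` and not from `requested_batches`.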
    for (int i = 0; i < load_threadpool_->size() + offload_threadpool_->size();
         ++i) {
According to the repository style guide, we should use fixed-width integers (int32_t, int64_t) or standard unsigned types like size_t instead of plain int. Since size() returns size_t, using size_t for the loop counter avoids signed/unsigned comparison warnings.
    - for (int i = 0; i < load_threadpool_->size() + offload_threadpool_->size();
    -      ++i) {
    + for (size_t i = 0; i < load_threadpool_->size() + offload_threadpool_->size();
    +      ++i) {

References
- Use fixed-width integers (int32_t, int64_t) instead of plain int, unless the API you are calling explicitly requires int.
        .enable_xtensor(false)
        .enable_raw_device_allocator(false)
        .host_blocks_factor(options_.host_blocks_factor());
    #if defined(USE_NPU)
    aclrtHostUnregister(page_aligned_data_);
    host_opts.enable_kv_cache_huge_page_allocator(false);
According to the repository style guide, constant arguments must be annotated with a comment indicating the parameter name when calling functions or constructors.
    -     .enable_xtensor(false)
    -     .enable_raw_device_allocator(false)
    -     .host_blocks_factor(options_.host_blocks_factor());
    - #if defined(USE_NPU)
    - aclrtHostUnregister(page_aligned_data_);
    - host_opts.enable_kv_cache_huge_page_allocator(false);
    +     .enable_xtensor(/*enable_xtensor=*/false)
    +     .enable_raw_device_allocator(/*enable_raw_device_allocator=*/false)
    +     .host_blocks_factor(options_.host_blocks_factor());
    + #if defined(USE_NPU)
    + host_opts.enable_kv_cache_huge_page_allocator(/*enable_kv_cache_huge_page_allocator=*/false);
    + #endif

References
- Annotate constant arguments with a comment indicating the parameter name when calling functions or constructors.
    CHECK_LT(host_block_id, host_tensor.size(0))
        << "host block id out of range.";
    torch::Tensor device_block = device_tensor[device_block_id];
    torch::Tensor host_block_layer =
        host_tensor[host_block_id][static_cast<int64_t>(layer_slot)];
For defensive programming and correctness, we should check that device_block_id is non-negative and within the bounds of device_tensor.size(0) before accessing it, similar to how host_block_id is checked.
    CHECK_LT(host_block_id, host_tensor.size(0))
        << "host block id out of range.";
    CHECK_GE(device_block_id, 0) << "device block id must be non-negative.";
    CHECK_LT(device_block_id, device_tensor.size(0))
        << "device block id out of range.";
    torch::Tensor device_block = device_tensor[device_block_id];
    torch::Tensor host_block_layer =
        host_tensor[host_block_id][static_cast<int64_t>(layer_slot)];

References
- Ensure appropriate bounds and validity checks exist before accessing array/tensor elements.
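The two-sided check suggested above (non-negative and below the first dimension) can be factored into one helper so the host and device paths validate identically. The sketch below is an assumption-laden illustration: it avoids the glog and libtorch dependencies, and `is_valid_block_id` and `check_block_id` are hypothetical names that do not appear in the PR.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: true when block_id is a valid index into a
// dimension holding num_blocks entries, i.e. 0 <= block_id < num_blocks.
inline bool is_valid_block_id(int64_t block_id, int64_t num_blocks) {
  return block_id >= 0 && block_id < num_blocks;
}

// Hypothetical throwing variant. The PR itself uses CHECK_GE / CHECK_LT,
// which abort the process; an exception is used here only to keep the
// sketch free of external dependencies.
inline void check_block_id(int64_t block_id,
                           int64_t num_blocks,
                           const std::string& name) {
  if (!is_valid_block_id(block_id, num_blocks)) {
    throw std::out_of_range(name + " block id " + std::to_string(block_id) +
                            " is outside [0, " + std::to_string(num_blocks) +
                            ")");
  }
}
```

In the copy routine this would be called once per tensor before indexing, for example `check_block_id(device_block_id, device_tensor.size(0), "device")`, with the signed comparison catching negative ids that a plain upper-bound check would miss.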
    #include <algorithm>

    #include "block_manager_impl.h"
    #include "composite_block_manager.h"
According to the repository style guide, relative paths in #include are not allowed. Always use project-root-relative paths.
    - #include "composite_block_manager.h"
    + #include "core/framework/block/composite_block_manager.h"

References
- No relative paths in #include. Always use project-root-relative paths.
    auto ret = aclrtRecordEvent(*impl_.get_event(layer_index),
                                stream->get_stream()->stream());
According to the repository style guide, we should not use auto for simple/primitive types. aclError is a simple integer typedef, so it should be explicitly typed.
    - auto ret = aclrtRecordEvent(*impl_.get_event(layer_index),
    -                             stream->get_stream()->stream());
    + const aclError ret = aclrtRecordEvent(*impl_.get_event(layer_index),
    +                                       stream->get_stream()->stream());

References
- Do not use auto for simple/primitive types.
    CHECK(base_ptr != MAP_FAILED)
        << "Failed to mmap host memory, size=" << total_bytes;
    if (mlock(base_ptr, total_bytes) != 0) {
      const int err = errno;
According to the repository style guide, we should use fixed-width integers (int32_t, int64_t) instead of plain int.
    - const int err = errno;
    + const int32_t err = errno;

References
- Use fixed-width integers (int32_t, int64_t) instead of plain int, unless the API you are calling explicitly requires int.
    count,
    attrs,
    attrs_indexes,
    1,
According to the repository style guide, constant arguments must be annotated with a comment indicating the parameter name when calling functions or constructors.
    - 1,
    + /*attrsCount=*/1,

References
- Annotate constant arguments with a comment indicating the parameter name when calling functions or constructors.
        .host_blocks_factor(options_.host_blocks_factor())
        .layers_wise_copy_batchs(options_.layers_wise_copy_batchs())
        .enable_mla(options_.enable_mla())
        .enable_kvcache_store(false);
According to the repository style guide, constant arguments must be annotated with a comment indicating the parameter name when calling functions or constructors.
    - .enable_kvcache_store(false);
    + .enable_kvcache_store(/*enable_kvcache_store=*/false);

References
- Annotate constant arguments with a comment indicating the parameter name when calling functions or constructors.
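The annotation rule applies to literals only; arguments that are already named variables or accessor calls document themselves. A minimal sketch with a chained-setter options class is shown below. `CacheOptions` and `make_host_options` are hypothetical and exist only for illustration, and the `double` type chosen for `host_blocks_factor` is an assumption, not taken from the PR.

```cpp
// Hypothetical options class with chained setters; not the real xLLM type.
class CacheOptions {
 public:
  CacheOptions& host_blocks_factor(double value) {
    host_blocks_factor_ = value;
    return *this;
  }
  CacheOptions& enable_mla(bool value) {
    enable_mla_ = value;
    return *this;
  }
  CacheOptions& enable_kvcache_store(bool value) {
    enable_kvcache_store_ = value;
    return *this;
  }

  double host_blocks_factor() const { return host_blocks_factor_; }
  bool enable_mla() const { return enable_mla_; }
  bool enable_kvcache_store() const { return enable_kvcache_store_; }

 private:
  double host_blocks_factor_ = 0.0;
  bool enable_mla_ = false;
  bool enable_kvcache_store_ = true;
};

// Builds host-side options from values supplied by the caller.
CacheOptions make_host_options(double host_blocks_factor, bool enable_mla) {
  CacheOptions opts;
  opts.host_blocks_factor(host_blocks_factor)  // named variable: no comment
      .enable_mla(enable_mla)                  // named variable: no comment
      // Bare literal: annotate so the call site states what is being disabled.
      .enable_kvcache_store(/*enable_kvcache_store=*/false);
  return opts;
}
```

With setter-style builders the method name already repeats the parameter name, so the comment is redundant to a human reader; it is kept in this sketch because the style guide quoted in the review states the rule without exceptions.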
Description
Related Issues
Change Type
Pull Request Checklist
Thank you for contributing to xLLM. Before requesting review, please make sure the following items are complete.
PR Title and Commit Messages
<type>: <subject>.
Pre-commit Checks
pre-commit by running pip install pre-commit or an equivalent command.
pre-commit install.
pre-commit run --all-files and fixed any reported issues.
Self Review
.agents/skills/code-review/references/custom-code-style.md, especially code written or assisted by AI.
main branch.
Build and Test Coverage
python setup.py build test has passed on a CUDA machine.
python setup.py build test has passed on an NPU machine.
python setup.py build test has passed on an MLU machine.
Reviewer Notes