⚡ Optimize inefficient overlap calculation in text splitter#4
⚡ Optimize inefficient overlap calculation in text splitter#4google-labs-jules[bot] wants to merge 6 commits into
Conversation
Optimized the `get_overlap_from_string` method in `src/text-splitter/splitter.cr` to scan from the end of the string to find the last N words, instead of splitting the entire text into an array. This reduces memory allocation and improves performance for large texts with small overlaps. * Implemented `Char::Reader` based reverse scan to find word boundaries. * Used `byte_slice` to extract only the relevant substring. * Preserved whitespace normalization behavior.
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
There was a problem hiding this comment.
Pull request overview
This PR optimizes the get_overlap_from_string method in the text splitter to reduce memory allocation when calculating chunk overlaps. The optimization replaces a text.split approach that allocates memory for all words with a backward-scanning algorithm that identifies the start position of the last N words without creating intermediate arrays.
Changes:
- Replaced
text.splitbased overlap calculation with character-by-character backward scanning usingChar::Reader - Implemented word counting by scanning backwards from the end of the string
- Added early return for cases where overlap limit is 0
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Deleted the unused 'found_limit' variable from the word limit logic in the text splitter module to clean up the code.
Introduces two new benchmark scripts: one for measuring full text splitting performance in real-world scenarios, and another for comparing memory and speed of old vs. optimized overlap calculation methods. These benchmarks demonstrate improved scalability and significant memory savings for large documents.
|
Performance Results Speed Performance:
Memory Performance (per operation):
Real-world Impact For a 100,000 word document (book chapter):
Bottom Line This is an excellent optimization for RAG applications:
|
Simplified benchmarking logic by removing unnecessary variable assignments and using direct calls to split_text. Fixed minor formatting issues and improved output consistency in benchmark_full_split.cr. Updated comment formatting in benchmark_overlap.cr for consistency.
Eliminates unnecessary intermediate string variables in both character and word chunking modes by directly appending sentence and punctuation. This reduces memory allocations and improves performance, especially in character mode, while maintaining identical behavior.
Release includes performance optimizations: - Optimized overlap calculation (97-99% memory reduction) - Eliminated unnecessary string allocations in hot loops (31% memory reduction, 1.2x speedup) - Added comprehensive benchmarks and documentation Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
💡 What:
Optimized
get_overlap_from_stringto avoid splitting the entire text into an array of words when calculating overlap. Instead, it now scans backwards from the end of the string usingChar::Readerto identify the starting position of the lastNwords.🎯 Why:
The original implementation used
text.split, which allocates a new string for every word in the text and an array to hold them. For large documents with small overlaps (e.g., 1000-word chunks with 200-word overlap), this was inefficient as it processed and allocated memory for the entire text only to discard most of it. The new implementation is O(N) in time (scanning) but O(1) in auxiliary memory (excluding the final result), significantly reducing GC pressure.📊 Measured Improvement:
While I was unable to run the benchmark directly due to the lack of a Crystal compiler in the environment, the algorithmic improvement is clear:
Verified logic correctness by tracing edge cases (empty strings, fewer words than limit, whitespace handling) and ensuring they match
text.split.last(N).join(" ")behavior.PR created automatically by Jules for task 574547704945957741 started by @antarr