Releases: wevote-project/crystal-text-splitter
v0.2.1 - Performance Optimizations
Performance Improvements
Three major optimizations for production RAG systems:
1. Overlap Calculation (97-99% memory reduction)
- Backward char scanning instead of allocating full word arrays
- Constant memory usage regardless of document size
- O(limit) vs O(n) space complexity
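The backward-scanning idea can be sketched as follows. This is a minimal illustration, not the shard's actual implementation: `tail_words` and `ascii_space?` are hypothetical names, and the sketch assumes words are separated by ASCII whitespace. Scanning bytes is safe for UTF-8 text because ASCII whitespace bytes never occur inside a multi-byte character.

```crystal
# Hypothetical helpers for illustration; not part of the shard's API.
def ascii_space?(byte : UInt8) : Bool
  byte == 0x20 || (byte >= 0x09 && byte <= 0x0d)
end

# Returns the last `limit` words of `text` without building a word array.
# Only the returned substring is allocated.
def tail_words(text : String, limit : Int32) : String
  return "" if limit <= 0

  # Skip trailing whitespace so the result ends on a word.
  i = text.bytesize - 1
  while i >= 0 && ascii_space?(text.byte_at(i))
    i -= 1
  end
  finish = i + 1 # exclusive end of the overlap
  return "" if finish == 0

  start = finish
  words = 0
  while i >= 0
    # Walk back over one word.
    while i >= 0 && !ascii_space?(text.byte_at(i))
      i -= 1
    end
    words += 1
    start = i + 1
    break if words == limit
    # Walk back over the whitespace run before it.
    while i >= 0 && ascii_space?(text.byte_at(i))
      i -= 1
    end
  end

  text.byte_slice(start, finish - start)
end

tail_words("one two three four", 2) # => "three four"
```

The scan stops as soon as `limit` words have been seen, so the work and the allocation depend on the overlap size rather than on the length of the document.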
2. String Allocation (31% memory reduction, 1.2x speedup)
- Eliminated unnecessary intermediate variables in hot loops
- Direct String::Builder append in character mode
- Cleaner, more maintainable code
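As a rough illustration of appending directly to a String::Builder in character mode, the sketch below cuts text into fixed-size character chunks. It is an assumption-laden example rather than the shard's code: `char_chunks` is a hypothetical name, and overlap and sentence handling are left out to keep the focus on the builder.

```crystal
# Hypothetical helper for illustration; not part of the shard's API.
# Splits `text` into chunks of at most `chunk_size` characters.
def char_chunks(text : String, chunk_size : Int32) : Array(String)
  raise ArgumentError.new("chunk_size must be positive") unless chunk_size > 0

  chunks = [] of String
  builder = String::Builder.new
  count = 0

  text.each_char do |ch|
    # Append straight into the builder; no temporary strings per character.
    builder << ch
    count += 1
    if count == chunk_size
      chunks << builder.to_s
      builder = String::Builder.new # a builder cannot be reused after to_s
      count = 0
    end
  end

  chunks << builder.to_s if count > 0
  chunks
end

char_chunks("abcdefg", 3) # => ["abc", "def", "g"]
```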
3. True Lazy Iterator (4-5x faster for early termination)
- Fixed design flaw where iterator loaded all chunks upfront
- State machine approach (no Fibers!)
- 65-67% memory reduction vs eager evaluation
- Zero overhead for full iteration
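A lazy, state-machine style iterator can be sketched with Crystal's Iterator module. This is a simplified illustration under stated assumptions, not the shard's implementation: `LazyChunks` is a hypothetical class, it splits by fixed character count only, and it ignores overlap and sentence boundaries. The only state kept between calls is the current offset, so no chunk is produced until it is requested.

```crystal
# Hypothetical class for illustration; not part of the shard's API.
class LazyChunks
  include Iterator(String)

  def initialize(@text : String, @chunk_size : Int32)
    raise ArgumentError.new("chunk_size must be positive") unless @chunk_size > 0
    @pos = 0 # character offset of the next chunk
  end

  # Produces one chunk per call; nothing is computed ahead of time.
  # Note: character indexing is only O(1) for ASCII-only strings.
  def next
    return stop if @pos >= @text.size
    chunk = @text[@pos, @chunk_size]
    @pos += @chunk_size
    chunk
  end
end

LazyChunks.new("abcdefghij", 3).first(2).to_a # => ["abc", "def"]
```

Because `first(2)` is itself lazy, only two calls to `next` are made in the usage line above; the remaining text is never chunked.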
Real-World Impact
Processing a 100K-word document:
- First chunk: 7.52ms → 1.78ms (4.2x faster)
- Memory: 5,197 MB → 1,781 MB (65% reduction)
Processing 1,000 docs (first 5 chunks each):
- Time: 3.6s → 1.0s (3.6x faster)
- Memory: 65% less
Benchmarks Included
Comprehensive benchmarks with detailed analysis are included in the benchmarks/ directory.
Breaking Changes
None - All optimizations are backward compatible!
v0.2.0 - Iterator API & Performance
🚀 New Features
Iterator API
Added a memory-efficient iterator API for processing large documents:
# Block syntax - most efficient (no array allocation)
splitter.each_chunk(text) { |chunk| process(chunk) }
# Iterator with lazy evaluation
splitter.each_chunk(text).first(10).each { |chunk| ... }
# Traditional array (backward compatible)
chunks = splitter.split_text(text)

⚡ Performance Improvements
- 25% faster: Process 1MB in 6.79ms (was 9ms)
- 57% less memory: Only 17.9MB/op (was 41.9MB/op)
- Uses String::Builder for efficient string construction
- Optimized with Crystal's native split method
Benchmark Results (1MB text)
- Throughput: 147 ops/sec
- Latency: 6.79ms
- Memory: 17.9MB per operation
- Chunks: 1,249
📚 Documentation
- Added iterator API examples and usage patterns
- Updated README with performance benchmarks
- Added examples/iterator_usage.cr with practical examples
- Updated feature comparison table
🔄 Backward Compatibility
All existing code continues to work. The split_text() method returns an Array as before.
💡 Benefits
- Memory efficiency: Process huge documents without loading all chunks into memory
- Streaming capable: Lazy evaluation with iterator chains
- Production ready: All 25 tests passing
- Type safe: Full Crystal type safety
Full Changelog: v0.1.1...v0.2.0
v0.1.1: CI Fixes
Crystal Text Splitter v0.1.1
Bug fix release that ensures all CI checks pass.
What's Fixed
- ✅ CI Build Job Removed: Removed unnecessary build job that was failing (libraries don't need executable targets)
- ✅ Code Formatting: Ran crystal tool format on all files to pass formatting checks
- ✅ Ameba Linting: Removed leftover scaffold file that contained TODO comments
- ✅ All CI Checks Pass: Ubuntu/macOS with latest/nightly Crystal all passing
Installation
dependencies:
  text-splitter:
    github: wevote-project/crystal-text-splitter
    version: ~> 0.1.1

Upgrading from v0.1.0
No breaking changes - drop-in replacement. Simply update your shard.yml to point to ~> 0.1.1.
Full Changelog: v0.1.0...v0.1.1
v0.1.0: Initial Release
Crystal Text Splitter v0.1.0
Intelligent text chunking for RAG (Retrieval-Augmented Generation) and LLM applications.
Features
- Two splitting modes: Character-based and word-based chunking
- Configurable overlap: Preserve context between chunks
- Sentence boundary respect: Keeps sentences intact when possible
- Production-tested: Extracted from the bills-rag-system project
- Comprehensive test coverage: 22 test cases covering all functionality
Installation
Add this to your application's shard.yml:
dependencies:
  text-splitter:
    github: wevote-project/crystal-text-splitter
    version: ~> 0.1.0

Quick Start
require "text-splitter"
# Character-based splitting
splitter = Text::Splitter.new(
  chunk_size: 1000,
  chunk_overlap: 200
)
chunks = splitter.split_text("Your long document...")
# Word-based splitting
word_splitter = Text::Splitter.new(
  chunk_size: 280,
  chunk_overlap: 50,
  mode: Text::Splitter::ChunkMode::Words
)

What's New
- Initial release with character and word splitting modes
- Full test coverage
- Complete documentation with examples
- CI/CD pipeline for Ubuntu and macOS
Perfect for building semantic search, RAG pipelines, and LLM applications in Crystal!