Releases: wevote-project/crystal-text-splitter

v0.2.1 - Performance Optimizations

20 Jan 03:56

Performance Improvements

Three major optimizations for production RAG systems:

1. Overlap Calculation (97-99% memory reduction)

  • Backward char scanning instead of allocating full word arrays
  • Constant memory usage regardless of document size
  • O(limit) space instead of O(n), where n is the document length
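
The backward scan above can be sketched roughly as follows. This is an illustration under assumptions, not the shard's actual code: the name tail_words and its signature are hypothetical, word-mode overlap is assumed, and the scan walks bytes rather than decoded characters, which is safe only because it treats ASCII whitespace alone as a word separator.

```crystal
# Hypothetical helper (not the shard's real implementation): returns the
# last `limit` words of `text` without splitting the text into an array.

# True for ASCII space, tab, newline, vertical tab, form feed, carriage return.
def ascii_space?(byte : UInt8) : Bool
  byte == 0x20 || (0x09 <= byte && byte <= 0x0d)
end

def tail_words(text : String, limit : Int32) : String
  return "" if limit <= 0

  # Skip trailing whitespace so it does not count as a word.
  i = text.bytesize - 1
  while i >= 0 && ascii_space?(text.byte_at(i))
    i -= 1
  end
  last = i # index of the final non-space byte
  return "" if last < 0

  start = last + 1
  words = 0
  while i >= 0 && words < limit
    # Walk back over one word.
    while i >= 0 && !ascii_space?(text.byte_at(i))
      i -= 1
    end
    start = i + 1
    words += 1
    # Walk back over the whitespace run that precedes it.
    while i >= 0 && ascii_space?(text.byte_at(i))
      i -= 1
    end
  end

  # The only allocation is the returned overlap string itself.
  text.byte_slice(start, last - start + 1)
end

puts tail_words("one two three four", 2)
```

Because the loop stops after `limit` words, the work and the allocation depend on the overlap size rather than on the length of the document.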

2. String Allocation (31% memory reduction, 1.2x speedup)

  • Eliminated unnecessary intermediate variables in hot loops
  • Direct String::Builder append in character mode
  • Cleaner, more maintainable code
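
As a rough illustration of appending straight into a String::Builder, the sketch below packs pieces into size-limited chunks. The method name pack and its parameters are hypothetical and the packing rule is an assumption made for the example; it is not taken from the shard's source.

```crystal
# Hypothetical sketch: pack `pieces` into chunks of at most `chunk_size`
# characters, appending each piece directly to a String::Builder instead
# of building intermediate strings with `+`.
def pack(pieces : Array(String), chunk_size : Int32) : Array(String)
  chunks = [] of String
  builder = String::Builder.new
  size = 0 # characters appended to the current builder

  pieces.each do |piece|
    # Flush the current chunk if this piece would overflow it.
    if size > 0 && size + piece.size > chunk_size
      chunks << builder.to_s
      builder = String::Builder.new # a builder is finished after to_s
      size = 0
    end
    builder << piece
    size += piece.size
  end

  chunks << builder.to_s if size > 0
  chunks
end

puts pack(["ab", "cd", "ef"], 4).inspect
```

A piece longer than chunk_size still becomes its own oversized chunk here; a real splitter would need to break it up further.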

3. True Lazy Iterator (4-5x faster for early termination)

  • Fixed design flaw where iterator loaded all chunks upfront
  • State machine approach (no Fibers!)
  • 65-67% memory reduction vs eager evaluation
  • Zero overhead for full iteration
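
A state-machine iterator of this kind might look like the sketch below. It is a minimal example under assumptions, not the shard's implementation: the class name LazyChunks is hypothetical, and it does plain character-mode chunking with no sentence-boundary handling. The only state is a cursor, and each call to next produces exactly one chunk.

```crystal
# Hypothetical sketch of a lazy chunk iterator built on Crystal's
# Iterator module. No Fiber is involved; the state is a single cursor.
class LazyChunks
  include Iterator(String)

  def initialize(@text : String, @chunk_size : Int32, @overlap : Int32)
    @pos = 0
    unless @overlap >= 0 && @chunk_size > @overlap
      raise ArgumentError.new("chunk_size must be greater than overlap")
    end
  end

  # Produces the next chunk on demand, or `stop` when the text is exhausted.
  def next
    return stop if @pos >= @text.size

    chunk = @text[@pos, @chunk_size] # count is clamped at the end of the text
    if @pos + @chunk_size >= @text.size
      @pos = @text.size # last chunk emitted; the next call returns stop
    else
      @pos += @chunk_size - @overlap # step forward, keeping the overlap
    end
    chunk
  end
end

chunks = LazyChunks.new("abcdefghij", 4, 1)
puts chunks.first(2).to_a.inspect
```

With first(2), only two chunks are ever built, which is where early termination saves time and memory. Note that character indexing on non-ASCII strings is not constant time in Crystal, so a production iterator would likely track byte offsets instead.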

Real-World Impact

Processing a 100K-word document:

  • First chunk: 7.52ms → 1.78ms (4.2x faster)
  • Memory: 5,197 MB → 1,781 MB (65% reduction)

Processing 1,000 docs (first 5 chunks each):

  • Time: 3.6s → 1.0s (3.6x faster)
  • Memory: 65% less

Benchmarks Included

Benchmarks with detailed analysis are included in the benchmarks/ directory.

Breaking Changes

None - All optimizations are backward compatible!

v0.2.0 - Iterator API & Performance

24 Nov 06:38

🚀 New Features

Iterator API

Added a memory-efficient iterator API for processing large documents:

# Block syntax - most efficient (no array allocation)
splitter.each_chunk(text) { |chunk| process(chunk) }

# Iterator with lazy evaluation
splitter.each_chunk(text).first(10).each { |chunk| ... }

# Traditional array (backward compatible)
chunks = splitter.split_text(text)

⚡ Performance Improvements

  • 25% faster: Process 1MB in 6.79ms (was 9ms)
  • 57% less memory: Only 17.9MB/op (was 41.9MB/op)
  • Uses String::Builder for efficient string construction
  • Uses Crystal's native String#split method for splitting

Benchmark Results (1MB text)

  • Throughput: 147 ops/sec
  • Latency: 6.79ms
  • Memory: 17.9MB per operation
  • Chunks: 1,249

📚 Documentation

  • Added iterator API examples and usage patterns
  • Updated README with performance benchmarks
  • Added examples/iterator_usage.cr with practical examples
  • Updated feature comparison table

🔄 Backward Compatibility

All existing code continues to work. The split_text() method returns an Array as before.

💡 Benefits

  1. Memory efficiency: Process huge documents without loading all chunks into memory
  2. Streaming capable: Lazy evaluation with iterator chains
  3. Production ready: All 25 tests passing
  4. Type safe: Full Crystal type safety

Full Changelog: v0.1.1...v0.2.0

v0.1.1: CI Fixes

24 Nov 04:02

Crystal Text Splitter v0.1.1

Bug fix release that ensures all CI checks pass.

What's Fixed

  • CI Build Job Removed: Removed unnecessary build job that was failing (libraries don't need executable targets)
  • Code Formatting: Ran crystal tool format on all files to pass formatting checks
  • Ameba Linting: Removed leftover scaffold file that contained TODO comments
  • All CI Checks Pass: Ubuntu/macOS with latest/nightly Crystal all passing

Installation

dependencies:
  text-splitter:
    github: wevote-project/crystal-text-splitter
    version: ~> 0.1.1

Upgrading from v0.1.0

No breaking changes - drop-in replacement. Simply update your shard.yml to point to ~> 0.1.1.

Full Changelog: v0.1.0...v0.1.1

v0.1.0: Initial Release

24 Nov 03:33

Crystal Text Splitter v0.1.0

Intelligent text chunking for RAG (Retrieval-Augmented Generation) and LLM applications.

Features

  • Two splitting modes: Character-based and word-based chunking
  • Configurable overlap: Preserve context between chunks
  • Sentence boundary respect: Keeps sentences intact when possible
  • Production-tested: Extracted from the bills-rag-system project
  • Comprehensive test coverage: 22 test cases covering all functionality

Installation

Add this to your application's shard.yml:

dependencies:
  text-splitter:
    github: wevote-project/crystal-text-splitter
    version: ~> 0.1.0

Quick Start

require "text-splitter"

# Character-based splitting
splitter = Text::Splitter.new(
  chunk_size: 1000,
  chunk_overlap: 200
)

chunks = splitter.split_text("Your long document...")

# Word-based splitting
word_splitter = Text::Splitter.new(
  chunk_size: 280,
  chunk_overlap: 50,
  mode: Text::Splitter::ChunkMode::Words
)

What's New

  • Initial release with character and word splitting modes
  • Full test coverage
  • Complete documentation with examples
  • CI/CD pipeline for Ubuntu and macOS

Perfect for building semantic search, RAG pipelines, and LLM applications in Crystal!