Skip to content

[BUG] [v0.0.7] similarity_ratio() in text_utils.rs returns inflated score for non-ASCII strings by dividing char-distance by byte-length #53421

Description

@nightmare0329

Bug Report

Version: v0.0.7
File: src/cortex-engine/src/text_utils.rs

Description

similarity_ratio() computes the Levenshtein distance using character-based iteration (via levenshtein_distance which operates on Vec<char>), but then divides by a.len().max(b.len()) which is the byte length of the strings. For non-ASCII text (CJK, accented characters, emoji), byte length > char count, causing the ratio to be inflated — strings that differ by one character appear more similar than they actually are.

Screenshot

similarity_ratio inflates score for non-ASCII strings

Root Cause

// src/cortex-engine/src/text_utils.rs lines 144-153
pub fn levenshtein_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();  // char-based
    let b: Vec<char> = b.chars().collect();
    // ...computes char-level edit distance...
}

pub fn similarity_ratio(a: &str, b: &str) -> f32 {
    let distance = levenshtein_distance(a, b);      // char-based distance ✓
    let max_len = a.len().max(b.len());              // BUG: BYTE length ✗
    if max_len == 0 { return 1.0; }
    1.0 - (distance as f32 / max_len as f32)        // wrong for non-ASCII
}

Example

a = "cafe"   (4 bytes, 4 chars)
b = "café"   (5 bytes, 4 chars)  # é = 2 bytes

levenshtein_distance(a, b) = 1  (one char substitution)
Expected: 1.0 - (1/4) = 0.75
Actual:   1.0 - (1/5) = 0.80  ← INFLATED (byte length = 5 used instead of char count = 4)

For CJK strings (3 bytes/char), the inflation is even larger. A string differing by one character would report similarity close to 1.0 when in fact it differs by 1/N (where N is the char count).

Impact

Any feature using similarity_ratio() for fuzzy matching, deduplication, or ranking will produce incorrect results for non-ASCII agent names, file paths, or user messages containing non-ASCII characters.

Fix

pub fn similarity_ratio(a: &str, b: &str) -> f32 {
    let distance = levenshtein_distance(a, b);
    let max_len = a.chars().count().max(b.chars().count());  // char count, not byte length
    if max_len == 0 { return 1.0; }
    1.0 - (distance as f32 / max_len as f32)
}

Hotkey: 5CzBkL6CJWFa7QSFoWxqiodobsGmxD6oLxLQa24tNxvsT9dn
UID: 137

Metadata

Metadata

Assignees

No one assigned

    Labels

    invalidThis doesn't seem right

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions