[BUG] [v0.0.7] similarity_ratio() in text_utils.rs returns inflated score for non-ASCII strings by dividing char-distance by byte-length

## Bug Report

**Version:** v0.0.7
**File:** `src/cortex-engine/src/text_utils.rs`

## Description

`similarity_ratio()` computes the Levenshtein distance using character-based iteration (via `levenshtein_distance` which operates on `Vec<char>`), but then divides by `a.len().max(b.len())` which is the **byte length** of the strings. For non-ASCII text (CJK, accented characters, emoji), byte length > char count, causing the ratio to be **inflated** — strings that differ by one character appear more similar than they actually are.

## Screenshot

![similarity_ratio inflates score for non-ASCII strings](https://raw.githubusercontent.com/nightmare0329/cortex-bug-evidence/main/bug_similarity_ratio_bytes.png)

## Root Cause

```rust
// src/cortex-engine/src/text_utils.rs lines 144-153
pub fn levenshtein_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();  // char-based
    let b: Vec<char> = b.chars().collect();
    // ...computes char-level edit distance...
}

pub fn similarity_ratio(a: &str, b: &str) -> f32 {
    let distance = levenshtein_distance(a, b);      // char-based distance ✓
    let max_len = a.len().max(b.len());              // BUG: BYTE length ✗
    if max_len == 0 { return 1.0; }
    1.0 - (distance as f32 / max_len as f32)        // wrong for non-ASCII
}
```

## Example

```
a = "cafe"   (4 bytes, 4 chars)
b = "café"   (5 bytes, 4 chars)  # é = 2 bytes

levenshtein_distance(a, b) = 1  (one char substitution)
Expected: 1.0 - (1/4) = 0.75
Actual:   1.0 - (1/5) = 0.80  ← INFLATED (byte length = 5 used instead of char count = 4)
```

For CJK strings (3 bytes/char), the inflation is even larger. A string differing by one character would report similarity close to 1.0 when in fact it differs by 1/N (where N is the char count).

## Impact

Any feature using `similarity_ratio()` for fuzzy matching, deduplication, or ranking will produce incorrect results for non-ASCII agent names, file paths, or user messages containing non-ASCII characters.

## Fix

```rust
pub fn similarity_ratio(a: &str, b: &str) -> f32 {
    let distance = levenshtein_distance(a, b);
    let max_len = a.chars().count().max(b.chars().count());  // char count, not byte length
    if max_len == 0 { return 1.0; }
    1.0 - (distance as f32 / max_len as f32)
}
```

**Hotkey:** 5CzBkL6CJWFa7QSFoWxqiodobsGmxD6oLxLQa24tNxvsT9dn
**UID:** 137


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] [v0.0.7] similarity_ratio() in text_utils.rs returns inflated score for non-ASCII strings by dividing char-distance by byte-length #53421

Bug Report

Description

Screenshot

Root Cause

Example

Impact

Fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[BUG] [v0.0.7] similarity_ratio() in text_utils.rs returns inflated score for non-ASCII strings by dividing char-distance by byte-length #53421

Description

Bug Report

Description

Screenshot

Root Cause

Example

Impact

Fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions