Skip to content

Commit e884bb6

Browse files
ahmedqasidclaudeLinkinStars
authored
fix: avoid topic fallback for non-Latin titles via pragmatic ASCII transliteration (#1526)
# fix: avoid `topic` fallback for non-Latin titles via pragmatic ASCII transliteration > **Scope update (in response to review):** this PR is intentionally broader than its original "Arabic-only" framing. The implementation changes URL slug generation for **every non-Latin, non-CJK script** that `slugify` previously stripped — see *Scope* below for the explicit list. The goal is *not* linguistically correct romanization; it is "avoid collapsing to `/topic` by producing a usable ASCII slug." ## What this PR is (and isn't) **Goal:** when a question title contains characters outside Basic Latin / Latin Extended / CJK Han, generate a URL slug that is a deterministic ASCII approximation instead of letting `slugify` strip everything and falling back to the literal `"topic"`. **Non-goal:** this is *not* a linguistically correct multi-language romanizer. The output is a machine-acceptable ASCII slug, not what a native speaker would choose. For example, `こんにちは` → `konnichiha` (not the more natural `kon'nichiwa`), `ไทย` → `aithy` (not `thai`). Treat the slug as an opaque, stable, indexable identifier — the path-after-`/questions/<id>/` is for SEO and shareability, the canonical reference is always the ID. ## The bug Pure non-Latin titles previously got stripped by `slugify.Slugify`, hit the empty-result fallback in `htmltext.UrlTitle`, and collapsed to the literal slug `"topic"`. On a live multilingual site, every Arabic / Thai / Japanese-hiragana / Korean / Hebrew / Cyrillic question ended up at `/questions/<id>/topic`. ## The fix `UrlTitle()` gets a `convertNonLatin` pre-step that mirrors the existing `convertChinese` pre-step pattern, using `github.com/mozillazg/go-unidecode` (same author as `go-pinyin` already in the repo, to minimise new-dep friction). ``` UrlTitle(title) → convertChinese(title) // pre-existing: Han-block → pinyin → convertNonLatin(title) // NEW: detect non-Latin letters → unidecode to ASCII → clearEmoji / slugify / url.QueryEscape / cutLongTitle (unchanged) ``` The non-Latin detector skips ASCII, Latin-1 Supplement, Latin Extended-A/B, and CJK Han. Inputs that hit none of those non-Latin letter categories short-circuit and return unchanged, so Latin-only and Chinese-only inputs remain byte-identical (pinned by tests). ## Scope — what scripts are affected This PR changes behavior for **any** title containing letters in scripts that `slugify` doesn't handle. Confirmed by tests in `pkg/htmltext/htmltext_test.go`: | Script | Example title | Before | After | | --- | --- | --- | --- | | Arabic | `كيف حالك` | `topic` | `kyf-hlk` | | Mixed Latin + Arabic | `مرحبا hello` | `hello` | `mrhb-hello` | | Thai | `ไทย ไทย` | `topic` | `aithy-aithy` | | Japanese hiragana | `こんにちは` | `topic` | `konnichiha` | | Korean | `안녕하세요` | `topic` | `annyeonghaseyo` | | Hebrew | `שלום עולם` | `topic` | `shlvm-vlm` | | Cyrillic | `Привет мир` | `topic` | `privet-mir` | **Unchanged:** | Case | Behavior | | --- | --- | | Pure Latin (`hello world`) | unchanged → `hello-world` | | Pure Chinese (`这是一个,标题,title`) | unchanged → `zhe-shi-yi-ge-biao-ti` (pinyin path) | | Japanese with Han-block kanji (`日本`) | unchanged → `ri-ben` (caught by pre-existing pinyin path; treated as Chinese reading, not Japanese — a pre-existing limitation, **not** introduced by this PR) | | Emoji only (`😂😂😂`) | unchanged → `topic` | | Empty / whitespace | unchanged → `topic` | ## Transliteration quality — explicit acknowledgement `go-unidecode` is a generic Unicode → ASCII approximation. It is **not** a per-language romanization library. Specifically: - It will pick *one* approximation per codepoint regardless of language context. `ใ` → `ai` (Thai romanization is `i` or `ai` depending on standard), `한` → `han`, `語` → `Yu` (Chinese pinyin reading even when used in Japanese), etc. - The result is *good enough* to be a stable, URL-safe, human-recognizable handle, but speakers of the source language will not consider it "correct." - It is deterministic, so the same title always produces the same slug — important since `url_title` is recomputed on every request. If maintainers prefer to scope this PR more narrowly (e.g. Arabic only, and reject Thai/Hebrew/Cyrillic/etc.), the detector in `containsNonLatin` can be tightened to specific Unicode blocks — but that means the other scripts continue to collapse to `topic`, which is the bug we're trying to fix. I'd argue the broader fix is preferable to a piecemeal one, but happy to narrow if you want. ## Live deployment / real-world verification This patch has been running in production on **[ask.namasoft.com](https://ask.namasoft.com)** (an Apache Answer instance we operate) since deployment, built directly from this branch via `docker compose build`. The site hosts Arabic-language questions, so the fix exercises the affected code path on every page load. Sample question URL on the deployed instance: > `https://ask.namasoft.com/questions/10010000000000115` The slug in the URL is the transliterated Arabic title rather than `topic`. No data migration was needed since `url_title` is computed on every request from `Title` and never persisted (see *Why this is safe to ship* below). ## Admin-configurable The transliteration is gated by a package-level `atomic.Bool` (default **on**, since the current behavior is objectively broken for affected users): - `htmltext.SetTransliterateNonLatin(enabled bool)` - `htmltext.IsTransliterateNonLatinEnabled() bool` This is deliberately the minimum surface needed to satisfy "the setting must be readable from `UrlTitle()`". A follow-up PR can add an admin UI section that calls `SetTransliterateNonLatin` on save and on startup, without having to re-plumb every `htmltext.UrlTitle` call site through `context.Context`. **Default choice — please confirm:** I picked **default-on** because the existing `topic` behavior is a bug for affected users. If you'd prefer default-off for strict backward compat on existing installs, flip the `init()` in `pkg/htmltext/htmltext.go` to `Store(false)` and surface the toggle as opt-in. ## Why this is safe to ship - `url_title` is **not** a persisted column. It's not on the `Question` entity in `internal/entity/question_entity.go`, no migration has ever added/dropped it, and every call site (`question_service.go`, `revision_service.go`, `vote_service.go`, search/report/review/rank/comment services, controllers, repos) recomputes it from `Title` at response-build time via `htmltext.UrlTitle(...)`. - That means the fix is read-only: existing rows light up with correct slugs on the next request, with no migration and no data rewrite. - Rollback is just redeploying the prior image; nothing on disk changes. ## Test coverage `pkg/htmltext/htmltext_test.go`: - **`TestUrlTitleTable`** — table-driven, one case per affected script (the full matrix above), plus: - `empty` → `topic` - `pure latin unchanged` → byte-identical to pre-fix - `pure chinese unchanged` → byte-identical to pre-fix (pins existing pinyin behavior) - `japanese kanji goes through pinyin path unchanged` → documents the pre-existing Han-block limitation - `emoji only falls back to topic` → unchanged - `long arabic truncates at cutLongTitle boundary` → exercises the 150-byte cap and UTF-8 boundary safety - **`TestUrlTitleTransliterationToggle`** — with the toggle off, non-Latin titles collapse to `topic` (pre-fix behavior); with it on, they transliterate. - Existing `TestUrlTitle` left untouched. Test plan for reviewers: - [ ] `go test ./pkg/htmltext/...` — all pass - [ ] Visit the live sample URL above and confirm slug is transliterated, not `topic` - [ ] Verify Chinese / Latin / emoji-only / empty behavior is byte-identical to `main` (covered by table tests) ## Out of scope (intentionally) - No admin UI / site setting plumbing in this PR — see *Admin-configurable* above. Happy to do the React `Non-Latin Languages Handling` admin page + `SiteType` + service / controller / migration in a follow-up if maintainers want it. - No change to the `"topic"` empty-result fallback. - No plugin interface for slug generation — mirrored the existing `convertChinese` pre-step pattern instead. - No per-language romanization library — this is an explicit non-goal; see *Transliteration quality* above. ## Issues / discussion I didn't find an existing upstream issue covering this — happy to be pointed at one if there is. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: LinkinStars <linkinstar@foxmail.com>
1 parent 68085ab commit e884bb6

5 files changed

Lines changed: 178 additions & 0 deletions

File tree

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
The MIT License (MIT)
2+
3+
Copyright (c) 2016 mozillazg
4+
5+
Permission is hereby granted, free of charge, to any person obtaining a copy
6+
of this software and associated documentation files (the "Software"), to deal
7+
in the Software without restriction, including without limitation the rights
8+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9+
copies of the Software, and to permit persons to whom the Software is
10+
furnished to do so, subject to the following conditions:
11+
12+
The above copyright notice and this permission notice shall be included in all
13+
copies or substantial portions of the Software.
14+
15+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21+
SOFTWARE.

go.mod

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -43,6 +43,7 @@ require (
4343
github.com/mark3labs/mcp-go v0.43.2
4444
github.com/microcosm-cc/bluemonday v1.0.27
4545
github.com/mozillazg/go-pinyin v0.20.0
46+
github.com/mozillazg/go-unidecode v0.2.0
4647
github.com/ory/dockertest/v3 v3.11.0
4748
github.com/robfig/cron/v3 v3.0.1
4849
github.com/sashabaranov/go-openai v1.41.2

go.sum

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -462,6 +462,8 @@ github.com/modern-go/reflect2 v1.0.2 h1:xBagoLtFs94CBntxluKeaWgTMpvLxC4ur3nMaC9G
462462
github.com/modern-go/reflect2 v1.0.2/go.mod h1:yWuevngMOJpCy52FWWMvUC8ws7m/LJsjYzDa0/r8luk=
463463
github.com/mozillazg/go-pinyin v0.20.0 h1:BtR3DsxpApHfKReaPO1fCqF4pThRwH9uwvXzm+GnMFQ=
464464
github.com/mozillazg/go-pinyin v0.20.0/go.mod h1:iR4EnMMRXkfpFVV5FMi4FNB6wGq9NV6uDWbUuPhP4Yc=
465+
github.com/mozillazg/go-unidecode v0.2.0 h1:vFGEzAH9KSwyWmXCOblazEWDh7fOkpmy/Z4ArmamSUc=
466+
github.com/mozillazg/go-unidecode v0.2.0/go.mod h1:zB48+/Z5toiRolOZy9ksLryJ976VIwmDmpQ2quyt1aA=
465467
github.com/mwitkow/go-conntrack v0.0.0-20161129095857-cc309e4a2223/go.mod h1:qRWi+5nqEBWmkhHvq77mSJWrCKwh8bxhgT7d/eI7P4U=
466468
github.com/nats-io/jwt v0.3.0/go.mod h1:fRYCDE99xlTsqUzISS1Bi75UBJ6ljOJQOAAu5VglpSg=
467469
github.com/nats-io/jwt v0.3.2/go.mod h1:/euKqTS1ZD+zzjYrY7pseZrTtWQSjujC7xjPc8wL6eU=

pkg/htmltext/htmltext.go

Lines changed: 49 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -25,13 +25,16 @@ import (
2525
"net/url"
2626
"regexp"
2727
"strings"
28+
"sync/atomic"
29+
"unicode"
2830
"unicode/utf8"
2931

3032
"github.com/Machiel/slugify"
3133
"github.com/apache/answer/pkg/checker"
3234
"github.com/apache/answer/pkg/converter"
3335
strip "github.com/grokify/html-strip-tags-go"
3436
"github.com/mozillazg/go-pinyin"
37+
"github.com/mozillazg/go-unidecode"
3538
)
3639

3740
var (
@@ -47,8 +50,27 @@ var (
4750
"\r", " ",
4851
"\t", " ",
4952
)
53+
54+
// Without this, pure non-Latin titles (Arabic, Cyrillic, Hebrew, ...) get
55+
// stripped by slugify and collapse to the "topic" fallback. Chinese is
56+
// handled separately by convertChinese.
57+
transliterateNonLatin atomic.Bool
5058
)
5159

60+
func init() {
61+
transliterateNonLatin.Store(true)
62+
}
63+
64+
// SetTransliterateNonLatin toggles non-Latin script transliteration for URL slugs.
65+
func SetTransliterateNonLatin(enabled bool) {
66+
transliterateNonLatin.Store(enabled)
67+
}
68+
69+
// IsTransliterateNonLatinEnabled reports whether non-Latin transliteration is on.
70+
func IsTransliterateNonLatinEnabled() bool {
71+
return transliterateNonLatin.Load()
72+
}
73+
5274
// ClearText clear HTML, get the clear text
5375
func ClearText(html string) string {
5476
if html == "" {
@@ -66,6 +88,9 @@ func ClearText(html string) string {
6688

6789
func UrlTitle(title string) (text string) {
6890
title = convertChinese(title)
91+
if transliterateNonLatin.Load() {
92+
title = convertNonLatin(title)
93+
}
6994
title = clearEmoji(title)
7095
title = slugify.Slugify(title)
7196
title = url.QueryEscape(title)
@@ -95,6 +120,30 @@ func convertChinese(content string) string {
95120
return strings.Join(pinyin.LazyConvert(content, nil), "-")
96121
}
97122

123+
// Short-circuits on Latin-only / Chinese-only input so existing slugs stay byte-identical.
124+
func convertNonLatin(content string) string {
125+
if !containsNonLatin(content) {
126+
return content
127+
}
128+
return unidecode.Unidecode(content)
129+
}
130+
131+
func containsNonLatin(content string) bool {
132+
for _, r := range content {
133+
switch {
134+
case r < 0x0080: // ASCII
135+
continue
136+
case r >= 0x0080 && r <= 0x024F: // Latin-1 Supplement, Latin Extended-A/B
137+
continue
138+
case unicode.Is(unicode.Han, r): // handled by convertChinese
139+
continue
140+
case unicode.IsLetter(r):
141+
return true
142+
}
143+
}
144+
return false
145+
}
146+
98147
func cutLongTitle(title string) string {
99148
maxBytes := 150
100149
if len(title) <= maxBytes {

pkg/htmltext/htmltext_test.go

Lines changed: 105 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -87,6 +87,111 @@ func TestUrlTitle(t *testing.T) {
8787
}
8888
}
8989

90+
func TestUrlTitleTable(t *testing.T) {
91+
// Long pure-Arabic title: 50 copies of the same Arabic word, joined by spaces.
92+
// Unidecode of "كيف" is "kyf", so the slug becomes "kyf-" repeated and
93+
// exceeds cutLongTitle's 150-byte cap.
94+
longArabic := strings.Repeat("كيف ", 50)
95+
wantLongArabic := strings.Repeat("kyf-", 37) + "ky" // 37*4 + 2 = 150 bytes
96+
97+
cases := []struct {
98+
name string
99+
title string
100+
want string
101+
}{
102+
{
103+
name: "empty",
104+
title: "",
105+
want: "topic",
106+
},
107+
{
108+
name: "pure latin unchanged",
109+
title: "hello world",
110+
want: "hello-world",
111+
},
112+
{
113+
// Pinyin conversion drops Latin runes by design — matches pre-fix behavior.
114+
name: "pure chinese unchanged",
115+
title: "这是一个,标题,title",
116+
want: "zhe-shi-yi-ge-biao-ti",
117+
},
118+
{
119+
// The fix: previously collapsed to "topic" for all of these scripts.
120+
// Outputs are an ASCII approximation, not linguistically correct
121+
// romanization — see PR description.
122+
name: "arabic transliterated",
123+
title: "كيف حالك",
124+
want: "kyf-hlk",
125+
},
126+
{
127+
name: "mixed latin and arabic",
128+
title: "مرحبا hello",
129+
want: "mrhb-hello",
130+
},
131+
{
132+
name: "thai transliterated",
133+
title: "ไทย ไทย",
134+
want: "aithy-aithy",
135+
},
136+
{
137+
name: "japanese hiragana transliterated",
138+
title: "こんにちは",
139+
want: "konnichiha",
140+
},
141+
{
142+
// Japanese with Han-block kanji is caught by the pre-existing pinyin
143+
// pre-step (Chinese reading, not Japanese), so this path is unchanged
144+
// by this PR. Pinning to document the existing behavior.
145+
name: "japanese kanji goes through pinyin path unchanged",
146+
title: "日本",
147+
want: "ri-ben",
148+
},
149+
{
150+
name: "korean transliterated",
151+
title: "안녕하세요",
152+
want: "annyeonghaseyo",
153+
},
154+
{
155+
name: "hebrew transliterated",
156+
title: "שלום עולם",
157+
want: "shlvm-vlm",
158+
},
159+
{
160+
name: "cyrillic transliterated",
161+
title: "Привет мир",
162+
want: "privet-mir",
163+
},
164+
{
165+
name: "emoji only falls back to topic",
166+
title: "😂😂😂",
167+
want: "topic",
168+
},
169+
{
170+
name: "long arabic truncates at cutLongTitle boundary",
171+
title: longArabic,
172+
want: wantLongArabic,
173+
},
174+
}
175+
for _, tc := range cases {
176+
t.Run(tc.name, func(t *testing.T) {
177+
got := UrlTitle(tc.title)
178+
assert.Equal(t, tc.want, got)
179+
})
180+
}
181+
}
182+
183+
func TestUrlTitleTransliterationToggle(t *testing.T) {
184+
defer SetTransliterateNonLatin(true)
185+
186+
SetTransliterateNonLatin(false)
187+
// With transliteration off, pure-Arabic titles collapse to the existing
188+
// "topic" fallback (the pre-fix behavior).
189+
assert.Equal(t, "topic", UrlTitle("كيف حالك"))
190+
191+
SetTransliterateNonLatin(true)
192+
assert.Equal(t, "kyf-hlk", UrlTitle("كيف حالك"))
193+
}
194+
90195
func TestFindFirstMatchedWord(t *testing.T) {
91196
var (
92197
expectedWord,

0 commit comments

Comments
 (0)