fix: avoid topic fallback for non-Latin titles via pragmatic ASCII transliteration (#1526)

ahmedqasid · claude · LinkinStars · web-flow · commit e884bb61cb1b · 2026-06-03T22:06:26.000+08:00
# fix: avoid `topic` fallback for non-Latin titles via pragmatic ASCII transliteration

> **Scope update (in response to review):** this PR is intentionally broader than its original "Arabic-only" framing. The implementation changes URL slug generation for **every non-Latin, non-CJK script** that `slugify` previously stripped; see *Scope* below for the explicit list. The goal is *not* linguistically correct romanization; it is "avoid collapsing to `/topic` by producing a usable ASCII slug."

## What this PR is (and isn't)

**Goal:** when a question title contains characters outside Basic Latin / Latin Extended / CJK Han, generate a URL slug that is a deterministic ASCII approximation instead of letting `slugify` strip everything and falling back to the literal `"topic"`.

**Non-goal:** this is *not* a linguistically correct multi-language romanizer. The output is a machine-acceptable ASCII slug, not what a native speaker would choose. For example, `こんにちは` → `konnichiha` (not the more natural `kon'nichiwa`), `ไทย` → `aithy` (not `thai`). Treat the slug as an opaque, stable, indexable identifier: the path after `/questions/<id>/` is for SEO and shareability, and the canonical reference is always the ID.

## The bug

Pure non-Latin titles previously got stripped by `slugify.Slugify`, hit the empty-result fallback in `htmltext.UrlTitle`, and collapsed to the literal slug `"topic"`. On a live multilingual site, every Arabic / Thai / Japanese-hiragana / Korean / Hebrew / Cyrillic question ended up at `/questions/<id>/topic`.

## The fix

`UrlTitle()` gets a `convertNonLatin` pre-step that mirrors the existing `convertChinese` pre-step pattern, using `github.com/mozillazg/go-unidecode` (same author as `go-pinyin` already in the repo, to minimise new-dep friction).
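
For orientation, the detection half of this pre-step can be sketched with the standard library alone. This is a minimal illustration, assuming the skip ranges named in this PR (ASCII through Latin Extended-B, plus Han). The name `needsTransliteration` is hypothetical, and the transliteration call itself is left out so the sketch has no third-party dependency.

```go
package main

import (
	"fmt"
	"unicode"
)

// needsTransliteration is a hypothetical stand-in for the PR's detector.
// It reports whether s contains a letter outside ASCII, Latin-1 Supplement,
// Latin Extended-A/B, and the Han script.
func needsTransliteration(s string) bool {
	for _, r := range s {
		switch {
		case r <= 0x024F: // ASCII through Latin Extended-B: left alone
			continue
		case unicode.Is(unicode.Han, r): // Han: left to the pinyin pre-step
			continue
		case unicode.IsLetter(r): // a letter from any other script
			return true
		}
	}
	return false
}

func main() {
	titles := []string{"hello world", "这是一个标题", "كيف حالك", "😂😂😂"}
	for _, title := range titles {
		fmt.Printf("%q -> %v\n", title, needsTransliteration(title))
	}
}
```

Under these assumptions only the Arabic title reports `true`. Emoji and punctuation are not letters, so they never trigger the pre-step on their own.
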

```
UrlTitle(title)
  → convertChinese(title)     // pre-existing: Han-block → pinyin
  → convertNonLatin(title)    // NEW: detect non-Latin letters → unidecode to ASCII
  → clearEmoji / slugify / url.QueryEscape / cutLongTitle   (unchanged)
```

The non-Latin detector skips ASCII, Latin-1 Supplement, Latin Extended-A/B, and CJK Han. Inputs with no letters outside those ranges short-circuit and are returned unchanged, so Latin-only and Chinese-only inputs remain byte-identical (pinned by tests).

## Scope: what scripts are affected

This PR changes behavior for **any** title containing letters in scripts that `slugify` doesn't handle. Confirmed by tests in `pkg/htmltext/htmltext_test.go`:

| Script | Example title | Before | After |
| --- | --- | --- | --- |
| Arabic | `كيف حالك` | `topic` | `kyf-hlk` |
| Mixed Latin + Arabic | `مرحبا hello` | `hello` | `mrhb-hello` |
| Thai | `ไทย ไทย` | `topic` | `aithy-aithy` |
| Japanese hiragana | `こんにちは` | `topic` | `konnichiha` |
| Korean | `안녕하세요` | `topic` | `annyeonghaseyo` |
| Hebrew | `שלום עולם` | `topic` | `shlvm-vlm` |
| Cyrillic | `Привет мир` | `topic` | `privet-mir` |

**Unchanged:**

| Case | Behavior |
| --- | --- |
| Pure Latin (`hello world`) | unchanged → `hello-world` |
| Pure Chinese (`这是一个，标题，title`) | unchanged → `zhe-shi-yi-ge-biao-ti` (pinyin path) |
| Japanese with Han-block kanji (`日本`) | unchanged → `ri-ben` (caught by the pre-existing pinyin path and treated as a Chinese reading, not Japanese; this is a pre-existing limitation, **not** introduced by this PR) |
| Emoji only (`😂😂😂`) | unchanged → `topic` |
| Empty / whitespace | unchanged → `topic` |

## Transliteration quality: explicit acknowledgement

`go-unidecode` is a generic Unicode → ASCII approximation. It is **not** a per-language romanization library. Specifically:

- It will pick *one* approximation per codepoint regardless of language context. `ใ` → `ai` (Thai romanization is `i` or `ai` depending on standard), `한` → `han`, `語` → `Yu` (Chinese pinyin reading even when used in Japanese), etc.
- The result is *good enough* to be a stable, URL-safe, human-recognizable handle, but speakers of the source language will not consider it "correct."
- It is deterministic, so the same title always produces the same slug, which matters because `url_title` is recomputed on every request.

If maintainers prefer to scope this PR more narrowly (e.g. Arabic only, and reject Thai/Hebrew/Cyrillic/etc.), the detector in `containsNonLatin` can be tightened to specific Unicode blocks, but that means the other scripts continue to collapse to `topic`, which is the bug we're trying to fix. I'd argue the broader fix is preferable to a piecemeal one, but happy to narrow if you want.

## Live deployment / real-world verification

This patch has been running in production on **[ask.namasoft.com](https://ask.namasoft.com)** (an Apache Answer instance we operate), built directly from this branch via `docker compose build`. The site hosts Arabic-language questions, so the fix exercises the affected code path on every page load.

Sample question URL on the deployed instance:

> `https://ask.namasoft.com/questions/10010000000000115`

The slug in the URL is the transliterated Arabic title rather than `topic`. No data migration was needed since `url_title` is computed on every request from `Title` and never persisted (see *Why this is safe to ship* below).

## Admin-configurable

The transliteration is gated by a package-level `atomic.Bool` (default **on**, since the current behavior is objectively broken for affected users):

- `htmltext.SetTransliterateNonLatin(enabled bool)`
- `htmltext.IsTransliterateNonLatinEnabled() bool`

This is deliberately the minimum surface needed to satisfy "the setting must be readable from `UrlTitle()`".
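
A minimal sketch of this toggle pattern, assuming Go 1.19+ for `atomic.Bool`. The names `slugToggle`, `setSlugToggle`, and `slugOrFallback` are hypothetical, and `slugOrFallback` is a deliberate simplification of the real pipeline. The point it illustrates is that the zero value of `atomic.Bool` is `false`, so default-on needs an explicit `Store(true)` at init time.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// slugToggle is a hypothetical package-level switch.
// Its zero value is false, so default-on must be set explicitly.
var slugToggle atomic.Bool

func init() {
	slugToggle.Store(true) // default-on
}

// setSlugToggle is safe to call from any goroutine, e.g. a settings handler.
func setSlugToggle(enabled bool) {
	slugToggle.Store(enabled)
}

// slugOrFallback stands in for the slug pipeline: it returns the
// transliterated slug when the toggle is on, and "topic" otherwise.
func slugOrFallback(transliterated string) string {
	if slugToggle.Load() {
		return transliterated
	}
	return "topic"
}

func main() {
	fmt.Println(slugOrFallback("kyf-hlk")) // toggle on by default
	setSlugToggle(false)
	fmt.Println(slugOrFallback("kyf-hlk")) // toggle off
}
```

Because reads go through `Load()`, callers need no lock and no extra parameter, which is what lets existing call sites stay unchanged.
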
A follow-up PR can add an admin UI section that calls `SetTransliterateNonLatin` on save and on startup, without having to re-plumb every `htmltext.UrlTitle` call site through `context.Context`.

**Default choice (please confirm):** I picked **default-on** because the existing `topic` behavior is a bug for affected users. If you'd prefer default-off for strict backward compat on existing installs, flip the `init()` in `pkg/htmltext/htmltext.go` to `Store(false)` and surface the toggle as opt-in.

## Why this is safe to ship

- `url_title` is **not** a persisted column. It's not on the `Question` entity in `internal/entity/question_entity.go`, no migration has ever added/dropped it, and every call site (`question_service.go`, `revision_service.go`, `vote_service.go`, search/report/review/rank/comment services, controllers, repos) recomputes it from `Title` at response-build time via `htmltext.UrlTitle(...)`.
- That means the fix is read-only: existing rows light up with correct slugs on the next request, with no migration and no data rewrite.
- Rollback is just redeploying the prior image; nothing on disk changes.

## Test coverage

`pkg/htmltext/htmltext_test.go`:

- **`TestUrlTitleTable`**: table-driven, one case per affected script (the full matrix above), plus:
  - `empty` → `topic`
  - `pure latin unchanged` → byte-identical to pre-fix
  - `pure chinese unchanged` → byte-identical to pre-fix (pins existing pinyin behavior)
  - `japanese kanji goes through pinyin path unchanged` → documents the pre-existing Han-block limitation
  - `emoji only falls back to topic` → unchanged
  - `long arabic truncates at cutLongTitle boundary` → exercises the 150-byte cap and UTF-8 boundary safety
- **`TestUrlTitleTransliterationToggle`**: with the toggle off, non-Latin titles collapse to `topic` (pre-fix behavior); with it on, they transliterate.
- Existing `TestUrlTitle` left untouched.

Test plan for reviewers:

- [ ] `go test ./pkg/htmltext/...`: all pass
- [ ] Visit the live sample URL above and confirm slug is transliterated, not `topic`
- [ ] Verify Chinese / Latin / emoji-only / empty behavior is byte-identical to `main` (covered by table tests)

## Out of scope (intentionally)

- No admin UI / site setting plumbing in this PR; see *Admin-configurable* above. Happy to do the React `Non-Latin Languages Handling` admin page + `SiteType` + service / controller / migration in a follow-up if maintainers want it.
- No change to the `"topic"` empty-result fallback.
- No plugin interface for slug generation; this PR mirrors the existing `convertChinese` pre-step pattern instead.
- No per-language romanization library. This is an explicit non-goal; see *Transliteration quality* above.

## Issues / discussion

I didn't find an existing upstream issue covering this; happy to be pointed at one if there is.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: LinkinStars <linkinstar@foxmail.com>
diff --git a/docs/release/licenses/LICENSE-mozillazg-go-unidecode.txt b/docs/release/licenses/LICENSE-mozillazg-go-unidecode.txt
@@ -0,0 +1,21 @@
+The MIT License (MIT)
+
+Copyright (c) 2016 mozillazg
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/go.mod b/go.mod
@@ -43,6 +43,7 @@ require (
 	github.com/mark3labs/mcp-go v0.43.2
 	github.com/microcosm-cc/bluemonday v1.0.27
 	github.com/mozillazg/go-pinyin v0.20.0
+	github.com/mozillazg/go-unidecode v0.2.0
 	github.com/ory/dockertest/v3 v3.11.0
 	github.com/robfig/cron/v3 v3.0.1
 	github.com/sashabaranov/go-openai v1.41.2
diff --git a/go.sum b/go.sum
@@ -462,6 +462,8 @@ github.com/modern-go/reflect2 v1.0.2 h1:xBagoLtFs94CBntxluKeaWgTMpvLxC4ur3nMaC9G
 github.com/modern-go/reflect2 v1.0.2/go.mod h1:yWuevngMOJpCy52FWWMvUC8ws7m/LJsjYzDa0/r8luk=
 github.com/mozillazg/go-pinyin v0.20.0 h1:BtR3DsxpApHfKReaPO1fCqF4pThRwH9uwvXzm+GnMFQ=
 github.com/mozillazg/go-pinyin v0.20.0/go.mod h1:iR4EnMMRXkfpFVV5FMi4FNB6wGq9NV6uDWbUuPhP4Yc=
+github.com/mozillazg/go-unidecode v0.2.0 h1:vFGEzAH9KSwyWmXCOblazEWDh7fOkpmy/Z4ArmamSUc=
+github.com/mozillazg/go-unidecode v0.2.0/go.mod h1:zB48+/Z5toiRolOZy9ksLryJ976VIwmDmpQ2quyt1aA=
 github.com/mwitkow/go-conntrack v0.0.0-20161129095857-cc309e4a2223/go.mod h1:qRWi+5nqEBWmkhHvq77mSJWrCKwh8bxhgT7d/eI7P4U=
 github.com/nats-io/jwt v0.3.0/go.mod h1:fRYCDE99xlTsqUzISS1Bi75UBJ6ljOJQOAAu5VglpSg=
 github.com/nats-io/jwt v0.3.2/go.mod h1:/euKqTS1ZD+zzjYrY7pseZrTtWQSjujC7xjPc8wL6eU=
diff --git a/pkg/htmltext/htmltext.go b/pkg/htmltext/htmltext.go
@@ -25,13 +25,16 @@ import (
 	"net/url"
 	"regexp"
 	"strings"
+	"sync/atomic"
+	"unicode"
 	"unicode/utf8"
 
 	"github.com/Machiel/slugify"
 	"github.com/apache/answer/pkg/checker"
 	"github.com/apache/answer/pkg/converter"
 	strip "github.com/grokify/html-strip-tags-go"
 	"github.com/mozillazg/go-pinyin"
+	"github.com/mozillazg/go-unidecode"
 )
 
 var (
@@ -47,8 +50,27 @@ var (
 		"\r", " ",
 		"\t", " ",
 	)
+
+	// Without this, pure non-Latin titles (Arabic, Cyrillic, Hebrew, ...) get
+	// stripped by slugify and collapse to the "topic" fallback. Chinese is
+	// handled separately by convertChinese.
+	transliterateNonLatin atomic.Bool
 )
 
+func init() {
+	transliterateNonLatin.Store(true)
+}
+
+// SetTransliterateNonLatin toggles non-Latin script transliteration for URL slugs.
+func SetTransliterateNonLatin(enabled bool) {
+	transliterateNonLatin.Store(enabled)
+}
+
+// IsTransliterateNonLatinEnabled reports whether non-Latin transliteration is on.
+func IsTransliterateNonLatinEnabled() bool {
+	return transliterateNonLatin.Load()
+}
+
 // ClearText clear HTML, get the clear text
 func ClearText(html string) string {
 	if html == "" {
@@ -66,6 +88,9 @@ func ClearText(html string) string {
 
 func UrlTitle(title string) (text string) {
 	title = convertChinese(title)
+	if transliterateNonLatin.Load() {
+		title = convertNonLatin(title)
+	}
 	title = clearEmoji(title)
 	title = slugify.Slugify(title)
 	title = url.QueryEscape(title)
@@ -95,6 +120,30 @@ func convertChinese(content string) string {
 	return strings.Join(pinyin.LazyConvert(content, nil), "-")
 }
 
+// Short-circuits on Latin-only / Chinese-only input so existing slugs stay byte-identical.
+func convertNonLatin(content string) string {
+	if !containsNonLatin(content) {
+		return content
+	}
+	return unidecode.Unidecode(content)
+}
+
+func containsNonLatin(content string) bool {
+	for _, r := range content {
+		switch {
+		case r < 0x0080: // ASCII
+			continue
+		case r >= 0x0080 && r <= 0x024F: // Latin-1 Supplement, Latin Extended-A/B
+			continue
+		case unicode.Is(unicode.Han, r): // handled by convertChinese
+			continue
+		case unicode.IsLetter(r):
+			return true
+		}
+	}
+	return false
+}
+
 func cutLongTitle(title string) string {
 	maxBytes := 150
 	if len(title) <= maxBytes {
diff --git a/pkg/htmltext/htmltext_test.go b/pkg/htmltext/htmltext_test.go
@@ -87,6 +87,111 @@ func TestUrlTitle(t *testing.T) {
 	}
 }
 
+func TestUrlTitleTable(t *testing.T) {
+	// Long pure-Arabic title: 50 copies of the same Arabic word, joined by spaces.
+	// Unidecode of "كيف" is "kyf", so the slug becomes "kyf-" repeated and
+	// exceeds cutLongTitle's 150-byte cap.
+	longArabic := strings.Repeat("كيف ", 50)
+	wantLongArabic := strings.Repeat("kyf-", 37) + "ky" // 37*4 + 2 = 150 bytes
+
+	cases := []struct {
+		name  string
+		title string
+		want  string
+	}{
+		{
+			name:  "empty",
+			title: "",
+			want:  "topic",
+		},
+		{
+			name:  "pure latin unchanged",
+			title: "hello world",
+			want:  "hello-world",
+		},
+		{
+			// Pinyin conversion drops Latin runes by design — matches pre-fix behavior.
+			name:  "pure chinese unchanged",
+			title: "这是一个，标题，title",
+			want:  "zhe-shi-yi-ge-biao-ti",
+		},
+		{
+			// The fix: previously collapsed to "topic" for all of these scripts.
+			// Outputs are an ASCII approximation, not linguistically correct
+			// romanization — see PR description.
+			name:  "arabic transliterated",
+			title: "كيف حالك",
+			want:  "kyf-hlk",
+		},
+		{
+			name:  "mixed latin and arabic",
+			title: "مرحبا hello",
+			want:  "mrhb-hello",
+		},
+		{
+			name:  "thai transliterated",
+			title: "ไทย ไทย",
+			want:  "aithy-aithy",
+		},
+		{
+			name:  "japanese hiragana transliterated",
+			title: "こんにちは",
+			want:  "konnichiha",
+		},
+		{
+			// Japanese with Han-block kanji is caught by the pre-existing pinyin
+			// pre-step (Chinese reading, not Japanese), so this path is unchanged
+			// by this PR. Pinning to document the existing behavior.
+			name:  "japanese kanji goes through pinyin path unchanged",
+			title: "日本",
+			want:  "ri-ben",
+		},
+		{
+			name:  "korean transliterated",
+			title: "안녕하세요",
+			want:  "annyeonghaseyo",
+		},
+		{
+			name:  "hebrew transliterated",
+			title: "שלום עולם",
+			want:  "shlvm-vlm",
+		},
+		{
+			name:  "cyrillic transliterated",
+			title: "Привет мир",
+			want:  "privet-mir",
+		},
+		{
+			name:  "emoji only falls back to topic",
+			title: "😂😂😂",
+			want:  "topic",
+		},
+		{
+			name:  "long arabic truncates at cutLongTitle boundary",
+			title: longArabic,
+			want:  wantLongArabic,
+		},
+	}
+	for _, tc := range cases {
+		t.Run(tc.name, func(t *testing.T) {
+			got := UrlTitle(tc.title)
+			assert.Equal(t, tc.want, got)
+		})
+	}
+}
+
+func TestUrlTitleTransliterationToggle(t *testing.T) {
+	defer SetTransliterateNonLatin(true)
+
+	SetTransliterateNonLatin(false)
+	// With transliteration off, pure-Arabic titles collapse to the existing
+	// "topic" fallback (the pre-fix behavior).
+	assert.Equal(t, "topic", UrlTitle("كيف حالك"))
+
+	SetTransliterateNonLatin(true)
+	assert.Equal(t, "kyf-hlk", UrlTitle("كيف حالك"))
+}
+
 func TestFindFirstMatchedWord(t *testing.T) {
 	var (
 		expectedWord,