Commit e884bb6
fix: avoid topic fallback for non-Latin titles via pragmatic ASCII transliteration (#1526)
# fix: avoid `topic` fallback for non-Latin titles via pragmatic ASCII transliteration
> **Scope update (in response to review):** this PR is intentionally
broader than its original "Arabic-only" framing. The implementation
changes URL slug generation for **every non-Latin, non-CJK script** that
`slugify` previously stripped — see *Scope* below for the explicit list.
The goal is *not* linguistically correct romanization; it is "avoid
collapsing to `/topic` by producing a usable ASCII slug."
## What this PR is (and isn't)
**Goal:** when a question title contains characters outside Basic Latin
/ Latin Extended / CJK Han, generate a URL slug that is a deterministic
ASCII approximation instead of letting `slugify` strip everything and
falling back to the literal `"topic"`.
**Non-goal:** this is *not* a linguistically correct multi-language
romanizer. The output is a machine-acceptable ASCII slug, not what a
native speaker would choose. For example, `こんにちは` → `konnichiha` (not
the more natural `kon'nichiwa`), `ไทย` → `aithy` (not `thai`). Treat the
slug as an opaque, stable, indexable identifier: the path segment after
`/questions/<id>/` exists for SEO and shareability; the canonical
reference is always the ID.
## The bug
Pure non-Latin titles previously got stripped by `slugify.Slugify`, hit
the empty-result fallback in `htmltext.UrlTitle`, and collapsed to the
literal slug `"topic"`. On a live multilingual site, every Arabic / Thai
/ Japanese-hiragana / Korean / Hebrew / Cyrillic question ended up at
`/questions/<id>/topic`.
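The collapse can be reproduced with a simplified stand-in. This is only a sketch: `naiveSlug` and `urlTitle` are hypothetical names, and the ASCII-only filter is an assumption that approximates the stripping described above, not the actual logic of `slugify.Slugify` or `htmltext.UrlTitle`.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveSlug is a hypothetical stand-in for an ASCII-only slugifier:
// it keeps a-z and 0-9, and turns any run of other characters between
// kept characters into a single hyphen.
func naiveSlug(title string) string {
	var b strings.Builder
	pendingDash := false
	for _, r := range strings.ToLower(title) {
		isASCIIAlnum := (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9')
		if !isASCIIAlnum {
			// Only remember a separator once something has been written,
			// so the slug never starts with a hyphen.
			pendingDash = b.Len() > 0
			continue
		}
		if pendingDash {
			b.WriteByte('-')
			pendingDash = false
		}
		b.WriteRune(r)
	}
	return b.String()
}

// urlTitle mimics the empty-result fallback: if nothing survives
// slugification, the literal "topic" is returned.
func urlTitle(title string) string {
	if slug := naiveSlug(title); slug != "" {
		return slug
	}
	return "topic"
}

func main() {
	fmt.Println(urlTitle("hello world")) // hello-world
	fmt.Println(urlTitle("مرحبا hello"))  // hello
	fmt.Println(urlTitle("كيف حالك"))    // topic
}
```

Under this assumption, a purely Arabic title leaves nothing for the slugifier to keep, which is why every such question shared the same `topic` suffix.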
## The fix
`UrlTitle()` gets a `convertNonLatin` pre-step that mirrors the existing
`convertChinese` pre-step pattern, using
`github.com/mozillazg/go-unidecode` (same author as `go-pinyin` already
in the repo, to minimise new-dep friction).
```
UrlTitle(title)
→ convertChinese(title) // pre-existing: Han-block → pinyin
→ convertNonLatin(title) // NEW: detect non-Latin letters → unidecode to ASCII
→ clearEmoji / slugify / url.QueryEscape / cutLongTitle (unchanged)
```
The non-Latin detector skips ASCII, Latin-1 Supplement, Latin
Extended-A/B, and CJK Han. Inputs that contain no letters outside those
skipped ranges short-circuit and are returned unchanged, so Latin-only
and Chinese-only inputs remain byte-identical (pinned by tests).
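A detector along these lines could look like the sketch below. It is an illustration written against the standard `unicode` package, not the code in this PR: the `U+024F` cutoff (the end of Latin Extended-B) and the use of `unicode.IsLetter` are assumptions about how the skipped ranges might be expressed.

```go
package main

import (
	"fmt"
	"unicode"
)

// containsNonLatin reports whether s holds at least one letter that is
// neither in the Latin ranges up to U+024F nor a Han ideograph.
// Assumption: the real detector may draw these boundaries differently.
func containsNonLatin(s string) bool {
	for _, r := range s {
		switch {
		case r <= 0x024F:
			// ASCII, Latin-1 Supplement, Latin Extended-A/B: skip.
			continue
		case unicode.Is(unicode.Han, r):
			// Han is left to the pre-existing pinyin step: skip.
			continue
		case unicode.IsLetter(r):
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsNonLatin("hello world")) // false
	fmt.Println(containsNonLatin("这是一个"))        // false
	fmt.Println(containsNonLatin("كيف حالك"))    // true
}
```

Because emoji are symbols rather than letters, an emoji-only title also returns `false` here and would continue to the unchanged `topic` fallback.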
## Scope — what scripts are affected
This PR changes behavior for **any** title containing letters in scripts
that `slugify` doesn't handle. Confirmed by tests in
`pkg/htmltext/htmltext_test.go`:
| Script | Example title | Before | After |
| --- | --- | --- | --- |
| Arabic | `كيف حالك` | `topic` | `kyf-hlk` |
| Mixed Latin + Arabic | `مرحبا hello` | `hello` | `mrhb-hello` |
| Thai | `ไทย ไทย` | `topic` | `aithy-aithy` |
| Japanese hiragana | `こんにちは` | `topic` | `konnichiha` |
| Korean | `안녕하세요` | `topic` | `annyeonghaseyo` |
| Hebrew | `שלום עולם` | `topic` | `shlvm-vlm` |
| Cyrillic | `Привет мир` | `topic` | `privet-mir` |
**Unchanged:**
| Case | Behavior |
| --- | --- |
| Pure Latin (`hello world`) | unchanged → `hello-world` |
| Pure Chinese (`这是一个,标题,title`) | unchanged → `zhe-shi-yi-ge-biao-ti` (pinyin path) |
| Japanese with Han-block kanji (`日本`) | unchanged → `ri-ben` (caught by pre-existing pinyin path; treated as Chinese reading, not Japanese; a pre-existing limitation, **not** introduced by this PR) |
| Emoji only (`😂😂😂`) | unchanged → `topic` |
| Empty / whitespace | unchanged → `topic` |
## Transliteration quality — explicit acknowledgement
`go-unidecode` is a generic Unicode → ASCII approximation. It is **not**
a per-language romanization library. Specifically:
- It will pick *one* approximation per codepoint regardless of language
context. `ใ` → `ai` (Thai romanization is `i` or `ai` depending on
standard), `한` → `han`, `語` → `Yu` (Chinese pinyin reading even when
used in Japanese), etc.
- The result is *good enough* to be a stable, URL-safe,
human-recognizable handle, but speakers of the source language will not
consider it "correct."
- It is deterministic, so the same title always produces the same slug —
important since `url_title` is recomputed on every request.
If maintainers prefer to scope this PR more narrowly (e.g. Arabic only,
and reject Thai/Hebrew/Cyrillic/etc.), the detector in
`containsNonLatin` can be tightened to specific Unicode blocks — but
that means the other scripts continue to collapse to `topic`, which is
the bug we're trying to fix. I'd argue the broader fix is preferable to
a piecemeal one, but happy to narrow if you want.
## Live deployment / real-world verification
This patch is running in production on
**[ask.namasoft.com](https://ask.namasoft.com)** (an Apache Answer
instance we operate), built directly from this branch via
`docker compose build`. The site hosts Arabic-language questions, so
the fix exercises the affected code path on every page load.
Sample question URL on the deployed instance:
> `https://ask.namasoft.com/questions/10010000000000115`
The slug in the URL is the transliterated Arabic title rather than
`topic`. No data migration was needed since `url_title` is computed on
every request from `Title` and never persisted (see *Why this is safe to
ship* below).
## Admin-configurable
The transliteration is gated by a package-level `atomic.Bool` (default
**on**, since the current behavior is objectively broken for affected
users):
- `htmltext.SetTransliterateNonLatin(enabled bool)`
- `htmltext.IsTransliterateNonLatinEnabled() bool`
This is deliberately the minimum surface needed to satisfy "the setting
must be readable from `UrlTitle()`". A follow-up PR can add an admin UI
section that calls `SetTransliterateNonLatin` on save and on startup,
without having to re-plumb every `htmltext.UrlTitle` call site through
`context.Context`.
**Default choice — please confirm:** I picked **default-on** because the
existing `topic` behavior is a bug for affected users. If you'd prefer
default-off for strict backward compat on existing installs, flip the
`init()` in `pkg/htmltext/htmltext.go` to `Store(false)` and surface the
toggle as opt-in.
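For readers unfamiliar with the pattern, a package-level toggle of this shape can be sketched as follows. The two function names come from the list above; the variable name `transliterateNonLatin` is an assumption, and `atomic.Bool` requires Go 1.19 or newer.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// transliterateNonLatin is a hypothetical name for the package-level flag.
// atomic.Bool makes reads and writes safe across goroutines without a mutex.
var transliterateNonLatin atomic.Bool

func init() {
	// Default-on, matching the choice described above.
	transliterateNonLatin.Store(true)
}

// SetTransliterateNonLatin switches transliteration on or off at runtime.
func SetTransliterateNonLatin(enabled bool) {
	transliterateNonLatin.Store(enabled)
}

// IsTransliterateNonLatinEnabled reports the current setting.
func IsTransliterateNonLatinEnabled() bool {
	return transliterateNonLatin.Load()
}

func main() {
	fmt.Println(IsTransliterateNonLatinEnabled()) // true
	SetTransliterateNonLatin(false)
	fmt.Println(IsTransliterateNonLatinEnabled()) // false
}
```

The zero value of `atomic.Bool` is `false`, which is why a default-on setting needs the explicit `Store(true)` in `init()`.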
## Why this is safe to ship
- `url_title` is **not** a persisted column. It's not on the `Question`
entity in `internal/entity/question_entity.go`, no migration has ever
added/dropped it, and every call site (`question_service.go`,
`revision_service.go`, `vote_service.go`,
search/report/review/rank/comment services, controllers, repos)
recomputes it from `Title` at response-build time via
`htmltext.UrlTitle(...)`.
- That means the fix is read-only: existing rows light up with correct
slugs on the next request, with no migration and no data rewrite.
- Rollback is just redeploying the prior image; nothing on disk changes.
## Test coverage
`pkg/htmltext/htmltext_test.go`:
- **`TestUrlTitleTable`** — table-driven, one case per affected script
(the full matrix above), plus:
- `empty` → `topic`
- `pure latin unchanged` → byte-identical to pre-fix
- `pure chinese unchanged` → byte-identical to pre-fix (pins existing
pinyin behavior)
- `japanese kanji goes through pinyin path unchanged` → documents the
pre-existing Han-block limitation
- `emoji only falls back to topic` → unchanged
- `long arabic truncates at cutLongTitle boundary` → exercises the
150-byte cap and UTF-8 boundary safety
- **`TestUrlTitleTransliterationToggle`** — with the toggle off,
non-Latin titles collapse to `topic` (pre-fix behavior); with it on,
they transliterate.
- Existing `TestUrlTitle` left untouched.
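The "UTF-8 boundary safety" property in the truncation case can be illustrated with a small sketch. `truncateUTF8` is a hypothetical helper, not the repository's `cutLongTitle`; it only shows one common way to cap a string at a byte limit without splitting a multi-byte character.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateUTF8 returns s cut to at most maxBytes bytes, backing up to the
// nearest rune start so that no multi-byte character is split in half.
func truncateUTF8(s string, maxBytes int) string {
	if maxBytes <= 0 {
		return ""
	}
	if len(s) <= maxBytes {
		return s
	}
	cut := maxBytes
	// s[cut] is the first byte that would be dropped. If it is a
	// continuation byte, the cut falls inside a character, so move left.
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

func main() {
	// "é" occupies two bytes, so a 2-byte cap cannot include it.
	fmt.Println(truncateUTF8("héllo", 2)) // h
	fmt.Println(truncateUTF8("héllo", 3)) // hé
}
```

A test for this property typically asserts both that the result stays within the cap and that `utf8.ValidString` holds for the output.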
Test plan for reviewers:
- [ ] `go test ./pkg/htmltext/...` — all pass
- [ ] Visit the live sample URL above and confirm slug is
transliterated, not `topic`
- [ ] Verify Chinese / Latin / emoji-only / empty behavior is
byte-identical to `main` (covered by table tests)
## Out of scope (intentionally)
- No admin UI / site setting plumbing in this PR — see
*Admin-configurable* above. Happy to do the React `Non-Latin Languages
Handling` admin page + `SiteType` + service / controller / migration in
a follow-up if maintainers want it.
- No change to the `"topic"` empty-result fallback.
- No plugin interface for slug generation — mirrored the existing
`convertChinese` pre-step pattern instead.
- No per-language romanization library — this is an explicit non-goal;
see *Transliteration quality* above.
## Issues / discussion
I didn't find an existing upstream issue covering this — happy to be
pointed at one if there is.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: LinkinStars <linkinstar@foxmail.com>
1 parent 68085ab commit e884bb6
5 files changed
Lines changed: 178 additions & 0 deletions
File tree
- docs/release/licenses
- pkg/htmltext