fix: preserve angle brackets inside fenced code blocks during sanitization#2308
fix: preserve angle brackets inside fenced code blocks during sanitization#2308lawrence3699 wants to merge 1 commit intogithub:mainfrom
Conversation
…ation bluemonday.StrictPolicy() treats angle-bracket sequences like <int>, <T>, <string> as HTML tags and strips them, even when they appear inside fenced code blocks. This causes code samples containing generics, templates, or any angle-bracket syntax to lose content when read through the MCP tools. The fix adds a protect/restore step around FilterHTMLTags: before bluemonday runs, < and > inside fenced code blocks are replaced with NUL-delimited placeholders that bluemonday passes through unchanged; after bluemonday, the placeholders are restored. The fence detection follows the same pattern used by FilterCodeFenceMetadata, with one improvement: closing fences are accepted when they are at least as long as the opening fence (per CommonMark spec), preventing a longer closing fence from leaking placeholder-protection into subsequent non-code content. Fixes github#2202
There was a problem hiding this comment.
Pull request overview
This PR fixes an issue where Sanitize() would silently drop angle-bracketed tokens like <int> inside fenced code blocks because the HTML sanitizer (bluemonday.StrictPolicy) treats them as HTML tags. It adds a protection/restore step so angle brackets inside fenced code are preserved while HTML outside code blocks is still sanitized.
Changes:
- Refactors
Sanitize()into a stepwise pipeline and addsprotectCodeAngles/restoreCodeAnglesaroundFilterHTMLTags. - Implements placeholder-based shielding of
</>within fenced code blocks. - Adds unit tests for the new protection/restore helpers and end-to-end
Sanitizeregression/security tests.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/sanitize/sanitize.go | Adds code-fence-aware placeholder protection for angle brackets during HTML sanitization. |
| pkg/sanitize/sanitize_test.go | Adds unit and integration tests ensuring code-block angle brackets are preserved and script tags remain stripped. |
| // restoreCodeAngles reverses the placeholder substitution performed by | ||
| // protectCodeAngles. | ||
| func restoreCodeAngles(input string) string { | ||
| s := strings.ReplaceAll(input, codeLtPlaceholder, "<") | ||
| s = strings.ReplaceAll(s, codeGtPlaceholder, ">") | ||
| return s |
There was a problem hiding this comment.
restoreCodeAngles blindly replaces the placeholders everywhere in the sanitized output. Since FilterInvisibleCharacters does not remove NUL ("\x00"), a crafted input could include the placeholder sequences (e.g., "\x00CODELT\x00") outside a fenced block and have them converted into literal </> after FilterHTMLTags, potentially reintroducing HTML tags that the sanitizer would otherwise remove. Consider either (a) stripping \x00 in FilterInvisibleCharacters, and/or (b) making restoreCodeAngles only restore placeholders within fenced code blocks (mirroring protectCodeAngles) so user-supplied placeholder text cannot bypass HTML sanitization.
| for i, line := range lines { | ||
| fenceIdx := strings.Index(line, "```") | ||
|
|
||
| if fenceIdx != -1 && !hasNonWhitespace(line[:fenceIdx]) { | ||
| fenceEnd := fenceIdx | ||
| for fenceEnd < len(line) && line[fenceEnd] == '`' { |
There was a problem hiding this comment.
protectCodeAngles re-implements fenced-code tracking instead of reusing sanitizeCodeFenceLine / FilterCodeFenceMetadata logic, and it uses different closing-fence semantics (>= currentFenceLen here vs exact-length matching in sanitizeCodeFenceLine). This divergence can cause the two passes to disagree about whether subsequent lines are inside a fence (e.g., with a longer closing fence), making future behavior brittle. Consider extracting a shared fence parser/toggler so both steps stay consistent (and updating one place if the fence rules change).
Summary
FilterHTMLTagsusesbluemonday.StrictPolicy()which strips anything that looks like an HTML tag — including<int>,<T>,<string>, etc. inside fenced code blocks. When a user files an issue containing C++/Rust/Java generics in a code block, the MCP tool returns the code with those tokens silently removed.Root cause
Sanitize()feeds the entire text (including code block content) through bluemonday, which has no concept of markdown structure. A line likevector<int> v;inside a fenced block gets<int>stripped because bluemonday interprets it as an unknown HTML tag.Before:
After this fix:
Approach
Added a protect/restore step around
FilterHTMLTagsin theSanitizepipeline:protectCodeAngles— walks the text line-by-line tracking fenced code block state (same fence detection pattern used byFilterCodeFenceMetadata), and replaces</>inside code blocks with NUL-delimited placeholders that bluemonday passes throughrestoreCodeAngles— reverses the placeholders after sanitizationThe closing fence check uses
>=(per CommonMark spec) rather than exact-match, so a longer closing fence correctly ends the code block and does not leak placeholder-protection into subsequent content.Angle brackets outside code blocks continue to be sanitized normally —
<script>alert(1)</script>after a code block is still stripped.Tests
protectCodeAnglescovering: empty input, no code blocks, language-tagged fences, multiple blocks, 4-backtick fences, shorter/longer closing fences, unclosed fencesrestoreCodeAnglesSanitizetests including the exact scenario from GitHub MCP issue read appears to drop code block text in angle brackets #2202, nested generics (Map<String, List<Integer>>), and security regression tests confirming<script>tags after code blocks are still strippedAll existing tests continue to pass.
go vetandgolangci-lintclean.Known limitation
Tilde-fenced blocks (
~~~) are not handled, consistent with the existingFilterCodeFenceMetadatawhich also only recognizes backtick fences.Fixes #2202