Draft: Add practical UTF-8 support across terminal input/output paths #2416

Closed
Kondrashka177 wants to merge 5 commits into cc-tweaked:mc-1.20.x from Kondrashka177:wip/unicode-support

@Kondrashka177

Summary

This PR is an attempt to improve practical UTF-8 support in CC: Tweaked's terminal-related paths.

The goal here is not to claim that full Unicode support is solved, but to provide a working implementation for the most common user-facing cases inside CC itself: terminal output, terminal input, text editing, monitor rendering, and related ROM paths.

I am opening this as a draft because I do not consider the design questions around backwards compatibility fully resolved, and I want to present the implementation and tested behaviour clearly rather than oversell it as a finished universal solution.

What this changes

This branch updates multiple terminal and ROM text paths so that UTF-8 text can be handled correctly in common scenarios.

In practice, this covers the following paths:

  • terminal text input/output
  • term.write
  • term.blit
  • terminal/window text storage and rendering
  • read()
  • edit.lua
  • monitor rendering
  • Lua REPL output/error handling paths
  • cc.pretty
  • cc.strings
  • pastebin put/get
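
As an illustration of why these paths need attention, here is a minimal Python sketch (not code from this branch) of the gap between byte length and character count. It assumes a blit-style call that, like CC: Tweaked's documented `term.blit`, requires the text and both colour strings to have the same length. `utf8_byte_length` and `blit_lengths_match` are hypothetical helpers, not part of any CC API.

```python
def utf8_byte_length(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def blit_lengths_match(text: str, fg: str, bg: str, *, count_bytes: bool) -> bool:
    """Hypothetical length check modelled on a blit-style call.

    count_bytes=True mimics a byte-oriented implementation;
    count_bytes=False counts Unicode code points instead.
    """
    text_len = utf8_byte_length(text) if count_bytes else len(text)
    return text_len == len(fg) == len(bg)


text = "Привет"          # 6 Cyrillic letters, 2 bytes each in UTF-8
fg = "0" * len(text)     # one colour code per visible character
bg = "f" * len(text)

# A byte-oriented check sees 12 "characters" against 6 colour codes.
byte_oriented_ok = blit_lengths_match(text, fg, bg, count_bytes=True)

# A code-point-aware check sees 6 characters against 6 colour codes.
codepoint_ok = blit_lengths_match(text, fg, bg, count_bytes=False)
```

The same mismatch applies to anything that indexes or truncates text by byte, such as cursor movement or line wrapping.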

Tested behaviour

The following scenarios have been tested and are working in this branch:

  • UTF-8 output in the normal terminal
  • UTF-8 input through read()
  • term.write
  • term.blit
  • window.write
  • window.blit
  • editing UTF-8 text in edit
  • rendering UTF-8 text on monitors
  • Lua REPL no longer failing with an "Invalid UTF-8 text" error in the tested cases
  • pastebin put/get with UTF-8 content

Important limitations

This branch does not claim to solve all Unicode issues.

Known limitations include:

  • complex emoji sequences
  • flags
  • ZWJ-based grapheme clusters
  • the broader legacy charset/backwards-compatibility problem
  • cases where existing byte-oriented behaviour may be relied upon by older programs
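
To make the emoji-related limitations concrete, the following Python sketch (illustrative only, not taken from this branch) shows that a ZWJ sequence and a flag are each made of several code points, even though they are usually drawn as a single glyph. Handling text per code point is therefore not enough to treat them as one cell; `describe` is a hypothetical helper.

```python
from typing import Tuple

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Man + ZWJ + woman + ZWJ + girl: usually rendered as one "family" glyph.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

# Two regional indicator symbols (D, E): usually rendered as one flag.
flag = "\U0001F1E9\U0001F1EA"


def describe(text: str) -> Tuple[int, int]:
    """Return (code point count, UTF-8 byte count) for the text."""
    return len(text), len(text.encode("utf-8"))


family_counts = describe(family)  # 5 code points, 18 bytes
flag_counts = describe(flag)      # 2 code points, 8 bytes
```

Grouping these code points into one user-perceived character requires grapheme cluster segmentation, which is the part this branch does not attempt.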

In other words, this branch is best described as a practical UTF-8 implementation for common CC text workflows, not a complete and compatibility-perfect Unicode redesign.

Backwards compatibility

I understand the main concern here is not just "can UTF-8 be made to work", but whether it can be introduced without breaking long-standing byte-oriented assumptions in CC programs and internal behaviour.
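
The conflict can be sketched in a few lines of Python. This is an illustration under an assumption, not a description of CC's internals: Latin-1 stands in for a legacy single-byte charset, and `decode_utf8_or_none` is a hypothetical helper. The same bytes mean different things depending on whether a program treats them as single-byte characters or as UTF-8.

```python
from typing import Optional


def decode_utf8_or_none(raw: bytes) -> Optional[str]:
    """Decode raw bytes as UTF-8, returning None if they are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


# What an older, byte-oriented program might emit for "café":
# one byte per character, with 0xE9 for the accented letter.
legacy = b"caf\xe9"

# What a UTF-8-aware program emits for the same word.
modern = "café".encode("utf-8")  # b"caf\xc3\xa9"

# Reading legacy bytes as UTF-8 fails: a lone 0xE9 is not a valid sequence.
legacy_as_utf8 = decode_utf8_or_none(legacy)

# Reading UTF-8 bytes one byte per character yields mojibake: 5 characters, not 4.
modern_as_single_byte = modern.decode("latin-1")
```

Whichever interpretation the terminal picks, one of the two programs sees different output than before, which is the core of the compatibility question.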

I do not want to pretend this draft fully solves that design problem.

This PR is therefore intended as:

  • a working implementation of common user-facing cases
  • a concrete basis for discussion
  • a demonstration of what currently works well in practice

If this direction is considered fundamentally incompatible with the project's compatibility goals, that is understandable. In that case, this branch may still be useful as a reference implementation or experiment.

Why submit this anyway

Even with the compatibility concerns, I think there is still value in showing the implementation and the practical results.

The issue affects real users in day-to-day use, especially in non-English environments, and this branch demonstrates that a substantial part of the user-facing experience can be improved inside CC itself.

Notes

I am very open to feedback on scope, structure, or whether this is better treated as an experiment/prototype rather than a mergeable change in its current form.
