
Version 4 (modified by noz, 8 years ago)

Updated remaining ports

UTF-8 and Wide-character handling in Angband

Background

As of the sequence of commits from 589d1d3 to c91ae22 (plus a few bug fixes and cleanups since), the Angband source can handle UTF-8 characters in its edit files, and has dropped the previous hacky mechanism for generating accented characters (with sequences like ["e] for ë).

There are a number of changes that had to happen for this to work, and this page aims to document them.

Locale

Angband now needs to be run within a UTF-8 capable locale, and this is checked in main.c:main(), as:

if (setlocale(LC_CTYPE, "")) {
	/* Require UTF-8 */
	if (strcmp(nl_langinfo(CODESET), "UTF-8") != 0)
		quit("Angband requires UTF-8 support");
}

Files

All the edit files are now expected to be in the UTF-8 encoding, and can have accented characters inserted directly in them. Output files such as spoilers, character dumps and other text output are now in UTF-8 as well.

(What about screen dumps?)


Internals

"Canvas"

The internal representation of the main (and other terminal) screen(s) is two arrays, one of "attributes" (byte attr) and one of "characters" (wchar_t char). The characters to be displayed are stored as Unicode, in whatever the platform's native wchar_t representation of a Unicode character is. When strings are printed to the screen, they are converted from UTF-8 (as char *) to wide characters (wchar_t *) using z-term.c:Term_mbstowcs(). This allows the conversion function to be overridden if a particular platform needs it.
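The override mechanism described above can be sketched as a function pointer that is consulted before falling back to the C library. This is only an illustration: the names mbcs_hook_fn, mbcs_hook, text_mbstowcs and ascii_only_hook are hypothetical and are not taken from z-term.c, and the real hook signature may differ.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical hook type with the same shape as mbstowcs(). */
typedef size_t (*mbcs_hook_fn)(wchar_t *dest, const char *src, size_t n);

/* Hypothetical per-port override; NULL means "use the C library". */
static mbcs_hook_fn mbcs_hook = NULL;

/* Convert a multibyte string to wide characters, preferring the hook. */
size_t text_mbstowcs(wchar_t *dest, const char *src, size_t n)
{
	if (mbcs_hook)
		return mbcs_hook(dest, src, n);
	return mbstowcs(dest, src, n);
}

/*
 * Example override that widens each byte unchanged. It is only correct
 * for ASCII input and exists purely to show how a port plugs in.
 */
size_t ascii_only_hook(wchar_t *dest, const char *src, size_t n)
{
	size_t i = 0;

	while (i < n && src[i] != '\0') {
		dest[i] = (wchar_t)(unsigned char)src[i];
		i++;
	}
	if (i < n)
		dest[i] = L'\0';
	return i;
}
```

A port would assign its own function to the hook once during start-up; every later conversion then goes through it without the callers changing.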

When they are displayed, the wide chars are put on the screen in different ways, depending on the port (see below).

In the case of graphics tiles, things are slightly different. The "character" is still stored as a wchar_t, but only the bottom 7 bits are used, as an index into a large 2-D bitmap containing the tiles along the x axis and the attributes ("colour") on the y axis. (Check this bit). A tile used to be indicated by the top bit being set in both the attribute and the character, but it is now indicated only by the top bit of the attribute. In the original design, I had hoped to treat tiles as a special case of a font and allow full Unicode character support, but the tiles are multi-coloured, so this cannot work.
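As a rough illustration of the tile encoding described above, the following sketch tests the top bit of the attribute and masks off the low 7 bits of the character. The names TILE_FLAG, TILE_MASK, cell_is_tile and tile_lookup are hypothetical, and using the low 7 bits of the attribute as the row is an assumption (the text itself flags this part as needing a check).

```c
#include <stdbool.h>
#include <stdint.h>
#include <wchar.h>

#define TILE_FLAG 0x80	/* assumed: top bit of the 8-bit attribute */
#define TILE_MASK 0x7F	/* low 7 bits used as an index */

/* Position of a tile within the 2-D tile bitmap (hypothetical type). */
struct tile_pos {
	int col;	/* x axis: tile index, from the character */
	int row;	/* y axis: from the attribute (assumption) */
};

/* A cell holds a tile when the top bit of its attribute is set. */
bool cell_is_tile(uint8_t attr)
{
	return (attr & TILE_FLAG) != 0;
}

/* Only the bottom 7 bits of each value contribute to the lookup. */
struct tile_pos tile_lookup(uint8_t attr, wchar_t ch)
{
	struct tile_pos p;

	p.col = (int)((unsigned long)ch & TILE_MASK);
	p.row = (int)(attr & TILE_MASK);
	return p;
}
```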

Textblock

Textblocks (in z-textblock.c) also have wchar_t as the internal representation of the displayed characters. These are then copied directly onto the canvas when the textblock is displayed.

Parsers

In reading the edit files, all strings are maintained in UTF-8 until needed. Glyphs are read in directly to a wchar_t type.


Ports

This section lists port-specific changes and what the individual ports do with the wide-char representation of the display characters to get them onto the display.

SDL

Wide chars on the canvas are converted to a UTF-8 string using wcstombs(), and then rendered to the screen with a pre-computed TTF_Font by calling TTF_RenderUTF8_Solid() in the function sdl_FontDraw().
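The wcstombs() step can be sketched on its own, without any SDL calls. The helper name wide_to_multibyte is hypothetical. Note that wcstombs() produces UTF-8 only when the current locale's codeset is UTF-8, which the locale check at start-up is meant to guarantee.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Convert a wide string to the locale's multibyte encoding.
 * Always NUL-terminates dest when dest_size > 0.
 * Returns the number of bytes written (excluding the terminator),
 * or (size_t)-1 if a character cannot be converted.
 */
size_t wide_to_multibyte(char *dest, size_t dest_size, const wchar_t *src)
{
	size_t len;

	if (dest_size == 0)
		return (size_t)-1;

	/* Leave one byte spare: wcstombs() does not terminate a full buffer */
	len = wcstombs(dest, src, dest_size - 1);
	if (len == (size_t)-1) {
		dest[0] = '\0';
		return len;
	}
	dest[len] = '\0';
	return len;
}
```

The resulting char buffer is what would then be handed to the text renderer.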

X11

Wide chars on the canvas are drawn directly to the window using XwcDrawImageString() in Infofnt_text_std(). Fonts are now rendered using XFontSets, rather than the previous XFontStruct.

GCU

The GCU port now requires the "wide" version of ncurses (i.e. ncursesw), and will fail to build if this is not present.

Wide characters from the canvas are written directly to the screen using mvwaddnwstr() in Term_text_gcu().

Some of the default symbols have changed as follows:

Feature        From                   To
Floor          Period '.' (U+002E)    MIDDLE DOT '·' (U+00B7)
Magma          ?? (0x03)              MEDIUM SHADE '▒' (U+2592)
Quartz Vein    ?? (0x03)              LIGHT SHADE '░' (U+2591)
Granite Wall   ?? (0x02)              DARK SHADE '▓' (U+2593)

This has the added advantage that standard fonts can be used, and it is not necessary to resort to hacking fonts to get "solid walls".
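For reference, the new defaults in the table can be written as wide-character constants. This is only a sketch of how the code points map to wchar_t values: the struct, the array and the lookup function are hypothetical, and the real defaults are not necessarily stored this way.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical pairing of a feature name with its default glyph. */
struct feat_glyph {
	const char *feature;
	wchar_t glyph;
};

/* Code points taken from the "To" column of the table. */
static const struct feat_glyph gcu_defaults[] = {
	{ "Floor",        0x00B7 },	/* MIDDLE DOT */
	{ "Magma",        0x2592 },	/* MEDIUM SHADE */
	{ "Quartz Vein",  0x2591 },	/* LIGHT SHADE */
	{ "Granite Wall", 0x2593 },	/* DARK SHADE */
};

/* Return the default glyph for a feature, or 0 if it is not listed. */
wchar_t gcu_default_glyph(const char *feature)
{
	size_t i;

	for (i = 0; i < sizeof(gcu_defaults) / sizeof(gcu_defaults[0]); i++) {
		if (strcmp(gcu_defaults[i].feature, feature) == 0)
			return gcu_defaults[i].glyph;
	}
	return 0;
}
```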

Windows

Windows does not properly support UTF-8 through the standard C library locale routines, so the term->mbcs_hook function is defined to use the Windows-native MultiByteToWideChar() function, and the external files are assumed to be in UTF-8. Wide chars from the canvas are written directly to the screen using ExtTextOutW() in Term_text_win().

OSX

Work in progress (I believe).

GTK

No work has been done to change the GTK port to support UTF-8, so it will probably not even compile.

Android

There are significant problems in adapting an Android port to this change, as the support for wide chars in older versions of Android is lacking. I understand that wchar_t is implemented as an 8-bit quantity, and some of the support functions such as mbstowcs() are missing or broken. It may be possible to work around this by overriding the conversion done in Term_mbstowcs(). Please update this if you make any significant progress.
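If a platform's mbstowcs() cannot be trusted, one possible replacement is a hand-written UTF-8 decoder with the same shape. The sketch below is not taken from any existing port; the name utf8_to_wide is hypothetical, and it assumes wchar_t is wide enough to hold the decoded code point, which by the description above would not hold where wchar_t is only 8 bits.

```c
#include <stddef.h>
#include <wchar.h>

/*
 * Decode UTF-8 from src into at most n wide characters in dest.
 * Returns the number of wide chars written (excluding any terminator),
 * or (size_t)-1 on malformed input. Overlong forms and surrogates are
 * not rejected; this is a minimal sketch, not a validating decoder.
 */
size_t utf8_to_wide(wchar_t *dest, const char *src, size_t n)
{
	const unsigned char *s = (const unsigned char *)src;
	size_t count = 0;

	while (*s != '\0' && count < n) {
		unsigned long cp;
		int extra;

		/* The lead byte gives the sequence length */
		if (*s < 0x80) {
			cp = *s;
			extra = 0;
		} else if ((*s & 0xE0) == 0xC0) {
			cp = *s & 0x1F;
			extra = 1;
		} else if ((*s & 0xF0) == 0xE0) {
			cp = *s & 0x0F;
			extra = 2;
		} else if ((*s & 0xF8) == 0xF0) {
			cp = *s & 0x07;
			extra = 3;
		} else {
			return (size_t)-1;	/* stray continuation or bad lead */
		}
		s++;

		/* Each continuation byte contributes 6 bits */
		while (extra-- > 0) {
			if ((*s & 0xC0) != 0x80)
				return (size_t)-1;	/* truncated sequence */
			cp = (cp << 6) | (*s & 0x3F);
			s++;
		}
		dest[count++] = (wchar_t)cp;
	}
	if (count < n)
		dest[count] = L'\0';
	return count;
}
```

A function like this could be installed as the conversion override, so the rest of the code keeps calling the same entry point.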