Version 3 (modified by noz, 8 years ago)

Added detail on internals and SDL port

UTF-8 and Wide-character handling in Angband

Background

As of the sequence of commits from 589d1d3 to c91ae22 (plus a few bug fixes and cleanups since), the Angband source can handle UTF-8 characters in its edit files, and has dropped the previous hacky mechanism for generating accented characters (with sequences like ["e] for ë).

There are a number of changes that had to happen for this to work, and this page aims to document them.

Locale

Angband must now be run in a UTF-8 capable locale. This is checked in main.c:main():

if (setlocale(LC_CTYPE, "")) {
	/* Require UTF-8 */
	if (strcmp(nl_langinfo(CODESET), "UTF-8") != 0)
		quit("Angband requires UTF-8 support");
}
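Note that the fragment above only quits when setlocale() succeeds and the codeset is not UTF-8; if setlocale() itself returns NULL, execution continues. As a minimal sketch (not the code in main.c), the same check can be wrapped in a helper that treats both cases as failure. The name locale_is_utf8() is hypothetical, and nl_langinfo() is POSIX-only, so this assumes a POSIX platform.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if the environment's locale could be
 * selected and its codeset is UTF-8, and 0 otherwise (including the
 * case where setlocale() fails).
 */
int locale_is_utf8(void)
{
	/* Adopt the locale named by the environment (LANG, LC_CTYPE, ...) */
	if (!setlocale(LC_CTYPE, ""))
		return 0;

	/* CODESET names the character encoding of the selected locale */
	return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A caller could then write something like if (!locale_is_utf8()) quit("Angband requires UTF-8 support");.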

Files

All the edit files are now expected to be in the UTF-8 encoding, and can have accented characters inserted in them directly. Output files such as spoilers, character dumps and other text output are now in UTF-8 as well.

(What about screen dumps?)

Internals

"Canvas"

The main screen (and each other terminal screen) is represented internally as two arrays, one of "attributes" (byte attr) and one of "characters" (wchar_t char). The characters to be displayed are stored as Unicode, in whatever the platform's native wchar_t representation of a Unicode character is. When strings are printed to the screen, they are converted from UTF-8 (as char *) to wide characters (wchar_t *) using z-term.c:Term_mbstowcs(). Routing every conversion through this one function allows it to be overridden if a particular platform needs it.
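The idea of a replaceable conversion function can be sketched as follows. This is an illustration only, not the code in z-term.c: the names mbcs_hook, term_mbstowcs_sketch() and utf8_to_wcs() are hypothetical. The default path calls the standard mbstowcs(), which depends on the current locale; the replacement is a small hand-written UTF-8 decoder of the kind a port might install when the C library cannot be relied on.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical hook type with the same shape as mbstowcs() */
typedef size_t (*mbstowcs_fn)(wchar_t *dest, const char *src, size_t n);

/* Hypothetical hook: NULL means "use the C library" */
mbstowcs_fn mbcs_hook = NULL;

/* Convert a multibyte string to wide chars, via the hook if one is set */
size_t term_mbstowcs_sketch(wchar_t *dest, const char *src, size_t n)
{
	if (mbcs_hook)
		return mbcs_hook(dest, src, n);
	return mbstowcs(dest, src, n);
}

/*
 * Locale-independent UTF-8 decoder usable as a hook.  Writes at most n
 * wide chars to dest (which must not be NULL) and returns the number
 * written, or (size_t)-1 on a malformed sequence.  Simplified: it does
 * not reject overlong encodings, and code points above 0xFFFF are
 * truncated where wchar_t is 16 bits wide.
 */
size_t utf8_to_wcs(wchar_t *dest, const char *src, size_t n)
{
	const unsigned char *s = (const unsigned char *)src;
	size_t count = 0;

	while (*s) {
		unsigned long cp;
		int extra;

		/* Buffer full: stop without a terminator, like mbstowcs() */
		if (count == n)
			return count;

		/* The lead byte gives the number of continuation bytes */
		if (*s < 0x80) {
			cp = *s;
			extra = 0;
		} else if ((*s & 0xE0) == 0xC0) {
			cp = *s & 0x1F;
			extra = 1;
		} else if ((*s & 0xF0) == 0xE0) {
			cp = *s & 0x0F;
			extra = 2;
		} else if ((*s & 0xF8) == 0xF0) {
			cp = *s & 0x07;
			extra = 3;
		} else {
			return (size_t)-1;
		}
		s++;

		/* Each continuation byte contributes six more bits */
		while (extra-- > 0) {
			if ((*s & 0xC0) != 0x80)
				return (size_t)-1;
			cp = (cp << 6) | (*s & 0x3F);
			s++;
		}

		dest[count++] = (wchar_t)cp;
	}

	if (count < n)
		dest[count] = L'\0';
	return count;
}
```

A port would set mbcs_hook = utf8_to_wcs; once at start-up, after which every caller of term_mbstowcs_sketch() picks up the replacement without further changes.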

When the wide chars are displayed, they are put on the screen in different ways, depending on the port (see below). For graphics tiles, things are slightly different. The "character" is still stored as a wchar_t, but only the bottom 7 bits are used, as an index into a large 2-D bitmap that holds the tiles along the x axis and the attributes ("colour") along the y axis. (Check this bit) In the original design, I had hoped to treat tiles as a special case of a font and so allow full Unicode character support, but the tiles are multi-coloured, so this cannot work.
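The tile lookup described above amounts to a small piece of arithmetic, sketched here. The struct and function names are hypothetical, and masking the attribute with 0x7F in the same way as the character is an assumption; the text only states the 7-bit mask for the character, and itself flags this description as needing a check.

```c
#include <wchar.h>

/* Hypothetical: pixel position of a tile's top-left corner in the bitmap */
struct tile_pos {
	int x;
	int y;
};

/*
 * Map a (character, attribute) pair to a position in the tile bitmap.
 * The low 7 bits of the character select the column; the low 7 bits of
 * the attribute (an assumption) select the row.
 */
struct tile_pos tile_source_pos(wchar_t ch, unsigned char attr,
								int tile_w, int tile_h)
{
	struct tile_pos pos;

	pos.x = (int)(ch & 0x7F) * tile_w;
	pos.y = (int)(attr & 0x7F) * tile_h;
	return pos;
}
```

Because only 7 bits of the character are used, at most 128 tile columns can be addressed this way, which is why the full Unicode range is not available for tiles.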

Textblock

Textblocks (in z-textblock.c) also have wchar_t as the internal representation of the displayed characters. These are then copied directly onto the canvas when the textblock is displayed.

Parsers

When the edit files are read, all strings are kept in UTF-8 until needed. Glyphs are read directly into a wchar_t.
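Reading a single glyph into a wchar_t could look roughly like this. It is a sketch using the standard mbtowc(), not the parser's actual code, and parse_glyph() is a hypothetical name. Like mbstowcs(), mbtowc() decodes according to the current locale, so non-ASCII glyphs rely on the UTF-8 locale required above.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: decode the first character of a multibyte field
 * into *out.  Returns the number of bytes consumed, or -1 if the field
 * is empty or does not start with a valid character.
 */
int parse_glyph(const char *field, wchar_t *out)
{
	int len;

	if (field[0] == '\0')
		return -1;

	/* Reset any conversion state left over from earlier calls */
	mbtowc(NULL, NULL, 0);

	/* Examine no more bytes than the field actually holds */
	len = mbtowc(out, field, strlen(field));
	return len > 0 ? len : -1;
}
```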

Ports

This section lists port-specific changes and what the individual ports do with the wide-char representation of the display characters to get them onto the display.

SDL

Wide chars on the canvas are converted to a UTF-8 string with wcstombs() and then rendered to the screen in the function sdl_FontDraw(), which calls TTF_RenderUTF8_Solid() with a pre-computed TTF_Font.
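The conversion half of that step can be sketched without SDL. The helper name wide_to_mb() is hypothetical and this is not the body of sdl_FontDraw(). wcstombs() produces UTF-8 only when the current locale is a UTF-8 one, which the start-up check described under Locale is meant to ensure. The rendering call is shown only as a comment, since it needs SDL_ttf and a loaded font.

```c
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert a NUL-terminated wide string to the
 * locale's multibyte encoding.  Always NUL-terminates out on success.
 * Returns the number of bytes written (excluding the terminator), or
 * -1 if a character cannot be converted or out_size is zero.
 */
int wide_to_mb(const wchar_t *text, char *out, size_t out_size)
{
	size_t len;

	if (out_size == 0)
		return -1;

	/* Leave one byte spare so the result can always be terminated */
	len = wcstombs(out, text, out_size - 1);
	if (len == (size_t)-1)
		return -1;

	out[len] = '\0';
	return (int)len;
}

/*
 * In an SDL port the result would then be handed to SDL_ttf, e.g.:
 *
 *     SDL_Surface *surface = TTF_RenderUTF8_Solid(font, out, colour);
 *
 * where font is a TTF_Font * opened earlier and colour is an SDL_Color.
 */
```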

X11

GCU

Windows

OSX

GTK

Android