Transcript of talk: "Rendering and laying out Unicode text"

KDE Contributor and Developer Conference

Notes taken by: Jonathan Riddell (these are not official material)

More information on the talk can be found here

This is a topic he's been looking at for the last 4 years with Qt and KDE. KHTML, Kate, KOffice and one other widget in KDE require Unicode text layout and rendering.

Unicode is a standard about 10 to 15 years old; 4.0.1 is the current version. Version 3.2 encoded about 95,000 characters, almost 70,000 of them for CJK languages. It encodes characters, not glyphs; some combinations of letters use different glyphs, which creates more problems. It encodes around 50 different writing systems, all in use for modern languages, so they have to be implemented.

A character is a Unicode item, the smallest unit you can use for text processing. It can be up to 2 QChars wide.
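The "2 QChars wide" point can be illustrated outside Qt. The sketch below assumes that a QChar corresponds to one 16-bit UTF-16 code unit and uses plain Python rather than any Qt API; the function names are made up for illustration.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to store one character in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its high and low surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(utf16_units("A"))                           # a Latin letter fits in one unit
print(utf16_units("\U0001D11E"))                  # MUSICAL SYMBOL G CLEF needs two
print([hex(u) for u in surrogate_pair(0x1D11E)])  # the two units that encode it
```

Any code that indexes a string unit by unit therefore has to take care not to split such a pair.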

A glyph is a visual representation of a character, a group of characters or part of a character. A diacritic or combining mark is an accent or similar mark attached to a base character.
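As a small illustration of combining marks (not something shown in the talk), the sketch below uses Python's standard unicodedata module to split an accented letter into a base character plus a combining mark. The helper name `decompose` is hypothetical.

```python
import unicodedata


def decompose(ch: str) -> list[str]:
    """Return the Unicode names of the code points in the canonical decomposition."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]


# U+00E9 (e with acute) is one character but decomposes into base + mark.
print(decompose("\u00e9"))
# A non-zero combining class identifies a combining mark.
print(unicodedata.combining("\u0301"))
```

Both forms should render as the same glyph, which is one reason characters and glyphs cannot be treated as the same thing.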

Western scripts include Latin and Cyrillic. These are the easiest to deal with: mostly a one-to-one mapping between characters and glyphs, and no special treatment needed for rendering. Line breaks are at word boundaries. There is some glyph complication with "fi" or "fl", which are rendered as one glyph. Accents require work to make them look exact. Kerning (playing with spacing) is another issue. This is not working in Qt 3 but should do in Qt 4.

CJK (Chinese, Japanese and Korean) has a huge number of characters. Lots of characters are common to the three languages and mean the same in each. There is a 1:1 character to glyph mapping. Line breaking rules are different from Latin: there are no spaces, line breaks are usually allowed after every character, and certain characters disallow line breaks before or after them, e.g. no line breaks after punctuation. Left to right has been adopted for most computer uses.
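The line breaking rule above can be sketched roughly as follows. The two character sets are small illustrative assumptions, not the full rule set and not what Qt uses; the idea is only that a break is allowed between any two characters unless one of them forbids it.

```python
# Illustrative, incomplete sets (assumptions for this sketch).
NO_BREAK_BEFORE = set("。、，！？）」』")  # must not start a line
NO_BREAK_AFTER = set("（「『")             # must not end a line


def break_opportunities(text: str) -> list[int]:
    """Indices i where a line break may be placed between text[i-1] and text[i]."""
    return [
        i
        for i in range(1, len(text))
        if text[i] not in NO_BREAK_BEFORE and text[i - 1] not in NO_BREAK_AFTER
    ]


print(break_opportunities("日本語。"))  # no break is offered before the full stop
```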

Semitic languages are written from right to left: Hebrew, Arabic, Urdu and others. Numbers are still written left to right. In Arabic, characters change shape depending on context (similar to capital letters but with different rules). Some combinations change the glyph completely.
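To make the contextual shapes concrete, the sketch below lists the isolated, initial, medial and final forms that Unicode records for one Arabic letter in its compatibility "presentation forms" block. This is only an illustration using Python's unicodedata module; a shaping engine would normally select glyphs from the font's own tables instead of substituting these code points. The function name is hypothetical.

```python
import unicodedata


def contextual_forms(letter: str) -> dict[str, str]:
    """Map form name -> presentation-form code point for a single Arabic letter."""
    target = f"{ord(letter):04X}"
    forms = {}
    # Arabic Presentation Forms-B block, U+FE70..U+FEFC.
    for cp in range(0xFE70, 0xFEFD):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Entries look like "<initial> 0628"; ligatures have more fields and are skipped.
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = f"U+{cp:04X}"
    return forms


print(contextual_forms("\u0628"))  # ARABIC LETTER BEH
```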

The Unicode bidirectional algorithm: strings are stored in logical (spoken) order, and the algorithm reorders them from logical order to the visual representation on the screen. So a string stored in memory as "hello WORLD", where WORLD is in a Semitic language, would be rendered as "hello DLROW" if the paragraph is left to right. If the paragraph is right to left it is rendered as "DLROW hello". The ordering of a line depends on the contents of the previous line.
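The "hello WORLD" example can be reproduced with a heavily simplified sketch of the reordering step. It follows the talk's convention that upper case stands in for right-to-left letters, treats everything else as neutral, and ignores numbers, explicit embeddings and mirroring, so it is nowhere near the full algorithm. All names are made up for illustration.

```python
def strong_direction(ch: str):
    """Upper case stands in for RTL letters, lower case for LTR; others are neutral."""
    if ch.isupper():
        return "R"
    if ch.islower():
        return "L"
    return None


def visual_order(text: str, base_rtl: bool = False) -> str:
    base_level = 1 if base_rtl else 0
    base_dir = "R" if base_rtl else "L"
    dirs = [strong_direction(c) for c in text]
    n = len(text)

    # Neutrals take the surrounding direction if both sides agree, else the base one.
    resolved = list(dirs)
    i = 0
    while i < n:
        if dirs[i] is None:
            j = i
            while j < n and dirs[j] is None:
                j += 1
            before = dirs[i - 1] if i > 0 else base_dir
            after = dirs[j] if j < n else base_dir
            fill = before if before == after else base_dir
            for k in range(i, j):
                resolved[k] = fill
            i = j
        else:
            i += 1

    # Even levels are left to right, odd levels are right to left.
    levels = []
    for d in resolved:
        if d == "L":
            levels.append(base_level if base_level % 2 == 0 else base_level + 1)
        else:
            levels.append(base_level if base_level % 2 == 1 else base_level + 1)

    # From the highest level down to 1, reverse each run at that level or above.
    chars = list(text)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[i] >= level:
                j = i
                while j < n and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])
                i = j
            else:
                i += 1
    return "".join(chars)


print(visual_order("hello WORLD"))                 # left-to-right paragraph
print(visual_order("hello WORLD", base_rtl=True))  # right-to-left paragraph
```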

Thai and Lao include combining marks and letters stacked above or below other letters. They are written without spaces, but line breaks are allowed only at word boundaries, so a word dictionary is required to calculate line breaks.
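A dictionary-based break finder can be sketched as a greedy longest-match segmenter. This is an assumed, simplified approach for illustration only (the talk did not say which method is used), and it uses Latin letters with a toy dictionary as a stand-in for Thai text and a real word list.

```python
def word_breaks(text: str, dictionary: set[str]) -> list[int]:
    """Return indices where a line break is allowed, using greedy longest match."""
    max_len = max(map(len, dictionary))
    breaks = []
    i = 0
    while i < len(text):
        # Try the longest candidate word first.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                break
        else:
            length = 1  # unknown character: step over it on its own
        i += length
        if i < len(text):
            breaks.append(i)
    return breaks


# Toy stand-in for unspaced text.
print(word_breaks("thecatsat", {"the", "cat", "sat", "he"}))
```

Greedy matching can pick the wrong segmentation when words overlap, which is part of why this problem is harder than it looks.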

Most complex are the Indic languages. The basic building block is a syllable, with complex shaping rules within syllables. Cursor movement is bound to syllable boundaries. Delete deletes the syllable after the cursor, but backspace deletes letter by letter.
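The different behaviour of delete and backspace can be sketched as below. The syllable rule used here is a deliberate simplification for Devanagari only (marks attach to the preceding letter, and a letter following a virama joins the same syllable); real syllable analysis has more cases. Function names are hypothetical and the cursor is counted in code points.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA


def syllables(text: str) -> list[str]:
    """Split text into simplified syllable clusters."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        if clusters and (is_mark or clusters[-1].endswith(VIRAMA)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def delete_forward(text: str, cursor: int) -> str:
    """Delete the whole syllable that starts at the cursor."""
    pos = 0
    for cluster in syllables(text):
        if pos == cursor:
            return text[:pos] + text[pos + len(cluster):]
        pos += len(cluster)
    return text


def backspace(text: str, cursor: int) -> str:
    """Delete only the single code point before the cursor."""
    return text[:cursor - 1] + text[cursor:] if cursor > 0 else text


# KA + VIRAMA + SSA + VOWEL SIGN I form one syllable, followed by NA.
text = "\u0915\u094d\u0937\u093f\u0928"
print([len(s) for s in syllables(text)])
print(len(delete_forward(text, 0)), len(backspace(text, 4)))
```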

The consequences are that line by line layout is impossible because previous lines can affect the layout; you have to look at the whole paragraph. The width of a character changes depending on context, so you always have to measure the length of a whole string. The concept of fixed width fonts loses all meaning for complex languages (Arabic, Indic). Drawing substrings (e.g. for selections) has to be done by setting clipping on the painter and drawing the complete string.

So layout of text is a complex process requiring specialist knowledge, but this is Qt, so the programmer using it should not have to be aware of the complexities, which should be encapsulated in an API.

Qt's process is: take an input string with formatting information (a font) and turn it into a list of glyph and position pairs. Bidi and script analysis of the whole paragraph is done. Syllable boundaries in Indic scripts are determined and reordering transformations applied within them. All Unicode characters are converted to glyph indices using the font's cmap table. Transformations are applied to the glyphs, there were various other bits I missed, then the last part is positioning. Widths are changed for kerning.
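The last steps (character to glyph index, then positioning with kerning) can be sketched with toy data. The tables below are invented for illustration and stand in for a real font's cmap, advance widths and kerning pairs; this is not Qt's code and it skips bidi, script analysis and glyph substitution entirely.

```python
# Hypothetical font data for three characters.
CMAP = {"A": 1, "V": 2, "a": 3}          # character -> glyph index
ADVANCE = {1: 10, 2: 10, 3: 8}           # glyph index -> advance width
KERNING = {(1, 2): -2, (2, 1): -2}       # (left glyph, right glyph) -> adjustment


def shape(text: str):
    """Return ([(glyph, x), ...], total_width) for a simple one-to-one script."""
    glyphs = [CMAP[c] for c in text]
    positions = []
    x = 0
    previous = None
    for glyph in glyphs:
        if previous is not None:
            x += KERNING.get((previous, glyph), 0)  # tighten or loosen the pair
        positions.append((glyph, x))
        x += ADVANCE[glyph]
        previous = glyph
    return positions, x


print(shape("AVa"))
```

Even in this toy version the width of "AV" is not the sum of the widths of "A" and "V", which is why whole strings have to be measured.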

Qt 3 has a private API which won't change. The QTextLayout class can lay out a paragraph of text using a specified font. Layout is done on a line by line basis. QTextItem handles part of a line and converts cursor positions. KDE needs to use this API in a few places.

Qt 4 has a public API with QTextLayout, similar to Qt 3, and QTextLine, which describes a laid-out line of text. QTextDocument and about 10 related classes form a rich text API.

Code examples were given.

We note that since this presentation was done in a Qt program all the characters were shown correctly and properly laid out.

His final slide had lots of Thank yous in various scripts. It uses one logical QFont but that loads "physical" fonts as needed.

Does it support maths output? Somewhat, but maths needs context such as MathML.