Using Unicode characters on the DVB-MHP platform.

William Overington

Thursday 16 January 2003

In January 2003 I started a new thread in the discussion forum at http://forum.mhp.org entitled "Using Unicode characters on the DVB-MHP platform.".

I thought that it might be interesting to add a transcript of the posting into this sequence of documents.

The text below is the posting as it appeared.

The transcript consists of the date and time recorded with the posting, followed by the posting itself.

In relation to languages of the Indian subcontinent, further reading of Chapter 9 of The Unicode Standard, Version 3.0, which is available in .pdf format from the http://www.unicode.org website, leads me to think that the character sequences to be replaced may typically be three or more Unicode characters in length, rather than the pairs of characters which I suggest in the transcript. This is because virama (vowel omission sign) characters will also be encoded in the incoming character sequences. However, the .etf eutocode typography file format can handle this correctly, as the .etf format is not tied to sequences of any particular length.

2003/01/04 14:06

In view of the possible problems of displaying some Unicode sequences upon the DVB-MHP platform, I have been thinking of a way to program round the problems within a Java application. I have devised what I feel is an interesting solution which is capable of wide application. DVB-MHP content developers could use it almost as if it were part of the DVB-MHP standard if they so choose (even though it is not part of the DVB-MHP standard), yet it forces no one to do anything: anyone who chooses not to use the technique need not use it at all.

Here is a draft of what I have in mind. This consists of two types of files, .etf and .tff, both devised for the purpose. They are both, computationally, sequential access Unicode plain text files.

.etf eutocode typography file

A .etf file consists of lines of text in a sequential text file.

The last line consists of a U+EBEF character on its own.

All other lines consist of one or more (often two) Unicode characters from plane 0 followed by a U+EBEF character followed by zero or more (often one) Unicode characters from plane 0.
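As a minimal sketch of the format just described, the following Java fragment turns the lines of a .etf file into a list of search and replacement strings. It assumes that the file has already been read and decoded into one string per line. The class name EtfParser and the method name parse are hypothetical names chosen for illustration, and the code is written in present-day Java for brevity rather than for any particular DVB-MHP terminal.

```java
import java.util.ArrayList;
import java.util.List;

class EtfParser {
    // U+EBEF is the marker character of the .etf format.
    static final char MARKER = '\uEBEF';

    // Turns the lines of a .etf file into {search, replacement} pairs.
    // Reading stops at the line which holds U+EBEF on its own.
    static List<String[]> parse(List<String> lines) {
        List<String[]> rules = new ArrayList<>();
        for (String line : lines) {
            if (line.equals(String.valueOf(MARKER))) {
                break; // the last line of the file has been reached
            }
            int pos = line.indexOf(MARKER);
            if (pos < 1) {
                // No marker, or no characters before it.
                throw new IllegalArgumentException("Malformed .etf line");
            }
            // Characters before the marker are searched for; characters
            // after it (possibly none) are the replacement.
            rules.add(new String[] { line.substring(0, pos), line.substring(pos + 1) });
        }
        return rules;
    }
}
```

A line with nothing after the marker yields an empty replacement string, which matches the "zero or more" wording above.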

.tff text font file

A .tff file consists of two lines of text in a sequential file.

The first line of a .tff file contains the name of a text file or of a .uof file in which the name of a text file can be found (Please see the thread "Using the U+FFFC character on the DVB-MHP platform." for details of a .uof file).

The second line of a .tff file contains the name of a .etf file.
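The two-line layout of a .tff file could be read along the following lines. This is only a sketch: it assumes that the file contents have already been decoded into a single string, and it assumes that a .uof file can be recognised by its file name extension, which the description above does not state. The class and method names are hypothetical.

```java
import java.util.Locale;

class TffFile {
    final String sourceName; // line 1: a text file, or a .uof file naming one
    final String etfName;    // line 2: the .etf file to apply

    TffFile(String sourceName, String etfName) {
        this.sourceName = sourceName;
        this.etfName = etfName;
    }

    // Parses the decoded contents of a .tff file, which holds two lines.
    static TffFile parse(String contents) {
        String[] lines = contents.split("\r?\n");
        if (lines.length < 2) {
            throw new IllegalArgumentException("A .tff file needs two lines");
        }
        return new TffFile(lines[0], lines[1]);
    }

    // Assumption: a .uof file is recognised by its file name extension.
    boolean namesUofFile() {
        return sourceName.toLowerCase(Locale.ROOT).endsWith(".uof");
    }
}
```

A program using this sketch would call namesUofFile() to decide whether the first line names the text file directly or names a .uof file in which the text file name is to be found.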

In order to produce a display, the text file named in the .tff file, or named in the .uof file to which the .tff file refers, is opened.

In that text file the characters are all Unicode characters. There are broadly six categories of characters which could be found in my way of using the Unicode system for the DVB-MHP platform, though some of those categories may not be used in a particular text file. A category may be unused either because the particular Java program being used does not have facilities to process it or because, although the program does have the facilities, the particular application does not need them.

The six categories are as follows.

1. Unicode characters for which a glyph in a font can be used directly. This covers the vast majority of characters used for most languages and many symbols. In any particular file, some, or indeed all, of these could be from the Private Use Area if so desired, though for most applications Private Use Area characters in this group will not be used.

2. Unicode character sequences which need to be processed in order to produce a display, such as, for example, when some characters for the languages of the Indian subcontinent are used in certain pairs.

3. Surrogate pairs for a character not in plane 0.

4. Courtyard codes to signal colours and size of text.

5. Eutocode graphics.

6. Other eutocode codes which are not to be used as codes to access a font directly.

Please note that groups 4 through to 6 above are codes of my own devising within the Private Use Area.

In processing a text file which potentially contains all of the above six types of codes, the first part of the process would be to handle the codes in groups 4 to 6 above separately.

This should leave a string of Unicode characters which are intended to produce a monochrome text display.

The next stage of the process would be to take every line of the .etf file except the last line, in order, and for each such line to look for every occurrence in the string of the one or more characters specified before the U+EBEF character and to replace each occurrence with the zero or more characters specified after the U+EBEF character. In practice, the string would usually get shorter as a result of the process. The string would then be displayed using the font supplied by the DVB-MHP platform.
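The replacement stage can be sketched in Java as follows. The sketch assumes that the lines of the .etf file have already been split into pairs of search and replacement strings. The class name EtfSubstitution, the method name apply and the two sample rules are hypothetical and are there only to show the mechanism.

```java
import java.util.Arrays;
import java.util.List;

class EtfSubstitution {
    // Applies each {search, replacement} rule, in file order, to the whole
    // string. A later rule therefore sees the result of every earlier rule.
    static String apply(String text, List<String[]> rules) {
        String result = text;
        for (String[] rule : rules) {
            // String.replace substitutes every literal occurrence.
            result = result.replace(rule[0], rule[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical rules: "ab" becomes the single code U+EC05,
        // and "-" is replaced by nothing at all.
        List<String[]> rules = Arrays.asList(new String[][] { { "ab", "\uEC05" }, { "-", "" } });
        String out = apply("xab-y", rules);
        System.out.println(out.length()); // prints 3
    }
}
```

Because the rules are applied strictly in order, the order of lines in a .etf file matters: a rule for a longer sequence would probably need to come before a rule for a shorter sequence contained within it.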

The above processing could handle characters in groups 2 and 3 above. Processing surrogate pairs in this way would help in allowing a few characters from the higher planes of Unicode to be used with a 16-bit character display system.
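Since a Java string holds a character from a higher plane as a surrogate pair of two 16-bit values, the same kind of replacement can be sketched for group 3. In the fragment below the class and method names are hypothetical, and the caller chooses the Private Use Area code which stands in for the higher plane character.

```java
class SurrogateRule {
    // Replaces every occurrence of one higher plane character, which a Java
    // string stores as a surrogate pair, with one plane 0 character.
    static String replaceSupplementary(String text, int codePoint, char privateUse) {
        // Character.toChars gives the UTF-16 form: two chars for a code
        // point above U+FFFF.
        String pair = new String(Character.toChars(codePoint));
        return text.replace(pair, String.valueOf(privateUse));
    }
}
```

For example, replaceSupplementary(text, 0x20000, '\uEC10') would replace the Han character U+20000 with U+EC10, where U+EC10 is an arbitrary code picked purely for illustration.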

The effect of this processing is that some of the character codes in the final string will be for glyphs which are not in the Unicode specification. With an advanced format font these glyphs would be sealed within the font and might not be directly accessible using a single character code. However, for the DVB-MHP system as presently specified it may be that character codes will need to be used on a local basis in order to be able to access such glyphs in a font of the type in use. The choice of character codes is important in that they should be from the Private Use Area so as to avoid any clashes with uses of regular Unicode. In principle any codes from the range U+E000 through to U+F8FF could be used for any particular application. However, as this is a system which could potentially be used throughout the world on DVB-MHP platforms, perhaps some form of consistent use amongst content developers would be helpful. After some thought on the matter I am suggesting that code points for the glyphs which are not directly encoded in Unicode (for Indian languages and maybe other uses) could be assigned starting at U+EC00 and working upwards towards U+EFFF. This allows for a possible 1024 glyphs to be so encoded, which will hopefully be enough for the purpose. The work of assigning these code points may need to be done by people interested in content development for the DVB-MHP platform if it is to get done.
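The arithmetic behind the figure of 1024 can be checked with a short Java sketch, which also shows a simple test for whether a code lies in the suggested range. The class and method names are hypothetical.

```java
class GlyphRange {
    static final int FIRST = 0xEC00; // start of the suggested range
    static final int LAST = 0xEFFF;  // end of the suggested range

    // True if the code point lies within U+EC00 through to U+EFFF.
    static boolean isSpecialGlyphCode(int codePoint) {
        return codePoint >= FIRST && codePoint <= LAST;
    }

    // Number of code points in the suggested range: 0x400, that is 1024.
    static int capacity() {
        return LAST - FIRST + 1;
    }

    // Number of code points in the plane 0 Private Use Area,
    // U+E000 through to U+F8FF: 0x1900, that is 6400.
    static int privateUseAreaSize() {
        return 0xF8FF - 0xE000 + 1;
    }
}
```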

Readers may notice that the code range U+EC00 through to U+EFFF is the same range of code points which I have chosen for inputting numerical data in the eutocode graphics system. In trying to decide where to locate the code points for the unencoded glyphs for languages of the Indian subcontinent, I wanted to provide adequate space for a coding system in which codings did not overlap from one writing script to another, so that several language scripts could potentially be stored in the same font. I also wanted to avoid clashing with other uses of the Private Use Area which I am specifying in my eutocode system, which I am hoping will become widely used on the DVB-MHP platform. However, due to the relatively small size of the Private Use Area in plane 0, namely 6400 code points, after some thought I decided that the codes for the special glyphs could coexist in the same range of code points as the eutocode graphics data codes. Not only does this avoid pressure on other code points in the Private Use Area but it also provides a shield against the special glyphs being used for interchange of information in files, as for interchange in files the meanings of these code points are to be for the eutocode graphics data. Only within a special display generation section of a running Java program are these code points to have the meanings of the special glyphs. That causes no conflict with the use of the codes as data for eutocode graphics, because when they are used in that manner in an end user application they do not display as glyphs.

So, for example, suppose that some particular language has characters in regular Unicode in a range U+XX00 through to U+XX3F, where XX here represents some particular pair of hexadecimal digits.

Perhaps, for example, U+XX03 and U+XX17 when encountered together need to be changed so as to display a special glyph. Suppose that that particular special glyph is encoded, by informal agreement amongst content authors and font developers who use the DVB-MHP platform, as being at U+EC05.

Then the corresponding line in the .etf file would be as follows.

U+XX03 U+XX17 U+EBEF U+EC05

The .etf file would contain as many entries as needed for the particular language or languages which that particular Java program was set up to process.

Any attempt to use U+EC05 directly in an interchanged file would not work as U+EC05 would be treated, in a DVB-MHP program which recognized eutocode, as a data code for the integer value 5, so the proper use of Unicode in interchanged files would be supported.
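The reading of U+EC05 as the integer value 5 can be sketched in Java as below. The sketch assumes that every code in the range follows the same pattern, so that U+EC00 plus n carries the value n. The text above confirms this only for U+EC05, so the general rule is an assumption, and the class and method names are hypothetical.

```java
class EutocodeData {
    // Assumption: the code U+EC00 + n carries the integer n. The text
    // confirms only the single case U+EC05 -> 5.
    static int dataValue(char c) {
        if (c < 0xEC00 || c > 0xEFFF) {
            throw new IllegalArgumentException("Not in U+EC00..U+EFFF");
        }
        return c - 0xEC00;
    }
}
```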

In assigning various code points in the U+EC00 to U+EFFF range for these special glyphs for display, a number of the codes could be left as private use so that if someone were producing a special font for a special purpose, such as using some symbol from a non-zero plane of Unicode as a plane 0 Private Use Area character so as to be able to produce a display, while still wishing to generally agree to using eutocode code point assignments on the DVB-MHP platform, he or she could achieve that result. An example could perhaps be assigning a code point in the plane 0 Private Use Area to a rarer Han character which is formally encoded in a higher plane, in order to display a particular poem in Chinese on a DVB-MHP terminal.

The .etf file could also be used to provide such facilities as being able to process sequences which include the U+200D ZERO WIDTH JOINER character, known as ZWJ, in relation to ligatures for use in English and German Fraktur.
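A ZWJ sequence can be handled by the same replacement mechanism, as the following Java sketch suggests. It hard-codes two rules which a real program would read from a .etf file. The choice of a ct ligature and of U+EC20 as its glyph code is hypothetical, as are the class and method names.

```java
class ZwjLigature {
    static final String ZWJ = "\u200D"; // ZERO WIDTH JOINER

    static String render(String text) {
        // Rule 1: c, ZWJ, t becomes a ligature glyph at a hypothetical
        // Private Use Area code, U+EC20.
        String result = text.replace("c" + ZWJ + "t", "\uEC20");
        // Rule 2: any ZWJ which no earlier rule consumed is replaced by
        // zero characters, so the letters display unligated.
        return result.replace(ZWJ, "");
    }
}
```

The second rule uses a replacement of zero characters, so that a ZWJ request for a ligature which the font does not provide falls back to the ordinary letters.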

Because the U+EBEF character is used as the marker in a .etf file, the way is left open to extend the capabilities of a .etf file to any other matters which may arise, by using a different character as a marker for them.

William Overington

4 January 2003

Astrolabe Channel

This file is accessible as follows.

http://www.users.globalnet.co.uk/~ngo/ast03300.htm