An infrastructure to enable end users to enter 21 bit unicode characters for use on the DVB-MHP platform.

William Overington

Tuesday 12 February 2002

The unicode system is today a 21 bit system. Details are at the http://www.unicode.org website.

The Java platform as used in the DVB-MHP system uses 16 bit characters.

The unicode system has facilities for expressing those 21 bit characters which have an integer value greater than 65535 as a sequence of two surrogates, known as a surrogate pair. The first surrogate of a pair has a value in the range D800 to DBFF hexadecimal and the second has a value in the range DC00 to DFFF hexadecimal. Each surrogate thus has a 16 bit unicode value and can be treated within software as if it were a 16 bit unicode character, with the effect that each surrogate can be held in a Java char and stored within a Java String.

Thus a string of text which uses characters from the 21 bit unicode range can be expressed within a Java String which uses 16 bits for each character. The length of the Java String, counted in 16 bit characters, will exceed the number of unicode characters it represents by one for each surrogate pair used.
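The arithmetic behind a surrogate pair can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the class name SurrogatePairSketch and the method name encode are hypothetical, and only char arithmetic is used, on the assumption that newer library helpers may not be present on a DVB-MHP terminal.

```java
final class SurrogatePairSketch {

    /** Returns the given unicode value as a String of one or two 16 bit chars. */
    static String encode(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("outside the range 0 to 10FFFF");
        }
        if (codePoint <= 0xFFFF) {
            // Fits in a single Java char. Values D800 to DFFF are passed
            // through unchecked in this sketch.
            return String.valueOf((char) codePoint);
        }
        int offset = codePoint - 0x10000;                // 20 bits remain
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new String(new char[] { high, low });
    }

    public static void main(String[] args) {
        // Three unicode characters, one of which needs a surrogate pair.
        String s = "A" + encode(0x10300) + "B";
        System.out.println(s.length());  // prints 4
    }
}
```

The string in main holds three unicode characters yet has a length of four, which is the "longer by one character for each surrogate pair" effect described above.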

A Java program which is running on a DVB-MHP platform may need to manipulate characters which are not available on a conventional keyboard, and sometimes to have those characters entered by the end user.

In many DVB-MHP terminals there will not be an ordinary keyboard, just the twenty buttons of the minimum set of input events for the end user to use.

There are two sets of issues: those concerned with entering characters, and those concerned with the Java program interpreting characters for such purposes as displaying them on a screen.

Java provides the \uhhhh mechanism for entering a 16 bit unicode character into a character or a string at compile time, where each h is any hexadecimal character. A sequence of two appropriate \uhhhh sequences may be used to enter a sequence of two surrogate characters into a Java String at compile time.
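As a small illustration of the compile time mechanism, the following sketch sets up one 16 bit character and one 21 bit character. The class and constant names are hypothetical, and U+10300 is chosen only as an example of a character beyond 65535.

```java
final class CompileTimeEscapes {

    // One 16 bit character, U+00E9, entered as a single escape.
    static final String E_ACUTE = "\u00E9";

    // One 21 bit character, U+10300, entered as the two escapes
    // for its surrogate pair.
    static final String OLD_ITALIC_A = "\uD800\uDF00";

    public static void main(String[] args) {
        System.out.println(E_ACUTE.length());       // prints 1
        System.out.println(OLD_ITALIC_A.length());  // prints 2
    }
}
```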

Two formats for inputting Java characters at run time are suggested. These formats essentially mean, in the parlance of unicode, that the strings are input as fancy text rather than as plain text. However, this particular fancy text needs just a little processing to be expressible as unicode plain text using only 16 bit characters.

The formats are simply as follows. Any 16 bit unicode character may be expressed as a sequence of six characters 'uhhhh, where each h is any hexadecimal character. Any 21 bit unicode character may be expressed as a sequence of eight characters 'Uhhhhhh, where each h is any hexadecimal character and hhhhhh represents a hexadecimal number in the range 0 to 10FFFF. These formats may be mixed in with ordinary unicode characters which directly represent themselves, provided that each use of the 'uhhhh or 'Uhhhhhh format is complete in itself: each 'uhhhh sequence is exactly six characters long and each 'Uhhhhhh sequence is exactly eight characters long.

In the above description of h as representing any hexadecimal character, the characters 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F, a, b, c, d, e, f should all be recognized as hexadecimal characters when 'uhhhh and 'Uhhhhhh sequences are being interpreted.
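The interpretation of 'uhhhh and 'Uhhhhhh sequences might be sketched as below. This is one possible reading under stated assumptions, not a definitive implementation. The class and method names are hypothetical. The treatment of a ' character which does not start a complete sequence, or of a 'Uhhhhhh value above 10FFFF, is not specified in this document; the sketch assumes that such text is simply copied through unchanged. StringBuffer and plain char arithmetic are used on the assumption that newer library helpers may not be present on a DVB-MHP terminal.

```java
final class HyperFormatConverter {

    /** Value of one hexadecimal character (0-9, A-F, a-f), or -1 if c is not one. */
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    }

    /** Parses exactly count hexadecimal characters starting at from, or returns -1. */
    static int parseHex(String s, int from, int count) {
        if (from + count > s.length()) return -1;
        int value = 0;
        for (int i = from; i < from + count; i++) {
            int digit = hexValue(s.charAt(i));
            if (digit < 0) return -1;
            value = (value << 4) | digit;
        }
        return value;
    }

    /** Replaces each complete sequence by the character or surrogate pair it denotes. */
    static String toOrdinaryPlus(String in) {
        StringBuffer out = new StringBuffer();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (c == '\'' && i + 1 < in.length()) {
                char kind = in.charAt(i + 1);
                int value = -1;
                int length = 0;
                if (kind == 'u') {
                    value = parseHex(in, i + 2, 4);
                    length = 6;
                } else if (kind == 'U') {
                    value = parseHex(in, i + 2, 6);
                    length = 8;
                }
                if (value >= 0 && value <= 0x10FFFF) {
                    if (value <= 0xFFFF) {
                        out.append((char) value);
                    } else {
                        int offset = value - 0x10000;
                        out.append((char) (0xD800 + (offset >> 10)));
                        out.append((char) (0xDC00 + (offset & 0x3FF)));
                    }
                    i += length;
                    continue;
                }
            }
            // Assumption: anything that is not a complete, in-range
            // sequence is copied through unchanged.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toOrdinaryPlus("'U010300").length());  // prints 2
    }
}
```

Note that a 'uhhhh sequence may itself name a surrogate value, so a check on the resulting string is still advisable after conversion.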

The ' character is used for the 'uhhhh and 'Uhhhhhh formats, rather than a \ character, so that data in these formats will just be passed through the system software and no built in features for handling \ characters will affect the data being entered by the end user.

Where a DVB-MHP terminal does not have a keyboard attached and the end user has only the twenty buttons of the minimum set of input events available, these formats can still be used by means of a Java program which has been received over the telesoftware broadcast link. That Java program needs to have within it an Astrolabe Channel numerical pointer which has implemented within its unicode courtyard the optional addition which allows the generation of strings and string events. The end user can then use the unicode courtyard of the Astrolabe Channel numerical pointer to enter the unicode character. The Astrolabe Channel numerical pointer can produce a string event with an accompanying data string containing the characters input using the unicode courtyard, in the 'uhhhh or 'Uhhhhhh format as appropriate, and then send it on to the rest of the program just as if the sequence containing 'uhhhh or 'Uhhhhhh had been entered at a keyboard. Recognizing that some Java programs may process the characters input at a keyboard one character at a time, the new optional additions to the unicode courtyard of the Astrolabe Channel numerical pointer also include facilities to generate character events easily, covering the characters ' U u, enter and the sixteen hexadecimal items.

Here is a hyperlink to a document detailing the optional additions to the unicode courtyard of the Astrolabe Channel numerical pointer.

The optional additions to the unicode courtyard of the Astrolabe Channel numerical pointer.

A parlance using six formats for a string is suggested for use with this infrastructure. This enables accurate discussion of what is happening in any given situation.

The formats are as follows.

ordinary format

ordinary plus format

super format

super plus format

hyper format

hyper plus format

The "ordinary" designation means that no 'uhhhh sequences and no 'Uhhhhhh sequences are included in the string.

The "super" designation means that some 'uhhhh sequences may be included in the string but that no 'Uhhhhhh sequences are included in the string. A particular string that has a super designation might not actually contain any 'uhhhh sequence; the designation refers to the fact that processing must be carried out with the possibility that it does in mind.

The "hyper" designation means that some 'uhhhh sequences may be included in the string and that some 'Uhhhhhh sequences may be included in the string. A particular string that has a hyper designation might not actually contain any 'uhhhh or any 'Uhhhhhh sequence; the designation refers to the fact that processing must be carried out with the possibility that it does in mind.

The "plus" designation is used when surrogate characters, which represent 21 bit unicode values, may be included in a string. A particular string that has a plus designation might not actually contain any surrogate pairs; the designation refers to the fact that processing must be carried out with the possibility that it does in mind.

The formats may be converted by software as follows.

super format can always be converted to produce ordinary format.
super plus format can always be converted to produce ordinary plus format and may, if there were in fact no surrogate pairs in the string, be converted to produce ordinary format.
hyper format can always be converted to produce ordinary plus format and may, if there were in fact no 'Uhhhhhh sequences in the string, or if every 'Uhhhhhh sequence produced a value which could be represented by a 16 bit unicode value, be converted to produce ordinary format.
hyper plus format can always be converted to produce ordinary plus format and may, if there were in fact no 'Uhhhhhh sequences in the string, or if every 'Uhhhhhh sequence produced a value which could be represented by a 16 bit unicode value, and there were in fact no surrogate pairs in the string, be converted to produce ordinary format.

It is recommended that conversion of super plus format, hyper format and hyper plus format is first made with a presumption that the result will be a string in ordinary plus format. A check can then be made to determine whether, in fact, the string in ordinary plus format can be regarded as being in ordinary format.

It is also possible to check a string which starts off being presumed to be in ordinary plus format in order to find out whether it can be regarded as being in ordinary format.
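Such a check might look like the sketch below, which takes "can be regarded as being in ordinary format" to mean that no char of the string lies in the surrogate range D800 to DFFF. That reading, and the class and method names, are assumptions made for illustration.

```java
final class OrdinaryFormatCheck {

    /** True if c is a surrogate, that is, in the range D800 to DFFF. */
    static boolean isSurrogate(char c) {
        return c >= 0xD800 && c <= 0xDFFF;
    }

    /** True if a string presumed to be ordinary plus contains no surrogates at all. */
    static boolean canBeRegardedAsOrdinary(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (isSurrogate(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(canBeRegardedAsOrdinary("plain text"));     // prints true
        System.out.println(canBeRegardedAsOrdinary("\uD800\uDF00"));   // prints false
    }
}
```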

It is also desirable to check, once the end user has used 'uhhhh or 'Uhhhhhh sequences to enter characters into a running Java program, that the resulting character string contains no lone surrogate characters, that is, surrogates which do not form part of a proper surrogate pair.
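A check for lone surrogates can be sketched as a single pass over the string. The class and method names are hypothetical; the sketch reports the position of the first offending char so that the calling program can decide how to respond, which is a design choice of the sketch rather than something laid down in this document.

```java
final class LoneSurrogateCheck {

    /**
     * Returns the index of the first surrogate that is not part of a proper
     * pair (high surrogate immediately followed by low surrogate), or -1 if
     * there is none.
     */
    static int firstLoneSurrogate(String s) {
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF) {
                // High surrogate: the next char must be a low surrogate.
                if (i + 1 < s.length()) {
                    char next = s.charAt(i + 1);
                    if (next >= 0xDC00 && next <= 0xDFFF) {
                        i += 2;
                        continue;
                    }
                }
                return i;
            }
            if (c >= 0xDC00 && c <= 0xDFFF) {
                // Low surrogate with no high surrogate before it.
                return i;
            }
            i++;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstLoneSurrogate("a\uD800\uDF00b"));  // prints -1
        System.out.println(firstLoneSurrogate("a\uD800b"));        // prints 1
    }
}
```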

The same 21 bit unicode input formats may be used for setting up a string at compile time if desired. Used in this way, the 'uhhhh format simply means extra processing for the Java program when it is running. The 'Uhhhhhh format also means extra processing for the Java program at run time, but it does mean that a Java programmer does not need to convert 21 bit unicode characters to surrogate pairs when preparing the program, and the source code listing may be clearer as to which 21 bit unicode characters are being used.

Once a string has been converted to either ordinary format or ordinary plus format, the Java program must then have facilities to enable display of the intended characters. That is either straightforward or complicated depending upon which particular characters are in use. Some unicode characters are included in a required font in a minimum DVB-MHP terminal, and there are facilities to obtain a font file from the object carousel of a DVB-MHP channel, provided that the font is being broadcast. Care must be taken with characters which use surrogate pairs, as the font file format may not handle such characters. When only a small selection of characters from the 21 bit unicode set is being used, a workaround might be to broadcast a specially made font file in which those characters are mapped to some of the characters in the unicode 16 bit private use area, and to map those characters to the same private use area characters within the inner working of the Java program. Thus the end user would use the regular 21 bit unicode format and the Java program would, privately and internally only, use a localized 16 bit private use area encoding of the same characters in order to produce the display as if using 16 bit unicode characters.
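The private use area approach might be sketched as follows. Everything specific here is hypothetical: the class and method names, the two 21 bit values in the table and the private use characters E000 and E001 chosen for them. In practice the table would have to agree with whatever specially made font file is being broadcast. Surrogate pairs which are not in the table are left unchanged in this sketch.

```java
final class PrivateUseRemapper {

    // Hypothetical table: each 21 bit value and the private use area
    // character which the broadcast font is assumed to map it to.
    private static final int[] CODE_POINTS = { 0x10300, 0x10301 };
    private static final char[] PRIVATE_USE = { '\uE000', '\uE001' };

    /** Index of codePoint in the table, or -1 if it is not listed. */
    static int indexOf(int codePoint) {
        for (int k = 0; k < CODE_POINTS.length; k++) {
            if (CODE_POINTS[k] == codePoint) return k;
        }
        return -1;
    }

    /** Replaces listed surrogate pairs by their private use characters. */
    static String forDisplay(String ordinaryPlus) {
        StringBuffer out = new StringBuffer();
        int i = 0;
        while (i < ordinaryPlus.length()) {
            char c = ordinaryPlus.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < ordinaryPlus.length()) {
                char low = ordinaryPlus.charAt(i + 1);
                if (low >= 0xDC00 && low <= 0xDFFF) {
                    // Recombine the pair into its 21 bit value.
                    int cp = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                    int k = indexOf(cp);
                    if (k >= 0) {
                        out.append(PRIVATE_USE[k]);
                        i += 2;
                        continue;
                    }
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(forDisplay("x\uD800\uDF00y").length());  // prints 3
    }
}
```

The string returned by forDisplay is intended for drawing with the special font only; the program would keep the original ordinary plus string for any other purpose.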

There is also the possibility that a Java program can have its own set of .gif files to provide some specialized character sets and a Java program may also draw a character set from within its own software if that is considered to be the best solution for that particular application of some specialized character set.

Astrolabe Channel

This file is accessible as follows.

http://www.users.globalnet.co.uk/~ngo/ast02900.htm