A Private Use Area code point for a watermark-like memory for optional use in the processing of code points for whole ligatures.

This document is about a Private Use Area code point for a watermark-like memory for optional use in the processing of code points for whole ligatures.

The use of this code point is entirely optional. It is, however, provided in case any end users who are using this collection of code points for ligatures would like to use it.

The code point allocation is as follows.

U+E7C1 WATERMARK-LIKE MEMORY THAT A WHOLE LIGATURE WAS ORIGINALLY USED FOR THE FOLLOWING LIGATURE

The idea is as follows.

Suppose, for the purpose of this explanation, that someone is transcribing an 18th Century English printed book into a computer database.

For work such as sorting the words, indexing the book or authoring a dictionary of the words used in the book, one would ideally store a ct ligature using the c ZWJ t approach.

This would mean that a ct ligature would be indicated by the three regular Unicode characters U+0063 U+200D U+0074 each time it occurred in the text.
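As a small illustration of that three-character sequence, here is a minimal sketch in Python. The helper name zwj_ligature is my own invention for this example and is not part of any standard.

```python
ZWJ = "\u200D"  # U+200D ZERO WIDTH JOINER


def zwj_ligature(*letters: str) -> str:
    """Join the letters of a ligature with ZWJ, for example c ZWJ t.

    Hypothetical helper, named for this illustration only.
    """
    return ZWJ.join(letters)


ct = zwj_ligature("c", "t")

# List the code points that make up the stored sequence.
code_points = [f"U+{ord(ch):04X}" for ch in ct]
print(code_points)
```

The list printed should contain the three code points named above, in order.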

However, where someone does not have the more modern facilities needed to display a ct ligature on the screen when the c ZWJ t sequence is entered, keying the golden ligatures collection code U+E707 instead would, in my opinion, be acceptable.

After keying in the text, a software utility could be used to convert the resulting file to a format where all of the ligatures were broken down into the indirect format using ZWJ characters. This technique could be used not only for the ct ligature but also for various long s ligatures as well. It seems to me to be a very beneficial solution all round.
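Such a conversion utility could be sketched along the following lines. This is only an illustration under stated assumptions: the table holds just three entries (U+E707 from the golden ligatures collection as described in this document, and the regular Unicode code points U+FB05 and U+FB06), and the names WHOLE_TO_ZWJ and break_down_ligatures are hypothetical.

```python
ZWJ = "\u200D"  # U+200D ZERO WIDTH JOINER

# Illustrative table only; a real utility would list every ligature it needs.
WHOLE_TO_ZWJ = {
    "\uE707": "c" + ZWJ + "t",        # golden ligatures collection ct
    "\uFB06": "s" + ZWJ + "t",        # LATIN SMALL LIGATURE ST
    "\uFB05": "\u017F" + ZWJ + "t",   # LATIN SMALL LIGATURE LONG S T
}


def break_down_ligatures(text: str, table: dict = WHOLE_TO_ZWJ) -> str:
    """Replace each whole-ligature code point by its ZWJ sequence.

    Characters not in the table are passed through unchanged.
    """
    return "".join(table.get(ch, ch) for ch in text)


keyed = "respe\uE707"            # keyed with the whole ct ligature code
converted = break_down_ligatures(keyed)
```

Here converted would hold the letters of respect with a ZWJ between the c and the t.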

One possibility is that various people might be separately transcribing 18th Century English printed books into computer files and then sending them to one central place where each of the files is added into a database of transcriptions of 18th Century English printed books, so as to make a collection of such transcriptions.

Now, to the matter of having a WATERMARK-LIKE MEMORY THAT A WHOLE LIGATURE WAS ORIGINALLY USED FOR THE FOLLOWING LIGATURE code. Ideally that code would be a regular Unicode code that displays as zero width and is ignored for the purposes of sorting, collating and so on. Some complete ligature characters are already encoded in regular Unicode, so having such a code in regular Unicode, even if it were also sometimes used with Private Use Area encodings of ligature characters, seems reasonable to me.
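As a Private Use Area code point carries no such built-in behaviour, experimental software would need to ignore the code itself when sorting. The following is a minimal sketch of one way to do that, assuming a simple code point ordering rather than a full collation algorithm; the function name sort_key is hypothetical, and removing ZWJ from the key as well is my own assumption for this illustration.

```python
ZWJ = "\u200D"       # U+200D ZERO WIDTH JOINER
WLMTAWL = "\uE7C1"   # the Private Use Area code point suggested in this document


def sort_key(stored: str) -> str:
    """Return the text with the watermark and ZWJ removed, for sorting only.

    The stored text itself is left untouched; only the key ignores them.
    """
    return stored.replace(WLMTAWL, "").replace(ZWJ, "")


words = [
    "astronomy",
    "a" + WLMTAWL + "s" + ZWJ + "trolabe",
    "as" + ZWJ + "tride",
]
ordered = sorted(words, key=sort_key)
```

With this key the three words sort as astride, astrolabe, astronomy, whichever way the st ligature was keyed.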

However, at present, such a code does not exist in regular Unicode. So, I am publishing in this series of documents this Private Use Area code point allocation in the hope that it will be of interest, and maybe of practical use, to some readers.

My reasoning for suggesting such a code is as follows. Suppose that a database takes in ligatures expressed in ZWJ format and stores them directly, and also takes in ligatures expressed as a code point representing a complete ligature, converting them to ZWJ format before storing them. The owner of the database might then like to keep a record of whether each ligature arrived in one form or the other. The owner might choose not to record how the original coding was made, but he or she might wish to record such information as part of the provenance of the database. So, in order to provide for the possibility that the owner of such a database does wish to preserve a record that the original document used a whole ligature code rather than a ZWJ sequence, I suggest the WATERMARK-LIKE MEMORY THAT A WHOLE LIGATURE WAS ORIGINALLY USED FOR THE FOLLOWING LIGATURE code. If that code is ever implemented in regular Unicode it will probably have a different, shorter name. Yet for this document and for experiments, where experimental software needs to have clearly commented source code, such a name for the code point is not unreasonable.

So, suppose that someone is adding to a database a text file containing the word astrolabe including a ligature for the st. Please note that the st ligature is U+FB06, a regular Unicode code point. For the purpose of this document let us please use WLMTAWL to stand for the WATERMARK-LIKE MEMORY THAT A WHOLE LIGATURE WAS ORIGINALLY USED FOR THE FOLLOWING LIGATURE code point value.

My thinking is that if the word astrolabe arrived as asZWJtrolabe then it is stored as asZWJtrolabe in the database, yet if it arrived as aU+FB06rolabe then it is stored as aWLMTAWLsZWJtrolabe in the database. Thus either method of keying the st ligature can be used. Both methods result in the archive storing alphabetically sortable text and, in addition, the fact that a whole ligature character was used in the original document is recorded in the database.
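The storage rule just described can be sketched as follows. This is an illustration only, with a deliberately short table of two ligatures; the names WHOLE_TO_ZWJ and store_form are hypothetical.

```python
ZWJ = "\u200D"       # U+200D ZERO WIDTH JOINER
WLMTAWL = "\uE7C1"   # the Private Use Area code point suggested in this document

# Illustrative table only: the st ligature of regular Unicode and the
# ct ligature of the golden ligatures collection.
WHOLE_TO_ZWJ = {
    "\uFB06": "s" + ZWJ + "t",
    "\uE707": "c" + ZWJ + "t",
}


def store_form(text: str) -> str:
    """Convert incoming text to the form stored in the database.

    A whole-ligature code point becomes WLMTAWL followed by its ZWJ
    sequence; text already keyed with ZWJ is stored unchanged.
    """
    out = []
    for ch in text:
        if ch in WHOLE_TO_ZWJ:
            out.append(WLMTAWL + WHOLE_TO_ZWJ[ch])
        else:
            out.append(ch)
    return "".join(out)


from_whole = store_form("a\uFB06rolabe")         # arrived with U+FB06
from_zwj = store_form("as" + ZWJ + "trolabe")    # arrived with s ZWJ t
```

The two stored forms differ only by the presence of the WLMTAWL code before the s.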

The database files could, if it were so desired, be searched by the database manager using a specially written program, so as to find out the answer to such a question as the following.

For all of the ligature codes used in documents added into the database, how many were keyed using ZWJ codes and how many were keyed using codes for whole ligatures?

In order to find the answer to this question the software would simply look for ZWJ occurrences and determine whether or not a WLMTAWL code is present immediately before the first character of the ligature sequence.
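A minimal sketch of such a counting program follows. It assumes that a ligature sequence is a run of single characters joined by ZWJ, which is a simplification (combining marks, for example, are not considered); the names LIGATURE_SEQUENCE and count_ligature_origins are hypothetical.

```python
import re

ZWJ = "\u200D"       # U+200D ZERO WIDTH JOINER
WLMTAWL = "\uE7C1"   # the Private Use Area code point suggested in this document

# An optional watermark, then one character, then one or more
# repetitions of ZWJ followed by one character.
LIGATURE_SEQUENCE = re.compile(
    "(" + WLMTAWL + ")?"
    "[^" + ZWJ + WLMTAWL + "]"
    "(?:" + ZWJ + "[^" + ZWJ + WLMTAWL + "])+"
)


def count_ligature_origins(stored: str) -> dict:
    """Count ligature sequences by how they were originally keyed."""
    counts = {"whole": 0, "zwj": 0}
    for match in LIGATURE_SEQUENCE.finditer(stored):
        if match.group(1):
            counts["whole"] += 1   # watermark present before the sequence
        else:
            counts["zwj"] += 1     # no watermark: keyed as a ZWJ sequence
    return counts


sample = ("a" + WLMTAWL + "s" + ZWJ + "trolabe, "
          "as" + ZWJ + "tride, respec" + ZWJ + "t")
counts = count_ligature_origins(sample)
```

For the sample text, one ligature carries the watermark and two do not.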

So, my idea for a WATERMARK-LIKE MEMORY THAT A WHOLE LIGATURE WAS ORIGINALLY USED FOR THE FOLLOWING LIGATURE code is basically quite straightforward and could be easily used to good advantage. However, its use would not be obligatory, so that if, say, a database manager has no interest in whether the original of a document used a ZWJ sequence or a U+FB.. or a golden ligatures collection code for a ligature, then the WATERMARK-LIKE MEMORY THAT A WHOLE LIGATURE WAS ORIGINALLY USED FOR THE FOLLOWING LIGATURE code need not be used at all in that particular database application.

Naturally, it would be best if such a code were part of regular Unicode. If more ligatures are encoded in regular Unicode at some future time, then maybe it could be added as part of the same process as the adding of the ligatures. Yet, thinking that some people might like to try out programming experiments with the technique now, I suggest a particular code within the Private Use Area. My hope is that, if various people try out such experiments, any files produced can be interchanged from experimenter to experimenter as part of the research process. Suggesting a particular code also provides a stepping stone, so that an experimenter has a definite place to start.

Now there is the matter that software used to break down whole ligatures into the ZWJ sequences needs to be programmed so as to know the ZWJ sequence into which each whole ligature breaks down.

It is a choice for programmers of such software whether to encode just some of the ligatures, as thought necessary for the particular task for which the piece of software is intended, or whether to encode all of the ligatures in both regular Unicode and the golden ligatures collection.

However, programmers are invited to consider that if they choose to encode all of the U+FB.. ligatures of regular Unicode and all of the golden ligatures collection ligatures in this manner within their programs, then those programs might potentially be of use in wider contexts than originally thought.
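For the Latin ligatures U+FB00 to U+FB06 of regular Unicode, one possible way to build the breakdown table is to derive it from the compatibility decompositions in the Unicode character database, as in the following sketch using Python's unicodedata module. Private Use Area code points have no decomposition, so golden ligatures collection entries would have to be added by hand; only U+E707 is shown here. The names zwj_sequence_for and LIGATURE_TABLE are hypothetical.

```python
import unicodedata

ZWJ = "\u200D"  # U+200D ZERO WIDTH JOINER


def zwj_sequence_for(ch: str):
    """Derive a ZWJ sequence from a compatibility decomposition, if any.

    Returns None for characters without a <compat> decomposition.
    """
    decomposition = unicodedata.decomposition(ch)
    if not decomposition.startswith("<compat>"):
        return None
    parts = decomposition.split()[1:]   # hexadecimal code points
    return ZWJ.join(chr(int(part, 16)) for part in parts)


# The Latin ligatures U+FB00 to U+FB06 of regular Unicode.
LIGATURE_TABLE = {
    chr(cp): zwj_sequence_for(chr(cp)) for cp in range(0xFB00, 0xFB07)
}

# Golden ligatures collection entries must be listed by hand.
LIGATURE_TABLE["\uE707"] = "c" + ZWJ + "t"
```

A program built around such a table could be extended to further ligatures simply by adding entries, without changing the conversion logic.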

William Overington

2 July 2002


 

This file is accessible as follows.

http://www.users.globalnet.co.uk/~ngo/ligwater.htm