Passing information forward from sentence to sentence.

William Overington

Copyright 2002 William Overington

Tuesday 22 October 2002

An interesting feature of localization using the comet circumflex system is the passing of information forward from sentence to sentence.

Suppose, for example, that one has the following two sentences, one after the other in the order shown here.

Here is a bird.

It is white.

How should the word "It" be localized in a target language? Many languages have grammatical gender for nouns. Thus the second sentence could be translated into something which, rendered word for word back into English, would look like any of "It is white.", "He is white." or "She is white.", depending upon the language and upon the grammatical gender of the noun "bird" in that language. In addition, in some languages the translation of the word "white" in the above example will vary depending upon the grammatical gender of the noun which it describes.

However, the word "it" when used in the English sentence "It is raining." does not refer to a specific noun, so there is no forward passing of gender information to such a sentence.

If the following two sentences are used, then consideration of a plural form for the word "white" would also be needed.

Here are some birds.

They are white.

This means that a target language which has three grammatical genders might need six different translations for the word "white" in the simple sentences mentioned above, namely one for each of the three genders in the singular and one for each in the plural.

In addition, if the noun which the word "white" qualifies is not the subject of the sentence but is the direct object, the indirect object or in some other grammatical case, then, depending upon the target language, yet further forms of the word "white" might be needed. In the comet circumflex system, for the moment, care is being taken so that any noun qualified by an adjective that is supplied as a parameter is the subject of the sentence. This minimizes the amount of information that needs to be carried in support files for localization. Whether such a restriction is desirable in the longer term remains to be seen; for now it is regarded as a practical balance between accurate automated localization and being able to convey a wider range of information.

However, this restriction only applies where the adjective is a parameter of a noun in the comet circumflex system. A preset sentence can have an adjective qualifying any noun, as the localization will be specific.

The comet circumflex system suggests that localization software should define some variables for conveying information forward from one sentence to the next. As localization is a local matter, such variables need not be standardized within the comet circumflex system, and, indeed, it may be that no single standardization would suit all target languages. However, it may be helpful for discussions of the techniques of localization of text encoded using the comet circumflex system to include a broad attempt at making some generalized suggestions.

I have tended to use the term "voussoir variables" for these variables, in that they are figuratively part of an arch of information leading from one sentence to the next.

Suppose that there are three radio buttons to indicate the voussoir variable gender. These are m, f, n, standing for masculine, feminine and neuter.

Suppose that there are two radio buttons to indicate the voussoir variable quantity. These are s, p standing for singular and plural.

Thus the sequence of gender followed by quantity is a two-letter code, such as fp for feminine plural.

The Unicode system contains a good selection of circled letters and it may well be convenient to use them as markers within the records of a localization database.

Please consider the following Unicode characters for use as markers within the records of a localization database for the comet circumflex system in order to indicate grammatical gender and also to indicate singularity and plurality.

U+24DC CIRCLED LATIN SMALL LETTER M (that is, a circled m)

U+24D5 CIRCLED LATIN SMALL LETTER F (that is, a circled f)

U+24DD CIRCLED LATIN SMALL LETTER N (that is, a circled n)

U+24E2 CIRCLED LATIN SMALL LETTER S (that is, a circled s)

U+24DF CIRCLED LATIN SMALL LETTER P (that is, a circled p)
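As an illustration, the two voussoir variables and their circled-letter markers could be sketched in Python as follows. This is only a sketch: the names Gender, Quantity, circled, voussoir_code and voussoir_marker are my own inventions for the example and are not part of the comet circumflex system. The sketch relies on the fact that the circled small letters a to z occupy consecutive Unicode code points starting at U+24D0.

```python
from enum import Enum


class Gender(Enum):
    """Voussoir variable for grammatical gender (hypothetical name)."""
    MASCULINE = "m"
    FEMININE = "f"
    NEUTER = "n"


class Quantity(Enum):
    """Voussoir variable for singular or plural (hypothetical name)."""
    SINGULAR = "s"
    PLURAL = "p"


# CIRCLED LATIN SMALL LETTER A; the letters b to z follow it in order.
CIRCLED_SMALL_A = 0x24D0


def circled(letter: str) -> str:
    """Return the circled form of one lowercase Latin letter a to z."""
    if len(letter) != 1 or not "a" <= letter <= "z":
        raise ValueError("expected one lowercase letter a to z")
    return chr(CIRCLED_SMALL_A + ord(letter) - ord("a"))


def voussoir_code(gender: Gender, quantity: Quantity) -> str:
    """Two-letter code: gender followed by quantity."""
    return gender.value + quantity.value


def voussoir_marker(gender: Gender, quantity: Quantity) -> str:
    """The same two-letter code written with circled letters."""
    return circled(gender.value) + circled(quantity.value)


if __name__ == "__main__":
    marker = voussoir_marker(Gender.FEMININE, Quantity.PLURAL)
    print(voussoir_code(Gender.FEMININE, Quantity.PLURAL))
    print(" ".join(f"U+{ord(c):04X}" for c in marker))
```

Using enumerations rather than bare letters mirrors the idea of radio buttons, in that each variable can hold exactly one of a fixed set of values.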


Suppose that some colours are encoded as an indexed list in the comet circumflex system, so that those colours are known as being needed for localization.

index number for colour   English localization
0                         black
1                         brown
2                         red
3                         orange
4                         yellow
5                         green
6                         blue
7                         violet
8                         grey
9                         white
10                        turquoise
11                        pink

A localization database could use the following coding for looking up the colour white, where it is the feminine plural form that is needed.

U+0039 U+24D5 U+24DF

That is, the number 9 followed by the codes for circled f and circled p.

The colour white as masculine singular could use the following coding in the localization database for colours.

U+0039 U+24DC U+24E2

That is, the number 9 followed by the codes for circled m and circled s.

This coding is just for colours where the noun which is being qualified is the subject of the sentence.
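The construction of such a lookup key could be sketched in Python as follows. The function names colour_key and code_points are hypothetical. I am also assuming that an index of more than one digit, such as 10 for turquoise, would simply be written as its decimal digits; the examples above only show the single digit 9.

```python
# Circled-letter markers for the voussoir variables.
GENDER_MARKERS = {"m": "\u24dc", "f": "\u24d5", "n": "\u24dd"}
QUANTITY_MARKERS = {"s": "\u24e2", "p": "\u24df"}


def colour_key(index: int, gender: str, quantity: str) -> str:
    """Build a database key: colour index, then gender, then quantity.

    Assumption: the index is written as ordinary decimal digits.
    """
    if index < 0:
        raise ValueError("colour index must not be negative")
    return str(index) + GENDER_MARKERS[gender] + QUANTITY_MARKERS[quantity]


def code_points(text: str) -> str:
    """Show a string in U+XXXX notation, for checking by eye."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


if __name__ == "__main__":
    print(code_points(colour_key(9, "f", "p")))  # U+0039 U+24D5 U+24DF
    print(code_points(colour_key(9, "m", "s")))  # U+0039 U+24DC U+24E2
```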

Here are a few sentences so as to carry out some experiments with the comet circumflex system.

Firstly some sentences which do not have a parameter.

comet circumflex code   English localization
23011                   Here is a bird.
23012                   Here is a tree.
23013                   Here is a flower.
23111                   Here are some birds.
23112                   Here are some trees.
23113                   Here are some flowers.
23211                   There is a bird.
23212                   There is a tree.
23213                   There is a flower.
23311                   There are some birds.
23312                   There are some trees.
23313                   There are some flowers.

Now some sentences which have one parameter.

comet circumflex code   nature of parameter 1                                                     English localization
23411                   An integer representing a colour from the comet circumflex colour list.   It is P1.
23511                   An integer representing a colour from the comet circumflex colour list.   They are P1.

Thus it is possible to have the following encoded in the comet circumflex system.

There is a bird. It is brown. Here are some flowers. They are yellow. There is a tree. It is green.
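To show how voussoir variables might carry gender and quantity forward, here is a small Python sketch that localizes the above message into French. The choice of French, the wording of the French sentences, the layout of the tables and the name localize are all my own assumptions for the purpose of illustration; they are not part of the comet circumflex system. Only masculine and feminine appear because French has no neuter gender. The colour table uses the circled-letter keys suggested earlier and holds only the three entries that this message needs.

```python
# Circled-letter markers for the voussoir variables.
MARK = {"m": "\u24dc", "f": "\u24d5", "s": "\u24e2", "p": "\u24df"}

# Preset sentences: French text, plus the gender and quantity of the noun,
# which are passed forward as voussoir variables (illustrative data).
NOUN_SENTENCES = {
    23211: ("Voilà un oiseau.", "m", "s"),
    23113: ("Voici des fleurs.", "f", "p"),
    23212: ("Voilà un arbre.", "m", "s"),
}

# Sentences with one colour parameter; the pronoun depends on the gender.
PRONOUN_SENTENCES = {
    23411: {"m": "Il est {}.", "f": "Elle est {}."},
    23511: {"m": "Ils sont {}.", "f": "Elles sont {}."},
}

# Colour database keyed by index, circled gender and circled quantity.
COLOURS = {
    "1" + MARK["m"] + MARK["s"]: "brun",
    "4" + MARK["f"] + MARK["p"]: "jaunes",
    "5" + MARK["m"] + MARK["s"]: "vert",
}


def localize(message):
    """Localize a list of (code, parameters) pairs into French."""
    gender = quantity = None  # the voussoir variables
    sentences = []
    for code, params in message:
        if code in NOUN_SENTENCES:
            # A noun sentence sets the voussoir variables for what follows.
            text, gender, quantity = NOUN_SENTENCES[code]
        else:
            if gender is None:
                raise ValueError("no earlier sentence has set the gender")
            key = str(params[0]) + MARK[gender] + MARK[quantity]
            text = PRONOUN_SENTENCES[code][gender].format(COLOURS[key])
        sentences.append(text)
    return " ".join(sentences)


if __name__ == "__main__":
    print(localize([(23211, []), (23411, [1]), (23113, []),
                    (23511, [4]), (23212, []), (23411, [5])]))
```

The point of the sketch is that the sentence "It is P1." carries no gender of its own: the pronoun and the form of the colour word are both chosen from values left behind by the preceding sentence.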

As an experiment here are some sentences with two parameters, where each of the parameters is a colour.

comet circumflex code   nature of both parameter 1 and parameter 2                                English localization
23412                   An integer representing a colour from the comet circumflex colour list.   It is P1 and P2.
23512                   An integer representing a colour from the comet circumflex colour list.   They are P1 and P2.
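For the English localization, substituting the parameters is straightforward, as English colour words do not change with gender or number. The following Python sketch shows one way in which it might be done; the name render and the data layout are hypothetical, and only a few of the sentence codes are included.

```python
# The comet circumflex colour list; the list position is the index number.
COLOURS = ["black", "brown", "red", "orange", "yellow", "green",
           "blue", "violet", "grey", "white", "turquoise", "pink"]

# A few sentence codes with their English localization.
SENTENCES = {
    23213: "There is a flower.",
    23411: "It is P1.",
    23511: "They are P1.",
    23412: "It is P1 and P2.",
    23512: "They are P1 and P2.",
}


def render(code, params=()):
    """Return the English sentence with P1, P2, ... replaced by colours."""
    text = SENTENCES[code]
    # Work from the highest parameter number down, so that a placeholder
    # such as P1 is never matched inside a longer one such as P10.
    for number in range(len(params), 0, -1):
        text = text.replace(f"P{number}", COLOURS[params[number - 1]])
    return text


if __name__ == "__main__":
    message = [(23213, ()), (23412, (2, 3))]
    print(" ".join(render(code, params) for code, params in message))
```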

In order to send the message

There is a flower. It is red and orange.

one needs to send two encoded sentences. The first is a comet circumflex key, followed by the number 23213, followed by a comet circumflex on a screen. The second is a comet circumflex key, followed by the number 23412, followed by a U+2460 character (CIRCLED DIGIT ONE), followed by the number 2 (for red), followed by a U+2461 character (CIRCLED DIGIT TWO), followed by the number 3 (for orange), followed by a comet circumflex on a screen.

That is a sequence of the following 26 Unicode characters.

U+2604 U+0302 U+20E3 U+0032 U+0033 U+0032 U+0031 U+0033 U+2604 U+0302 U+20E2 U+2604 U+0302 U+20E3 U+0032 U+0033 U+0034 U+0031 U+0032 U+2460 U+0032 U+2461 U+0033 U+2604 U+0302 U+20E2
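The sequence above can be checked with a short Python sketch. The function names are hypothetical. The sketch assumes that parameter n is always introduced by the circled digit n, which is a generalization of mine from the two parameters shown here; Unicode has circled numbers from 1 to 20 as consecutive code points starting at U+2460, so the sketch refuses to go beyond twenty parameters.

```python
# Comet, combining circumflex, then an enclosing mark.
KEY = "\u2604\u0302\u20e3"     # comet circumflex key (enclosing keycap)
SCREEN = "\u2604\u0302\u20e2"  # comet circumflex on a screen

CIRCLED_DIGIT_ONE = 0x2460  # circled 1 to 20 are consecutive from here


def encode_sentence(code, params=()):
    """Encode one sentence: key, code digits, parameters, screen."""
    if len(params) > 20:
        raise ValueError("only circled numbers 1 to 20 are available")
    parts = [KEY, str(code)]
    for position, value in enumerate(params):
        parts.append(chr(CIRCLED_DIGIT_ONE + position))  # parameter marker
        parts.append(str(value))                         # parameter value
    parts.append(SCREEN)
    return "".join(parts)


def encode_message(sentences):
    """Encode a list of (code, parameters) pairs as one string."""
    return "".join(encode_sentence(code, params) for code, params in sentences)


def code_points(text):
    """Show a string in U+XXXX notation."""
    return " ".join(f"U+{ord(c):04X}" for c in text)


if __name__ == "__main__":
    text = encode_message([(23213, ()), (23412, (2, 3))])
    print(len(text))
    print(code_points(text))
```

Running the sketch on the message "There is a flower. It is red and orange." gives the 26 characters listed above.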

Here is a link to a web page which contains that sequence of characters, with no font specified for the sequence.

c_c00201.htm

 

The comet_circumflex system.

This file is accessible as follows.

http://www.users.globalnet.co.uk/~ngo/c_c00200.htm