A possibility for communication through the language barrier

William Overington

Monday 5 October 2015

Recently I became aware of the PanLex project and I have been looking at the webspace for the project.

http://www.panlex.org/

I have thought of an application, as follows.

If PanLex were to assign to each word root a number, the same number for the word in each language, then it would become possible to encode each word in a language-independent manner within a plain text message.

All that would be needed would be for the Unicode Standard to encode one additional character,

PANLEX BASE CHARACTER

Then each PanLex word root could be represented in a Unicode plain text message as a PANLEX BASE CHARACTER followed by a sequence of Unicode tag characters.

The tag characters were encoded some years ago and later deprecated. Recently, however, all but two of them have been undeprecated so that they can be used in sequences consisting of a base character followed by tag characters. The particular application that led to the undeprecation was the encoding of various flags of countries, regions and so on.

http://www.unicode.org/charts/PDF/UE0000.pdf

However, the technique of a base character followed by a sequence of tag characters can be used elsewhere.

I am suggesting it for use in communication through the language barrier using encoded localizable sentences.

http://www.users.globalnet.co.uk/~ngo/locsetag.htm

In relation to PanLex words, here is an example. The code number is chosen just for this example; it is not intended to go through to encoding.

Consider the word apricot.

Suppose that it is encoded as the number 72519430 in the PanLex listing.

Here the leading 7 indicates that seven digits follow it, the final 0 indicates that this is the general word, and the 251943 in between is simply a number that distinguishes the word apricot from other words.

The 251943 is just an example number chosen for this explanation.

The leading digit is used because more frequently used words could then have shorter, lower numbers.

For example, the word and could have the number, say, 12 and the word today could have the number, say, 245 and so on.

Returning to apricot.

There could be variations so as to disambiguate.

For example,

72519430 apricot

72519431 apricot (fruit)

72519432 apricot (tree)

72519433 apricot (colour)

72519434 apricot (flavour)
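
As a minimal sketch in Python, the example numbering above could be taken apart as follows. The function name split_code and the table APRICOT_SENSES are hypothetical, and the sketch assumes that every code ends with a one-digit variation number, which is stated above only for the apricot example and not for shorter codes such as 12 or 245.

```python
# Variation digits taken from the apricot example; a table for this sketch only.
APRICOT_SENSES = {0: "general", 1: "fruit", 2: "tree", 3: "colour", 4: "flavour"}


def split_code(code: str) -> tuple[int, str, int]:
    """Split a code such as "72519432" into (length digit, word number, variation digit).

    Assumption: the first digit counts the digits that follow it, and the
    last digit is the variation number.
    """
    if not (code.isascii() and code.isdigit()) or len(code) < 3:
        raise ValueError(f"not a usable code: {code!r}")
    length = int(code[0])
    rest = code[1:]
    if len(rest) != length:
        raise ValueError(f"leading digit {length} does not match {len(rest)} following digits")
    return length, rest[:-1], int(rest[-1])


length, word_number, variation = split_code("72519432")
print(length, word_number, APRICOT_SENSES[variation])
```

With the example code for apricot (tree), the word number is 251943 and the variation digit is 2.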

Suppose that the PANLEX BASE CHARACTER were to become encoded into Unicode.

Then an apricot in the sense of an apricot tree would be encoded into plain text using the following nine characters.

PANLEX BASE CHARACTER, TAG DIGIT SEVEN, TAG DIGIT TWO, TAG DIGIT FIVE, TAG DIGIT ONE, TAG DIGIT NINE, TAG DIGIT FOUR, TAG DIGIT THREE, TAG DIGIT TWO

The commas are just for clarity in this description; they are not in the message.
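
The nine-character sequence can be sketched in Python as follows. Since PANLEX BASE CHARACTER is not encoded in Unicode, the sketch uses the private use code point U+F0000 as a stand-in; that choice, and the names PANLEX_BASE, encode_panlex and describe, are assumptions made only for this illustration. The tag digits themselves are the existing characters U+E0030 to U+E0039.

```python
import unicodedata

# Hypothetical stand-in: a private use code point, because no
# PANLEX BASE CHARACTER exists in Unicode.
PANLEX_BASE = "\U000F0000"
TAG_DIGIT_ZERO = 0xE0030  # TAG DIGIT ZERO; TAG DIGIT NINE is U+E0039


def encode_panlex(code: str) -> str:
    """Return the stand-in base character followed by one tag digit per digit of code."""
    if not (code.isascii() and code.isdigit()):
        raise ValueError(f"code must consist of ASCII digits: {code!r}")
    return PANLEX_BASE + "".join(chr(TAG_DIGIT_ZERO + int(d)) for d in code)


def describe(text: str) -> list[str]:
    """List the character names, using the proposed name for the stand-in."""
    return [
        "PANLEX BASE CHARACTER" if ch == PANLEX_BASE else unicodedata.name(ch)
        for ch in text
    ]


encoded = encode_panlex("72519432")
print(len(encoded))
print(", ".join(describe(encoded)))
```

The length printed is the number of code points, which is the sense in which the sequence has nine characters.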

This seems a lot of characters, yet they could come from a cascading menu system in which the person sending the message selects the word apricot (tree) from a menu and the software inserts the nine characters into the message automatically.

This would open up lots of possibilities for communicating through the language barrier.

For example, suppose that one wishes to send through the language barrier the following sentence.

The apricot tree is beautiful.

One would need to construct the sentence.

There could be localizable sentences for constructing such a sentence.

For example, the following.

The subject noun of the sentence being constructed is as follows.

That would be encoded as LOCALIZABLE SENTENCE BASE CHARACTER followed by a sequence of tag characters to represent the code for the localizable sentence.

That localizable sentence would be followed by PANLEX BASE CHARACTER followed by a sequence of tag characters.

LOCALIZABLE SENTENCE BASE CHARACTER, TAG DIGIT SIX, TAG DIGIT SEVEN, TAG DIGIT ZERO, TAG DIGIT ZERO, TAG DIGIT ONE

PANLEX BASE CHARACTER, TAG DIGIT SEVEN, TAG DIGIT TWO, TAG DIGIT FIVE, TAG DIGIT ONE, TAG DIGIT NINE, TAG DIGIT FOUR, TAG DIGIT THREE, TAG DIGIT TWO
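
A sketch in Python of how these two sequences might be joined into one message and read back by receiving software is given here. Neither base character is encoded in Unicode, so the private use code points U+F0000 and U+F0001 stand in for them; those code points and the function names are assumptions for this illustration only.

```python
# Hypothetical stand-ins from a private use area; neither character is in Unicode.
PANLEX_BASE = "\U000F0000"
SENTENCE_BASE = "\U000F0001"
TAG_DIGIT_ZERO = 0xE0030
TAG_DIGIT_NINE = 0xE0039


def encode_item(base: str, code: str) -> str:
    """Return a base character followed by one tag digit per digit of code."""
    if not (code.isascii() and code.isdigit()):
        raise ValueError(f"code must consist of ASCII digits: {code!r}")
    return base + "".join(chr(TAG_DIGIT_ZERO + int(d)) for d in code)


def decode_message(text: str) -> list[tuple[str, str]]:
    """Read back a message as a list of (kind, code) pairs.

    Each base character starts a new item; the tag digits that follow
    are collected as that item's code.
    """
    items: list[list[str]] = []
    for ch in text:
        if ch == SENTENCE_BASE:
            items.append(["sentence", ""])
        elif ch == PANLEX_BASE:
            items.append(["word", ""])
        elif TAG_DIGIT_ZERO <= ord(ch) <= TAG_DIGIT_NINE and items:
            items[-1][1] += str(ord(ch) - TAG_DIGIT_ZERO)
        else:
            raise ValueError(f"unexpected character U+{ord(ch):04X}")
    return [(kind, code) for kind, code in items]


message = encode_item(SENTENCE_BASE, "67001") + encode_item(PANLEX_BASE, "72519432")
print(len(message))
print(decode_message(message))
```

The message is six characters for the localizable sentence plus nine for the word, fifteen in all.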

That is a lot of characters to send the message that the subject noun of the sentence being constructed is an apricot tree. Yet whereas localizable sentences are intended for a limited set of sentences (some just for friendly chat; some for serious applications such as, for example, seeking news of relatives and friends after a disaster), this combination has the potential to send a message involving any words.

I mentioned a localizable sentence as follows.

The subject noun of the sentence being constructed is as follows.

There could be a number of similar sentences, such as the following.


The direct object noun of the sentence being constructed is as follows.

The verb of the sentence being constructed is as follows.

The verb is in the present tense.

The verb is in the future tense.

The day of the week is as follows.


This technique is interesting because the localization of the sentence about the apricot tree into the target language is carried out at the recipient's end, and the recipient would typically be a native speaker of that language.

Also, the same message could be sent to more than one recipient, and the recipients would not necessarily all localize the sentence into the same target language.
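
The idea of localizing at the recipient's end can be sketched in Python with two small lookup tables. The tables, the function name localize and the French wording are all invented for this illustration; they are not PanLex data, and the codes are the example codes used above, not real assignments.

```python
# Illustrative tables only: example codes from this text, hypothetical wording.
SENTENCES = {
    "67001": {
        "en": "The subject noun of the sentence being constructed is as follows.",
        "fr": "Le nom sujet de la phrase en cours de construction est le suivant.",
    },
}
WORDS = {
    "72519432": {"en": "apricot (tree)", "fr": "abricotier"},
}


def localize(items: list[tuple[str, str]], language: str) -> list[str]:
    """Look up each received (kind, code) pair in the recipient's chosen language."""
    tables = {"sentence": SENTENCES, "word": WORDS}
    return [tables[kind][code][language] for kind, code in items]


# The same received items, localized by two different recipients.
received = [("sentence", "67001"), ("word", "72519432")]
print(localize(received, "en"))
print(localize(received, "fr"))
```

The sender's message is the same in both cases; only the recipient's choice of language differs.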

I hope that this is of interest.