May 9th, 2004

CircleID on RFC 3743

CircleID asked me to write an article on our recently published RFC 3743. Check it out and let me know what you think (apart from the grammar mistakes, which I keep finding every time I read it again :-)

It is difficult to explain RFC 3743, commonly known as the Joint Engineering Team (JET) Guidelines, without some background on Chinese, Japanese and Korean (CJK), particularly how they relate to Internationalized Domain Names (IDN). Luckily, an Internet-Draft [PDF] we wrote back in 2001 discusses the issues quite neatly in this context.

In brief, Chinese characters (Hanzi), or Han Ideographs, evolved from pictographs (writing made up of pictures) over thousands of years. Unlike other writing systems, Han Ideographs are constantly evolving. In the 1950s, China undertook a major exercise to simplify Chinese writing using an almost systematic process. The resulting form, Simplified Chinese, is now used in China and Singapore, while the original form, Traditional Chinese, is still used in Taiwan, Hong Kong, Malaysia and most overseas Chinese communities.

Because the simplification process was almost systematic, there is a roughly 1-to-1 mapping between Simplified Chinese and Traditional Chinese. It is tempting to liken it to uppercase/lowercase in English, but that is at best a bad analogy that grossly underestimates the depth of the problem.

If that is not complicated enough, Han Ideographs are also used to write Japanese (Kanji), Korean (Hanja) and old Vietnamese (Chu Han and Chu Nom), and each language has its own simplification history and rules. In addition, there are many Han Ideographs that look exactly the same (CJK Compatibility) or look similar (zVariants) but are assigned different code points in Unicode.

Some readers may point out that the IETF standard for IDN, RFC 3490 (IDNA), includes RFC 3454 (stringprep) and RFC 3491 (nameprep), which specify a normalization process that “will allow users to enter internationalized domain names into applications and have the highest chance of getting the content of the strings correct.” Unfortunately, it does not perform any normalization for Han Ideographs, as there is no universal, well-defined matching/folding rule for Han Ideographs across CJK. What works for one language will inevitably mess up another.

Hence, the next best solution is to ensure CJK IDNs are handled correctly at the registries or registrars; RFC 3743 is the result of a two-year effort by JET to define administrative guidelines for CJK IDNs. While RFC 3743 was published only 2 weeks ago (April 15, 2004), it has been in circulation within the community for a long time and is highly regarded. In fact, the ICANN IDN Guidelines are heavily influenced by the JET Guidelines, and today IANA maintains an IDN Characters Tables Registry for RFC 3743.

Without going into the gory technical details, RFC 3743 introduces two main concepts in CJK IDN administration:

The first is a set of valid character code points that can be registered. There are nearly 70,000 Han Ideographs in Unicode 3.2 and over 96,000 in Unicode 4.0, and only a fraction (between 1,200 and 9,000 ideographs, varying between countries) is commonly used. Therefore, it is more sensible to restrict CJK IDN registration to a subset of Han Ideographs than to allow any code point to be registered.
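To make the first concept concrete, here is a minimal sketch of the kind of check a registry could run before accepting a label. The ALLOWED set and the is_registrable function are hypothetical names of my own, and the handful of characters is a toy stand-in for a real table of permitted code points, which would hold thousands of entries.

```python
# Toy table of permitted code points (an assumption for illustration only;
# a real registry table would list thousands of ideographs).
ALLOWED = set("聯想集團联团")

def is_registrable(label, allowed=ALLOWED):
    """Return True only if every character of the label is in the permitted set."""
    return all(ch in allowed for ch in label)

print(is_registrable("联想集团"))  # every character is in the toy table
print(is_registrable("联想集団"))  # 団 is not in the toy table
```

Anything outside the table is simply rejected at registration time, which keeps the rarely used ideographs out of the zone altogether.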

The second is administering CJK IDNs as a “package”. Due to simplifications and variants, each Han Ideograph may have 1 or more variants. For example, a Traditional Chinese IDN would have a corresponding Simplified Chinese IDN, and it would be confusing if the two IDNs were registered to different parties.

Thus, to avoid potential conflict, when someone registers a CJK IDN, the other related variants (generated using the algorithm defined in the JET Guidelines) are also marked reserved. These variants are administered as a single atomic “package”, i.e., if one IDN is deleted, the whole package is deleted, and when one IDN is transferred, the whole package is transferred.
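The “package” behaviour can be sketched in a few lines. This is only my illustration of the idea, not the data model from RFC 3743: the Registry class, its method names and the way the variants are passed in are all assumptions.

```python
class Registry:
    """Toy registry that administers an IDN and its variants as one package."""

    def __init__(self):
        self.index = {}     # label -> package id
        self.packages = {}  # package id -> {"owner": ..., "labels": set of labels}

    def register(self, owner, label, variants):
        labels = {label, *variants}
        # Refuse the registration if any label already belongs to a package.
        if any(l in self.index for l in labels):
            raise ValueError("label or variant already registered or reserved")
        self.packages[label] = {"owner": owner, "labels": labels}
        for l in labels:
            self.index[l] = label

    def owner(self, label):
        return self.packages[self.index[label]]["owner"]

    def transfer(self, label, new_owner):
        # Transferring any one label transfers the whole package.
        self.packages[self.index[label]]["owner"] = new_owner

    def delete(self, label):
        # Deleting any one label deletes the whole package.
        package = self.packages.pop(self.index[label])
        for l in package["labels"]:
            del self.index[l]


registry = Registry()
registry.register("Legend Group", "聯想集團", ["联想集团"])
registry.transfer("联想集团", "New Owner")
print(registry.owner("聯想集團"))  # the Traditional form moved with the package
```

Because every label points at the same package record, there is no way for two variants to end up with different owners.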

For example, if Legend Group registers its Traditional Chinese IDN, the Simplified Chinese IDN and all its possible permutations would be reserved. All these IDNs are considered “equivalent” and are administratively treated as a single atomic “package”.

[Figure: Examples of possible Legend Group IDNs in Chinese]

Incidentally, those who can read Chinese will realize that some of the variants mentioned above are not commonly used or don’t make sense. The algorithm uses a dumb code point permutation. A more intelligent algorithm could be designed using CJK lexemic and orthographic data, but it would be very expensive to implement. The goal of the algorithm is not to generate meaningful variants but to ensure that the potential variants are reserved, minimizing potential disputes.
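A dumb code point permutation of this kind can be sketched as follows. This is a simplification under my own assumptions, not the exact procedure in the JET Guidelines: the VARIANTS table is a toy with only two character pairs, and generate_variants is a hypothetical helper name.

```python
from itertools import product

# Toy variant table (an assumption for illustration, not a real registry table).
# Each character maps to every form it may be replaced with, itself included.
VARIANTS = {
    "聯": ["聯", "联"],
    "联": ["联", "聯"],
    "團": ["團", "团"],
    "团": ["团", "團"],
}

def generate_variants(label):
    """Return every permutation of the label, one character position at a time."""
    # Characters that are not in the table have no variants and stay as they are.
    choices = [VARIANTS.get(ch, [ch]) for ch in label]
    return ["".join(combo) for combo in product(*choices)]

for variant in generate_variants("聯想集團"):
    print(variant)
```

With this toy table, the four-character label yields 2 x 1 x 1 x 2 = 4 strings, including mixed Traditional/Simplified forms that nobody would actually write. The number of permutations is the product of the number of forms at each position, which is why it grows so quickly with longer labels.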

Finally, I would like to point out that it is common for a Han Ideograph to have 2 or more variants and, in some rare cases, more than 10. So for an IDN with 10 Han Ideographs, there could be 100 to 10,000 variants. The cost of registering all these variants one by one to “protect” just one IDN, and the domain disputes that could follow if it is not done properly, are left as an exercise for the reader. The importance of RFC 3743, or the JET Guidelines, should be obvious thereafter.