[vorbis] UTF8_LANG: a much better idea

Jonathan Walther krooger at debian.org
Thu Jan 10 13:55:05 PST 2002


On Thu, Jan 10, 2002 at 06:06:49AM -0500, Glenn Maynard wrote:
>A technical description is at http://www.unicode.org/unicode/reports/tr27/#tag
>which, like all specs, makes it sound a bit more complicated than it
>really is.

Thank you Glenn.  I have updated the proposal in light of your
information.  Excellent research.  It now reads like this:

Character Set Encoding of Tags:
===============================

UTF-8 is the default encoding for tag data.  Unfortunately Unicode (of
which UTF-8 is just one encoding) muffed it for Asian languages: Han
unification gives Chinese, Japanese, and Korean characters the same
code points, which is the equivalent of giving the same character codes
to English, Russian, and Greek letters.  So originally we were going to
let people use RFC2047 encoding, or a UTF8_LANG tag.

Fortunately Unicode itself has an internal, standard solution to the
problem:
    http://www.unicode.org/unicode/reports/tr27/#tag
which basically says: mark the language of text with U+E0001 LANGUAGE
TAG, followed by the RFC 3066 language ID (e.g. "ja") written in
lowercase ASCII with 0xE0000 added to each character's code point.
This is the only mechanism recognized by the standard.
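
A minimal sketch of that encoding in Python, in case it helps to see it
spelled out.  The names language_tag and tag_value are made up for
illustration; they are not part of any Vorbis library or of the
proposal itself.

```python
# Hypothetical helpers illustrating the TR27 language-tag scheme.
LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000          # added to each ASCII code point


def language_tag(lang_id: str) -> str:
    """Return the tag-character sequence for an RFC 3066 language ID."""
    lang_id = lang_id.lower()
    # Only printable ASCII (0x20..0x7E) has a tag-character counterpart.
    if not lang_id or any(not (0x20 <= ord(c) <= 0x7E) for c in lang_id):
        raise ValueError("language ID must be non-empty printable ASCII")
    return LANGUAGE_TAG + "".join(chr(ord(c) + TAG_OFFSET) for c in lang_id)


def tag_value(lang_id: str, text: str) -> str:
    """Prefix a tag value with its language marker."""
    return language_tag(lang_id) + text


if __name__ == "__main__":
    marked = tag_value("ja", "\u65e5\u672c\u8a9e")
    # Show the code points and the UTF-8 bytes that would be stored.
    print([hex(ord(c)) for c in marked])
    print(marked.encode("utf-8").hex(" "))
```

Each tag character takes four bytes in UTF-8, so marking a value as
"ja" costs twelve bytes of overhead.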

Programs which don't want to interpret such markup can simply ignore it; it
is zero width.  The scope of a language setting runs until the end of the
Vorbis tag's value, or until a new language setting is encountered,
whichever comes first.
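
Both behaviours can be sketched in a few lines of Python.  This is only
an illustration under the scoping rule described above; the function
names strip_language_tags and language_runs are hypothetical.

```python
# Hypothetical helpers: ignore language markup, or split a value by it.
LANGUAGE_TAG = "\U000E0001"        # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000
TAG_BLOCK = range(0xE0000, 0xE0080)  # the Unicode Tags block


def strip_language_tags(value: str) -> str:
    """Drop every tag character, for programs that ignore the markup."""
    return "".join(c for c in value if ord(c) not in TAG_BLOCK)


def language_runs(value: str):
    """Split a tag value into (language, text) runs.

    A language applies until the next LANGUAGE TAG or the end of the
    value.  Text before any marker gets the language None.
    """
    runs, lang, buf = [], None, []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == LANGUAGE_TAG:
            if buf:
                runs.append((lang, "".join(buf)))
                buf = []
            i += 1
            chars = []
            # Collect the tag characters that spell the language ID.
            while i < len(value) and 0xE0020 <= ord(value[i]) <= 0xE007E:
                chars.append(chr(ord(value[i]) - TAG_OFFSET))
                i += 1
            lang = "".join(chars) or None
        elif ord(ch) in TAG_BLOCK:
            i += 1  # stray tag character with no LANGUAGE TAG: skip it
        else:
            buf.append(ch)
            i += 1
    if buf:
        runs.append((lang, "".join(buf)))
    return runs
```

A player that only wants to display the text would call
strip_language_tags; one that picks fonts per language would walk the
result of language_runs.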
