No subject

Fri Aug 6 15:17:01 PDT 2004

"Content vector format
. A case-insensitive field name that may consist of ASCII 0x20 through
0x7D, 0x3D ('=') excluded. ASCII 0x41 through 0x5A inclusive (A-Z) is to
be considered equivalent to ASCII 0x61 through 0x7A inclusive (a-z). 
. The field name is immediately followed by ASCII 0x3D ('='); this equals
sign is used to terminate the field name. 
. 0x3D is followed by 8 bit clean UTF-8 field contents to the end of the
field."
                                  ^^^^^
This says it's UTF-8, and I think that's a very good decision.  This means
we don't have to deal with DBCS encodings: disgusting, mostly obsolete
beasts.  (This helps on embedded devices, too--you don't have to support
every encoding under the sun, just UTF-8.  The limiting factor will
probably be fonts.)
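
The three rules quoted above can be sketched as a small parser.  This
is only an illustration in Python, not code from any Vorbis library:
the name parse_comment is made up, and upper-casing the field name is
just one way to get the case-insensitive comparison the spec asks for.

```python
from typing import Tuple


def parse_comment(raw: bytes) -> Tuple[str, str]:
    """Split one comment vector b'NAME=value' into (NAME, value).

    Hypothetical helper, following only the rules quoted above.
    """
    # The first '=' terminates the field name; later ones belong to the value.
    name, sep, value = raw.partition(b"=")
    if not sep:
        raise ValueError("no '=' terminating the field name")

    # Field names may only use ASCII 0x20 through 0x7D ('=' is already
    # excluded because we split on the first one).
    for byte in name:
        if not 0x20 <= byte <= 0x7D:
            raise ValueError("byte 0x%02X not allowed in field name" % byte)

    # A-Z and a-z are equivalent, so normalize to one case (an assumption:
    # upper case is chosen here only for readability).
    field = name.decode("ascii").upper()

    # The contents are UTF-8 and nothing else; invalid input raises
    # UnicodeDecodeError instead of being guessed at.
    return field, value.decode("utf-8")
```

Note that there is no charset parameter anywhere: the only decoder the
parser ever needs is UTF-8.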

> Here is an example of what I mean, taken from a recent message to the
> debian-devel mailing list:
> 
> From: =?ks_c_5601-1987?B?x+7FuMDM?= <dkfjskd-dd at hotmail.com>
> To: debian-devel at lists.debian.org
> Subject: =?ks_c_5601-1987?B?W7GksO1dIMfuxbjAzMO1sbk=?=
> 
> Here is what that showed up as in mutt:
> 
> From: \307\356\305\270\300\314 <dkfjskd-dd at hotmail.com>
> To: debian-devel at lists.debian.org
> Subject: [\261\244\260\355] \307\356\305\270\300\314\303\265\261\271
> 
> But in pine it some how magically showed up as Korean glyphs.

This is the old way of doing arbitrary encodings in mail.  UTF-8
obsoletes it.  (If you don't know about UTF-8, I strongly suggest
becoming familiar with it; http://www.cl.cam.ac.uk/~mgk25/unicode.html
is a good start.)
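
For comparison, here is roughly what a reader has to do to display the
headers quoted above.  This is a sketch using Python's standard
email.header module; the wrapper name decode_mime_header is made up,
and it assumes the local Python build ships a codec for whatever
charset the sender named.

```python
from email.header import decode_header


def decode_mime_header(value: str) -> str:
    """Decode "=?charset?B?...?=" style encoded words in a header value.

    Hypothetical wrapper around the standard library's decode_header().
    """
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Each encoded word names its own charset; unencoded runs
            # come back with charset None and are treated as ASCII here.
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)


# The From: name from the quoted message.  It only decodes if a codec
# for ks_c_5601-1987 is available; otherwise this raises LookupError.
name = decode_mime_header("=?ks_c_5601-1987?B?x+7FuMDM?=")
```

Every charset a sender might name needs a decoder on the receiving
side, which is the complexity a UTF-8-only format avoids.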

The main reason most people don't use UTF-8 as the default encoding in
mail is that older MUAs don't support it.

> So, since we already have an RFC approved standard (I'm assuming; I've
> been seeing these types of emails for years) for mixing foreign glyphs
> with real text, lets use it.

This RFC is for email, and it's an old, ugly way of doing things that
UTF-8 supersedes in most ways.  For example, you can cat a mailbox
in which all mails have been converted to UTF-8, directly, and you see
everything as it's supposed to be seen (except for the glyph issues);
try to cat an mbox containing varying encodings and you'll get junk.
(Well, if you're not on a UTF-8 terminal you have to pipe it through
iconv; but you can only do that if the whole mailbox is in one
encoding, as it is with UTF-8 or another Unicode encoding.)
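
As a rough illustration of the conversion step, here is an iconv-like
transcode in Python.  The function names to_utf8 and is_valid_utf8 are
made up for this sketch, and the source charset has to be known in
advance, just as it does with iconv's -f option.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Re-encode bytes from a known legacy charset as UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


latin1_bytes = b"caf\xe9"            # "cafe" with e-acute, in ISO-8859-1
converted = to_utf8(latin1_bytes, "iso-8859-1")

# The legacy bytes are junk to a UTF-8 reader until they are converted.
before = is_valid_utf8(latin1_bytes)
after = is_valid_utf8(converted)
```

Once everything is converted, a single decoder handles the whole
mailbox; before that, each message needs its own charset lookup.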

Also, if you want simple, you *don't* want MIME in the tags.  UTF-8 for
everything is extremely simple (you can even ignore the "lang" tag if
you, as an implementor, don't care about the glyph problems).  With
arbitrary encodings, everything gets more complicated.  I think that
your own Mutt binary failing to decode it properly is a good indicator. :)

(An aside: mutt *should* be able to figure out anything pine can; you
might have a mutt without iconv or MBCS support.  mutt -v should
probably list HAVE_WC_FUNCS and HAVE_ICONV.)

> For the tags themselves, they are standard, and they're staying that
> way.  I'm not going to encode CONDUCTOR into Chinese.  Because its a
> standard tag, the player can translate it if it wants to.  And I see
> no reason why a Chinese language encoder couldn't take their equivalent
> of "conductor" and encode it as the CONDUCTOR tag in the ogg file
> itself, making it invisible to the Chinese speaking user.

No argument there; the actual tag names should be completely invariant.
They're for interpretation by a parser, not a user.

-- 
Glenn Maynard
