[vorbis] TAG Standard - ENSEMBLE/PERFORMER tags

Glenn Maynard g_ogg at zewt.org
Mon Jan 7 15:54:25 PST 2002



By the way, I suggest a reference to ISO-639 for the UTF8_LANG tag.

On Mon, Jan 07, 2002 at 03:06:55PM -0800, Jonathan Walther wrote:
> Thats what I mean. How is it different to convert an RFC2047 tag into a
> string of ? characters?

If Chinese text is encoded using BIG5, and you don't support it, you
can't display it at all.  If it was encoded in UTF-8 and marked in some
way as Chinese, you could display it correctly.  If it wasn't marked
by language, you could display it with the user's default font.  A
Japanese person would see the Chinese text with a Japanese font.  (Like I
mentioned, this is very often acceptable, at least to a Japanese user.
Presumably, a Chinese user's default font would be a Chinese one.)

> >Using another font is what Explorer does.  Load
> >http://zewt.org/~glenn/test.html with Explorer on a system with both
> >Japanese and Chinese support installed, and both characters (which are
> >the same in Unicode) will display properly.  All it needs to know is the
> >language the text is in.  (That's the best way to do it.  It's also the
> >most complicated of these to implement, which is why my terminal doesn't
> >do it.  Doing it by embedding multiple encodings is more complicated
> >still.)
> 
> Can you explain this?  If the same glyph represents different characters
> in Chinese, Korean, and Japanese, how is it that Explorer knows which
> one the glyph is?  If you can explain that, and it doesn't involve
> embedding html, my objections to straight UTF-8 will be withdrawn.

As far as I know, all you need to render the correct character is the
language it's in.  (That's what you get with HTML's LANG attribute.)
This seems to be a major premise behind Han unification.
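
To make that concrete, here is a rough Python sketch of choosing a font from
nothing but a language code.  The font names and the pick_font helper are
placeholders made up for illustration, not anything a real player uses.

```python
# Hypothetical mapping from an ISO-639 language code to a font family.
# The font names are placeholders; substitute whatever is installed.
FONT_BY_LANG = {
    "ja": "Some Japanese Font",
    "zh": "Some Chinese Font",
    "ko": "Some Korean Font",
}

def pick_font(lang, default_font):
    """Return the font for a language code, or the user's default font."""
    if lang is None:
        return default_font
    return FONT_BY_LANG.get(lang.lower(), default_font)

# U+76F4 is one unified code point whose usual Japanese and Chinese
# glyphs differ, so the text alone cannot say which font to use.
ch = "\u76f4"
print(hex(ord(ch)), pick_font("ja", "default"), pick_font("zh", "default"))
```

An unknown or missing language simply falls through to the user's default
font, which is the behaviour described above for unmarked text.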

If this is to be fixed with an incompatible change (like RFC2047), then it
might as well be as simple as possible.  (I agree that embedding HTML
would be a bad idea; I doubt anyone will disagree there.)

That's one of my big problems with this: it's using encodings where the
only goal is to know what language it's in so the right font can be
used.  That's introducing a lot, when all you *really* need is the
language code.

For example, you could do it in an XML-like way:

TITLE=<ja/>直<zh/>直

This doesn't introduce anything major; I think its use would be fairly
evident from that example alone.  (It differs from the more complicated
way HTML does it in that it simply *sets* the language; it doesn't set
it for a block, so you don't have to keep a stack of languages.)  The
only thing that needs to be made clear is that "<ja>text</ja>" is *not*
allowed.  Of course, anything like this would work ("#JA#text", escaping
#); the nice thing about this is that everyone already knows how it
works.  (This would obsolete UTF8_LANG.)  If it wasn't supported,
anything between < and > could be skipped and you'd still have
reasonable output, unlike RFC2047.  (And you'd still get readable
output if you didn't know about it at all and dumped it as raw output.)
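
Here is a minimal sketch, in Python, of what a reader of such a tag might do.
The names split_lang_runs and strip_markers are invented for this example,
and I'm assuming a marker is always a two- or three-letter code followed by
"/".  It leaves the escaping of literal < and > out entirely.

```python
import re

# Matches a language-switch marker such as <ja/> or <zh/>.
LANG_MARKER = re.compile(r"<([A-Za-z]{2,3})/>")

def split_lang_runs(value):
    """Split a tag value into (language, text) runs.

    Each <xx/> marker sets the language for the text that follows it,
    until the next marker.  Text before any marker has language None.
    No stack is kept, because markers only set; they never close.
    """
    runs = []
    lang = None
    pos = 0
    for m in LANG_MARKER.finditer(value):
        if m.start() > pos:
            runs.append((lang, value[pos:m.start()]))
        lang = m.group(1).lower()
        pos = m.end()
    if pos < len(value):
        runs.append((lang, value[pos:]))
    return runs

def strip_markers(value):
    """Fallback for a reader without language support: drop <...>."""
    return re.sub(r"<[^<>]*>", "", value)
```

A player that understands the markers would call split_lang_runs and pick a
font per run; one that doesn't could call strip_markers and show the rest
in the default font.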

A problem with this example is escaping < and >.  It's not a big deal to
have people use &gt;, &lt; and &amp;, as people are already familiar with
those three escapes--but this would *not* be optional.  (However, if
anything is to be added which is not optional, this is as simple as it
gets.)  Escaping will be needed no matter how this is done, though.
(For RFC2047, encoding text that looks like RFC2047 in RFC2047 is the
equivalent of escaping.  It's more complicated, though, since you need
to parse through the text being edited for RFC2047-like strings.)
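
For the three-escape approach, the whole job is a pair of string
replacements.  This is only a sketch in Python with made-up function names;
the one thing to get right is the order in which & is handled.

```python
def escape_text(text):
    """Escape literal &, < and > before storing text in a tag."""
    # & must be escaped first, so the & in "&lt;" is not escaped again.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

def unescape_text(text):
    """Reverse escape_text when reading a tag back."""
    # &amp; must be unescaped last, so "&amp;lt;" comes back as "&lt;".
    return text.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&")

original = "a < b & c > d"
stored = escape_text(original)
print(stored)
print(unescape_text(stored) == original)
```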

I'd still rather see just UTF8_LANG, since it truly does not introduce
anything incompatible and doesn't require any parsing at all.  If you
ignore it, you'll just display the glyphs in the user's font, which is
*probably* what he wants anyway.  (I think anything else--RFC2047,
<ja/>, etc--is likely to be rejected anyway, on the grounds that the
tags shouldn't require any parsing.)

It's fundamentally impossible to put anything in the tag data portion
except literal data, without requiring any parsing, so UTF8_LANG is as
good as you can get under that restriction.
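
By comparison, here is roughly all a player would need for UTF8_LANG, as a
Python sketch.  I'm assuming the field holds a single ISO-639 code that
applies to every comment in the stream; that detail, and the helper names,
are assumptions for the example rather than anything settled.

```python
def parse_comments(comments):
    """Split "NAME=value" strings; field names compare case-insensitively."""
    fields = {}
    for comment in comments:
        name, sep, value = comment.partition("=")
        if sep:  # skip anything that has no "="
            fields.setdefault(name.upper(), []).append(value)
    return fields

def display_language(fields):
    """Return the UTF8_LANG value, or None if the stream has none."""
    values = fields.get("UTF8_LANG")
    return values[0].lower() if values else None

comments = ["TITLE=\u76f4", "UTF8_LANG=ja"]
fields = parse_comments(comments)
lang = display_language(fields)  # None means: use the user's default font
```

The tag values themselves are never inspected, so a player that ignores
UTF8_LANG loses nothing but the font hint.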


-- 
Glenn Maynard




