[vorbis] TAG Standard - ENSEMBLE/PERFORMER tags

Glenn Maynard g_ogg at zewt.org
Wed Jan 9 16:24:52 PST 2002



On Wed, Jan 09, 2002 at 05:50:35PM -0500, Peter Harris wrote:
> Oh, ewww. It's even worse than I thought it was.

The only really bad thing about it is that, on a non-NT system, you
can't really interface with the user outside of the codepage.  At all.
(Win32 widgets, main(), etc. all deal either in the codepage or in Unicode.)

> > Get main() as small as possible, write a Unicode and ANSI version of
> > main(), and have the Unicode version convert to UTF-8.  This could be
> > done with no duplication of code, ie:
> 
> Right now, oggenc works entirely in the local character set, and only
> converts to UTF8 at the last possible instant.

That's probably the Right Thing.

I think the trick is to realise there are multiple settings for
encoding.  You have LC_CTYPE, the encoding that string buffers and the like use.
You also have LC_MESSAGES, the encoding that printed messages are in.
In Unix, they should be honored as usual.  In Windows, we don't have
those, so we can set LC_CTYPE to UTF-8 to tell the comment code that
it's being passed UTF-8, and set LC_MESSAGES to the codepage.  (Though,
libraries probably shouldn't deal with literal strings much anyway; they
should just set error constants and let user code decide what to print.)
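
To make that concrete, here's roughly what I have in mind for the entry
points (untested sketch; real_main() is just a placeholder name for the
shared code, not anything that exists yet):

#include <stdlib.h>

extern int real_main(int argc, char **argv);   /* hypothetical shared entry
                                                   point; expects UTF-8 argv */

#ifdef _WIN32
#include <windows.h>

/* Unicode entry point: convert the UCS-2 argv to UTF-8, so real_main()
 * only ever sees UTF-8 -- its "LC_CTYPE" is effectively UTF-8. */
int wmain(int argc, wchar_t **wargv)
{
    char **argv = malloc((argc + 1) * sizeof *argv);
    int i;

    for (i = 0; i < argc; i++) {
        int len = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1,
                                      NULL, 0, NULL, NULL);
        argv[i] = malloc(len);
        WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, argv[i], len, NULL, NULL);
    }
    argv[argc] = NULL;
    return real_main(argc, argv);
}

#else

/* ANSI/Unix entry point: pass the arguments through untouched, still in
 * whatever encoding the user's environment (LC_CTYPE or the codepage) uses. */
int main(int argc, char **argv)
{
    return real_main(argc, argv);
}

#endif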

> Does getopt*.c already work on UTF-8 input? If that is the case, I'd argue
> for vorbiscomment and oggenc to be rewritten to use all UTF8 internally.
> Then it's very simple to use wmain and the UCS2->UTF8 converter for windows
> systems, and main with ICONV on Unix. Both wmain and main would then call
> real_main (which would, of course, expect UTF8 only).

There's nothing special it needs to do.  That's the nice thing about UTF-8:
--馬鹿 can be treated as a regular C string, knowing nothing about the
encoding at all.  (Unless you need to do something like collate it, or
"compare the first N characters", but I doubt getopt does that.  You can
do strchr(buf, '=') on a UTF-8 string just fine.)
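
For example, splitting a --TAG=value argument works on UTF-8 with no
special handling at all (untested, just to illustrate the point):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "--馬鹿=baka" as raw UTF-8 bytes; we know nothing about the encoding. */
    const char *arg = "--\xe9\xa6\xac\xe9\xb9\xbf=baka";
    const char *eq = strchr(arg, '=');   /* safe: '=' never occurs inside a
                                            multibyte UTF-8 sequence */

    printf("tag: %.*s  value: %s\n", (int)(eq - arg - 2), arg + 2, eq + 1);
    return 0;
}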

Here's the trick, though: we need to make sure that the libraries we end
up with are capable of running in any encoding.  If someone's in a
Shift-JIS encoding, and is writing an editor using this code, he'll
expect it to be able to take Shift-JIS input to the functions.  Looking
at the files, this means vcedit.c needs to be able to deal with
arbitrary encodings; it's the file acting as the "library".

It needs to know what encoding the data passed to its functions is in
(this is LC_CTYPE), and it needs to call iconv (or whatever wrappers the
tools use) to convert that data from LC_CTYPE to UTF-8; it needs to do
the opposite when
returning data.  (In the case of the Win32 wrapper main(), it wouldn't have
to do anything, since LC_CTYPE is UTF-8.)
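
Something along these lines, say (untested; charset_to_utf8() is a
made-up helper name, not anything in vcedit.c today):

#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Convert 'in' from the caller's encoding (the "LC_CTYPE" sense above,
 * e.g. "SHIFT-JIS", "CP1252", or "UTF-8") into a freshly malloc()ed
 * UTF-8 string.  Returns NULL on failure. */
char *charset_to_utf8(const char *encoding, const char *in)
{
    iconv_t cd;
    size_t inleft, outleft;
    char *out, *outp, *inp;

    cd = iconv_open("UTF-8", encoding);
    if (cd == (iconv_t)-1)
        return NULL;

    inleft = strlen(in);
    outleft = inleft * 4;              /* worst-case expansion; NUL added below */
    out = outp = malloc(outleft + 1);
    inp = (char *)in;

    if (out == NULL || iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        free(out);
        iconv_close(cd);
        return NULL;
    }
    *outp = '\0';
    iconv_close(cd);
    return out;
}

The reverse direction (UTF-8 back to LC_CTYPE, for data being returned
to the caller) is the same call with the encodings swapped.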

> I like the idea. It is much cleaner than what I was doing previously.
> 
> The only 'gotcha' I can see right now is error messages that quote bits of
> the command line (or, even worse, file names that are passed in on the
> command line). "Code Page X -> UCS2-> UTF8 -> UCS2-> Code Page X" _should_

Don't think of UCS2 and UTF8 as different encodings.  They're just
different representations of Unicode.  UCS2 -> UTF8 -> UCS2 -> UTF8 ->
UCS2 -> UTF8 -> UCS2 -> UTF8 -> UCS2 will produce the exact same text on
any working implementation.

> produce identical output to the input on the command line, but I'm not sure
> exactly how far I can trust MultiByteToWideChar() <-> WideCharToMultiByte()
> (or a double ICONV for Unix people).

In Unix, filenames should never be touched, not when printed and not
when opening.  Take what I put on the commandline and open it literally;
if you print it, print it literally.  (The assumption is that filenames
are in the same encoding as the terminal.  This is usually true--at
least for encodings with 7-bit ASCII as a subset, which is just about
all of them.  If it happens to be in another encoding, that's OK; we
can't print it, but since we never convert it in any way, it'll still
open.)

In the case of wmain(), I'm not sure.  They've already been converted
for us and there's nothing we can do about it.  I think you're
*supposed* to access them with wide file access functions, which means
filenames need to be converted from UTF-8 to UCS2 and opened with wide
functions (on NT systems.)  PITA, but doable.

This means we need to wrap fopen().  If we're on a system with no wide
functions, simply open it normally (pass through).  If we're on one that
does have them (NT), then we know for sure the filename is UTF-8 (since
our wrapper wmain() converted to that; and our LC_CTYPE is UTF-8
anyway.)  Convert it back to UCS2 and use _wfopen instead.  (We know that
we'll end up with whatever wmain() passed us, since UCS2->UTF8->UCS2
round-trips losslessly.)
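
Roughly like this, say (untested sketch; oggenc_fopen() is a made-up
name, and a real version would need error handling and a runtime check
for Win9x, where the wide file functions aren't available):

#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#include <wchar.h>

/* On NT our wrapper wmain() guarantees filenames are UTF-8, so convert
 * back to UCS-2 and open with the wide-character function. */
FILE *oggenc_fopen(const char *name, const char *mode)
{
    wchar_t wname[MAX_PATH], wmode[16];

    MultiByteToWideChar(CP_UTF8, 0, name, -1, wname, MAX_PATH);
    MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 16);
    return _wfopen(wname, wmode);
}

#else

/* On Unix, never touch the filename: open exactly the bytes we were given. */
FILE *oggenc_fopen(const char *name, const char *mode)
{
    return fopen(name, mode);
}

#endif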

As for printing: like I said, on Unix systems, don't convert at all.
Leave 'em exactly as the user gave them.

On NT systems, we have to convert them back to the local codepage.
Since we'll always convert them to UTF-8 (they start out as UCS2),
it's useless to output them unconverted.  What happens if we're on an NT
English system, and we display Japanese text this way?  I'm not sure; I
suspect it'll print "?"s, since we just converted it to the English
codepage.  To fix this, we'd need to wrap printf to do the same thing as
fopen.  (If the fopen wrapper already exists, then this one's no big
deal to add.)
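
Something like this, say (again untested, and print_utf8() is just a
made-up name):

#ifdef _WIN32
#include <stdio.h>
#include <windows.h>

/* Take a UTF-8 string and print it in the local ANSI codepage.  Anything
 * the codepage can't represent gets replaced (typically with '?'), which
 * is the behavior described above. */
void print_utf8(const char *utf8)
{
    wchar_t wide[1024];
    char local[1024];

    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, 1024);
    WideCharToMultiByte(CP_ACP, 0, wide, -1, local, 1024, NULL, NULL);
    fputs(local, stdout);
}
#endif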

So it becomes a bit more than it was; but it's still a lot simpler than
making every buffer and string operation wide.

> How so? All of the tags are stored as UNICODE UTF8. How is translating the
> command line from (whatever) to UTF8 sooner rather than later going to screw
> CJK Win98 users any more than they already are?

Win98 users can't use a Unicode version of the program, since the OS
doesn't support it.  (It can handle codepage conversions, I *think*, but
it can't deal with wmain(), for example.)  If the only way to get that text
to work is to use a Unicode version, then the 9x Japanese users have no
option at all.

As long as main() behaves like I believe it does, this isn't the case,
however.  We just have to make sure vcedit.c knows that LC_CTYPE is
effectively the codepage, so it knows to convert buffers properly.

So, what is LC_CTYPE set to?

  1: In Unix, it's whatever LC_CTYPE is set to.  Leave it alone.
  2: In Windows with Unicode, it's UTF-8.
  3: In Windows without Unicode, it's the codepage ("CPnnn").

#2 should be optional; programmers should be able to use #3 on NT
systems if they don't want to deal with UTF-8 directly at all.  (This
would result in less-than-ideal behavior for people like me, on English
systems displaying Japanese, but that's going to be the case anyway.)
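
In code, picking that value might look something like this (untested;
UNICODE_BUILD is a made-up stand-in for "we're building with the wmain()
wrapper"):

#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <locale.h>
#include <langinfo.h>
#endif

/* Work out what "LC_CTYPE" means for this build, following cases 1-3 above. */
const char *get_ctype_encoding(void)
{
#if defined(_WIN32) && defined(UNICODE_BUILD)
    return "UTF-8";                      /* case 2: wmain() hands us UTF-8 */
#elif defined(_WIN32)
    static char buf[16];
    sprintf(buf, "CP%u", GetACP());      /* case 3: the local codepage */
    return buf;
#else
    setlocale(LC_CTYPE, "");             /* case 1: honor the user's locale */
    return nl_langinfo(CODESET);
#endif
}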


-- 
Glenn Maynard
