[vorbis] TAG Standard - ENSEMBLE/PERFORMER tags
g_ogg at zewt.org
Wed Jan 9 19:35:16 PST 2002
On Wed, Jan 09, 2002 at 09:24:14PM -0500, Peter Harris wrote:
> > > Right now, oggenc works entirely in the local character set, and only
> > > converts to UTF8 at the last possible instant.
> > That's probably the Right Thing.
> On Unix and maybe 9x, yes. On NT, the 'local character set' is UCS2. I
> suppose you could detect NT vs 9x and use the UNICODE file functions on NT,
> but there is no 'Right Thing' for all windows in general.
Not true, in 2K anyway.
For example, take WinMX. It only has a standard WM_CHAR message
handler. (No, I don't have source; this is the standard problem, and
I'm assuming for the example that it's like everything else.) If I
enter Japanese text into WinMX (via the IME), the characters get turned into "?".
Why? Well, the first thing it tries to do is send via the IME API. It
doesn't take that, so it falls back on WM_WCHAR (IIRC), which sends it
in Unicode. It doesn't take that, either, so it falls back on WM_CHAR
(which is what most everything uses). To do that, it has to fit it into
the system codepage, which is English. (Er, don't remember what number.)
It doesn't fit, so it gets changed to a "?".
If I change my system to "Japanese" in the Regional Settings dialog box
(and reboot--grr), and do the same thing, it magically works. The local
codepage is Japanese (shift-jis), so it fits in WM_CHAR. (I'm not sure
exactly how this works.)
NT merely adds Unicode as a global catch-all that it tries before the
codepage-based WM_CHAR fallback.
> Actually, vcedit.c does all work in UTF8. The user of vcedit is required to
> do locale conversions for both input and output. A good idea for a library,
> in my opinion.
I disagree. For example, if we require users to do these conversions,
and a Japanese programmer who (for any, irrelevant reason) doesn't like
Unicode and doesn't want to spend any time doing these conversions uses
it, he'll probably just dump in SJIS (or JIS or EUC-JP) text to the
functions. It'll probably "work" for him (i.e. actually store SJIS in
the file), and a lot of Japanese users will end up with SJIS tags. (I'm
beginning to believe this is exactly what happened with ID3V2.) We need
to preempt that at every turn and make sure tag data is UTF-8.
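One way to preempt it at the library boundary is to refuse byte strings that aren't well-formed UTF-8 -- raw SJIS or EUC-JP text almost never validates. A minimal sketch (the name `is_valid_utf8` is mine, not anything in vcedit.c; it skips a few fine points like surrogate-range rejection):

```c
#include <stddef.h>

/* Return 1 if s (len bytes) is well-formed UTF-8, else 0.  Simplified:
   it checks lead/continuation byte structure and the common overlong
   forms, but not surrogate-range encodings.  Raw Shift-JIS or EUC-JP
   will almost always fail, so a tag writer can reject it outright
   instead of silently storing mislabeled bytes. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;                         /* continuation bytes expected */
        if (c < 0x80)                { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { n = 1; if (c < 0xC2) return 0; }
        else if ((c & 0xF0) == 0xE0) { n = 2; }
        else if ((c & 0xF8) == 0xF0) { n = 3; if (c > 0xF4) return 0; }
        else return 0;                    /* continuation or invalid lead */
        if (i + n >= len) return 0;       /* truncated sequence */
        for (size_t j = 1; j <= n; j++)
            if ((s[i + j] & 0xC0) != 0x80) return 0;
        i += n + 1;
    }
    return 1;
}
```

With a check like this, the "just dump in SJIS" path fails loudly at the API instead of quietly corrupting tags.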
Another, more general reason I disagree: If I'm a Unix programmer (hey
wait, I am!), all of my C library calls will respect the locale. All of
my own functions will, too, unless I'm writing an internally-UTF8
program. Everybody respects what LC_CTYPE says; if my program is internally
UTF8, I'll probably override LC_CTYPE to UTF-8. Well-behaved, modern
libraries should honor that locale setting.
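The locale-to-UTF-8 step the library would do for such a caller is small. A sketch using iconv (the helper name `to_utf8` is mine; on a real Unix system the `from` argument would come from nl_langinfo(CODESET) after setlocale(LC_CTYPE, "")):

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Convert a NUL-terminated string from encoding `from` into UTF-8.
   Returns a malloc'd buffer, or NULL on error. */
static char *to_utf8(const char *from, const char *src)
{
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1) return NULL;

    size_t inleft = strlen(src);
    size_t outsize = inleft * 4 + 1;   /* generous worst case for UTF-8 */
    char *out = malloc(outsize);
    if (!out) { iconv_close(cd); return NULL; }

    char *inp = (char *)src, *outp = out;
    size_t outleft = outsize - 1;
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        free(out); iconv_close(cd); return NULL;
    }
    *outp = '\0';
    iconv_close(cd);
    return out;
}
```

The point being: if the library does this once, internally, no caller ever has a reason to hand it non-UTF-8 bytes.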
> Actually, it's the regular main() that's a problem on Windoze. On NT at
> least, in the case of main(), the commandline has already been converted for
> us (losing some information, possibly destroying the file name).
Yeah, the filenames in NT are Unicode anyway so they have to be
converted at some point from the local codepage. Got that. (I haven't
done this stuff in Windows in a while, so I'm rusty, and I'm currently
scraping off rust that I'd rather have left in place. :)
> > As for printing: like I said, on Unix systems, don't convert at all.
> > Leave 'em exactly as the user gave them.
> Doesn't work on NT. (I'm not quite so certain about 9x).
> argv is in CP_ACP. printf() requires CP_<current set>. Translation is
> required for reasonable output. Annoying but true.
There are wide versions of printf (_wprintf, iirc), etc. We just have
to translate from UTF-8 back to UCS2.
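That back-conversion is mechanical. A portable sketch of BMP-only UTF-8 to UCS-2 decoding (the name `utf8_to_ucs2` is mine; on Windows the resulting buffer would go to a wide printf):

```c
#include <stddef.h>

/* Decode UTF-8 into 16-bit UCS-2 code units, BMP only: 4-byte
   sequences (outside U+0000..U+FFFF) and malformed bytes become '?'.
   Returns the number of units written, at most `max`. */
static size_t utf8_to_ucs2(const unsigned char *s, size_t len,
                           unsigned short *out, size_t max)
{
    size_t i = 0, o = 0;
    while (i < len && o < max) {
        unsigned char c = s[i];
        if (c < 0x80) {
            out[o++] = c; i += 1;
        } else if ((c & 0xE0) == 0xC0 && i + 1 < len) {
            out[o++] = (unsigned short)(((c & 0x1F) << 6) | (s[i+1] & 0x3F));
            i += 2;
        } else if ((c & 0xF0) == 0xE0 && i + 2 < len) {
            out[o++] = (unsigned short)(((c & 0x0F) << 12) |
                        ((s[i+1] & 0x3F) << 6) | (s[i+2] & 0x3F));
            i += 3;
        } else {
            out[o++] = '?'; i += 1;   /* 4-byte sequences and garbage */
        }
    }
    return o;
}
```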
> *searches documentation*
> Eww. It looks like 9x has the same problem as NT: argv is in CP_ACP,
> filenames and printf() are in CP_<current code page>
main(), and all regular (non-wide) functions have the same behavior in
this respect between 9x and NT. (I'm sure there are a lot of
exceptions, but that's the principle; it's to maintain compatibility.)
> > On NT systems, we have to convert them back to the local codepage.
> > Since we'll always convert them to UTF-8 (since they start out UCS2)
> > it's useless to output them unconverted. What happens if we're on an NT
> > English system, and we display Japanese text this way? I'm not sure; I
> > suspect it'll print "?"s, since we just converted it to the English
> > codepage. To fix this, we'd need to wrap printf to do the same thing as
> > fopen. (If the fopen wrapper already exists, then this one's no big
> > deal to add.)
> I don't know what it will do to Japanese text on an english system. I do
> know that converting from UNICODE to CP_<current page> drops accents in an
> attempt to 'best-fit' european characters on my english system.
I'm 98% sure it'll output "?" for any character it can't fit at all,
for, including CJK. On a 9x system, for a console app, there isn't
anything we can do about this, AFAIK. On an NT system, we have _wprintf
available, though I suspect that'll just convert it to the local
codepage and do the same thing. (In which case, there's nothing we can
do about it anyway.)
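That "?" fallback is easy to model. A sketch of the lossy UTF-8-to-narrow-codepage step (the name `utf8_to_latin1` is mine, with Latin-1 standing in for whatever the console codepage is -- the same kind of best-fit/"?" substitution the Windows conversion does):

```c
#include <stddef.h>

/* Lossily convert UTF-8 to Latin-1: any code point above U+00FF
   (all CJK, for instance) becomes '?', mirroring the codepage
   fallback described above.  Writes at most max-1 bytes plus a NUL;
   returns the number of bytes written. */
static size_t utf8_to_latin1(const unsigned char *s, size_t len,
                             char *out, size_t max)
{
    size_t i = 0, o = 0;
    while (i < len && o + 1 < max) {
        unsigned char c = s[i];
        unsigned int cp;
        if (c < 0x80)                               { cp = c; i += 1; }
        else if ((c & 0xE0) == 0xC0 && i + 1 < len) {
            cp = ((c & 0x1F) << 6) | (s[i+1] & 0x3F); i += 2;
        } else if ((c & 0xF0) == 0xE0 && i + 2 < len) {
            cp = ((c & 0x0F) << 12) | ((s[i+1] & 0x3F) << 6) | (s[i+2] & 0x3F);
            i += 3;
        } else { cp = '?'; i += 1; }
        out[o++] = (cp <= 0xFF) ? (char)cp : '?';
    }
    out[o] = '\0';
    return o;
}
```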
If it's simply impossible to output text in other codepages in a console
app in Windows, then we should forget about _wprintf (it won't buy us
anything). Just convert from UTF-8 to the LC_MESSAGES encoding (I think) and
print it; for a Windows system, that locale setting will be the local
codepage anyway.
> > Win98 users can't use a Unicode version of the program, since the OS
> > doesn't support it. (It can handle codepage conversions, I *think*, but
> > it can't deal with wmain(), for example.) If the only way to get that
> > to work is to use a Unicode version, then the 9x Japanese users have no
> > option at all.
> According to my copy of the docs, Win9x supports GetCommandLineW(). I'm
> guessing it will convert the command line from the current code page to Unicode.
If so, then using that is probably a better idea, so all Windows
programs use the same entry point. (It'd still be a wrapper, to convert
the arguments to UTF-8.)
> > So, what is LC_CTYPE set to? 1: In Unix, it's whatever LC_CTYPE is set
> > to. Leave it alone. 2: In Windows with Unicode, it's UTF-8. 3: In
> > Windows without Unicode, it's the codepage ("CPnnn".) #2 should be
> > optional; programmers should be able to use #3 on NT systems if they
> > don't want to deal with UTF-8 directly at all. (This would result in
> > less than ideal behavior for people like me, on English systems
> > displaying Japanese, but that's going to be the case anyway.)
> #3 is what we do right now. See vorbis-tools/share/utf8.c, inside #ifdef
> For option #3, on my Canadian English NT system, LC_CTYPE (or, at least, the
> type of argv) is always CP_ACP, regardless of the code page of the console.
> Unfortunately, printf() expects data in CP_<current set code page>.
> Option #2 is what I would like to do. I'd like to keep it as clean as
> possible, though. Sounds like an fopen() wrapper will be necessary. I'm not
> quite so certain about what to do with printf(), though. (Definitely some
> sort of wrapper, but if I can get away with a puts() wrapper instead of a
> printf() wrapper, that would be _much_ easier to deal with).
I'm saying that it's important that we do *all three* of them. All
three can be done at once, cleanly and without a lot of code. If you
need more explanation, let me know and I'll give it.
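To make "all three at once" concrete, here's a hypothetical policy switch (every name here is invented for illustration, nothing from vcedit.c) that picks the encoding tag input should be converted from before the UTF-8 step:

```c
#include <string.h>

/* Hypothetical selector covering the three cases discussed:
   1: Unix                      -> whatever LC_CTYPE's codeset is
                                   (caller passes it in)
   2: Windows, UTF-8-aware app  -> "UTF-8" (no conversion needed)
   3: Windows, codepage app     -> the ANSI codepage, e.g. "CP1252" */
static const char *tag_input_encoding(int on_windows, int caller_is_utf8,
                                      const char *lc_ctype_codeset,
                                      const char *ansi_codepage)
{
    if (!on_windows)
        return lc_ctype_codeset;              /* case 1 */
    return caller_is_utf8 ? "UTF-8"           /* case 2 */
                          : ansi_codepage;    /* case 3 */
}
```

The actual conversion behind it is the same single UTF-8 routine in every case, which is why all three options cost almost no extra code.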
We need to make it easy for Japanese programmers to use vcedit.c as a library,
without having to deal with encodings, and get correct behavior. If we
don't make it happen transparently, they'll probably just put SJIS in the tags.
I'm willing to help in this directly, by the way. I think the only
point we're actively disagreeing on is whether the API should always
take UTF-8 (you) or LC_CTYPE (me); read what I said about it and let's
try to resolve that.
--- >8 ----
List archives: http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'vorbis-request at xiph.org'
containing only the word 'unsubscribe' in the body. No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.
More information about the Vorbis mailing list