[vorbis] Re: UTF8, vorbiscomment, oggenc, and 'vcedit.c'

Thu Jan 10 11:19:22 PST 2002

> > but there is no 'Right Thing' for all windows in general.
>
> Not true, in 2K anyway.
>
> For example, take WinMX.  It only has a standard WM_CHAR message
> handler.
*snip rest of text where you seem to be violently agreeing with me*

For sure. Note, however, that we are talking about a console app here. No
WM_CHAR, or WM_anything messages of any sort.

> > Actually, vcedit.c does all work in UTF8. The user of vcedit is required
> > to do locale conversions for both input and output. A good idea for a
> > library, in my opinion.
>
> I disagree.  For example, if we require users to do these conversions,
> and a Japanese programmer who (for any, irrelevant reason) doesn't like
> Unicode and doesn't want to spend any time doing these conversions uses
> it, he'll probably just dump in SJIS (or JIS or EUC-JP) text to the
> functions.  It'll probably "work" for him (ie. actually store SJIS in
> the file), and a lot of Japanese users will end up with SJIS tags.  (I'm
> beginning to believe this is exactly what happened with ID3V2.)  We need
> to preempt that at every turn and make sure tag data is UTF-8.

I think it will 'just work', right up until that person tries to play back
via XMMS. "Why are all my tags garbage?"
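
One way to "preempt that at every turn" would be for the library to reject
anything that is not well-formed UTF-8 before it is written into a comment.
The following is only a minimal sketch under that assumption, not code from
vcedit.c; the name utf8_is_valid is hypothetical. Typical SJIS text fails this
check because its lead bytes are not valid UTF-8 lead bytes, although a check
like this cannot catch every legacy-encoded string.

```c
#include <stddef.h>

/* Hypothetical helper (not part of vcedit.c): returns 1 if buf[0..len)
 * is well-formed UTF-8, 0 otherwise. */
int utf8_is_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t n;           /* continuation bytes expected after c */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed at this length */
        size_t k;

        if (c < 0x80) { i++; continue; }                 /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;      /* stray continuation byte or invalid lead byte */

        if (len - i <= n) return 0;                      /* truncated sequence */

        for (k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80) return 0;   /* not 10xxxxxx */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        if (cp < min) return 0;                          /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                     /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;      /* UTF-16 surrogate */

        i += n + 1;
    }
    return 1;
}
```

A caller could run this on each comment string and return an error instead of
storing the tag when it fails.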

> Another, more general reason I disagree: If I'm a Unix programmer (hey
> wait, I am!), all of my C library calls will respect the locale.  All of
> my own functions will, too, unless I'm writing an internally-UTF8
> program.  Everybody respects what LC_CTYPE says; if my program is internally
> UTF8, I'll probably override LC_CTYPE to UTF-8.  Well-behaved, modern
> libraries should honor that locale setting.

Hmm. I'm not a multilanguage Unix guy. I'll forward this message to
vorbis-dev to see if any of those people have a comment on this issue.
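
For reference, on POSIX systems the character set implied by LC_CTYPE can be
queried with setlocale() and nl_langinfo(CODESET). This is a minimal sketch,
assuming a POSIX C library; the helper names locale_codeset and locale_is_utf8
are hypothetical and nothing here is taken from vcedit.c. It does not apply to
Windows, which has no <langinfo.h>.

```c
#include <locale.h>
#include <langinfo.h>
#include <string.h>

/* Hypothetical helper: adopt the user's LC_CTYPE from the environment and
 * return the name of its character set (for example "UTF-8" or "EUC-JP").
 * The returned string is owned by the C library; do not free it. */
const char *locale_codeset(void)
{
    /* If the environment names an unknown locale, setlocale() fails and
     * the current locale (normally "C") stays in effect. */
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}

/* Hypothetical helper: 1 if the user's locale is already UTF-8, so tag
 * text could be passed through without any conversion. */
int locale_is_utf8(void)
{
    const char *cs = locale_codeset();
    return strcmp(cs, "UTF-8") == 0 || strcmp(cs, "utf8") == 0;
}
```

A library that respects LC_CTYPE could use the returned name as the source
encoding for a conversion to UTF-8 and skip the conversion when
locale_is_utf8() returns 1.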

On Windoze, it seems that the 'right thing' to do is accept UTF-8 only.
LC_CTYPE definitely isn't a standard, and I can't think of any obvious way
to get the locale the user thinks they're working in.

> > > As for printing: like I said, on Unix systems, don't convert at all.
> > > Leave 'em exactly as the user gave them.
> >
> > Doesn't work on NT. (I'm not quite so certain about 9x).
> > argv[] is in CP_ACP. printf() requires CP_<current set>. Translation is
> > required for reasonable output. Annoying but true.
>
> There are wide versions of printf (_wprintf, iirc), etc.  We just have
> to translate from UTF-8 back to UCS2.

Actually, even on NT wprintf truncates each wide character to a single byte
before printing, resulting in garbage on the screen.

Testing... yup. Umlaut-u gets wprintf()ed as superscript-n in all the code
pages I have installed (even though the u doesn't look like it has an umlaut
in some of the code pages).

Makes wprintf (which my docs claim is ANSI/ISO C, actually) pretty useless
on NT, IMHO.

> If it's simply impossible to output text in other codepages in a console
> app in Windows, then we should forget about _wprintf (it won't buy us
> anything.)  Just convert from UTF-8 to LC_MESSAGES type (I think) and
> print it; for a Windows system, that locale setting will be the
> codepage.

Yes, this works (at least on NT).

We _can_ change the console output code page if we want to, but I can't
see any advantage to it. (How do we guess at the correct code page from the
UTF8 source? What if the user wanted a different code page in order to see a
different subset of the characters?)
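
The "convert from UTF-8 and print" approach could look roughly like the
following puts()-style wrapper. This is a sketch under assumptions, not what
vorbis-tools/share/utf8.c does: the name utf8_puts is hypothetical, and
using GetConsoleOutputCP() as the target code page is my reading of "printf()
expects the current code page". On non-Windows systems the sketch prints the
bytes unchanged.

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: print a NUL-terminated UTF-8 string plus a newline.
 * Returns a non-negative value on success and EOF on failure, like puts(). */
int utf8_puts(const char *utf8)
{
#ifdef _WIN32
    int rc = EOF;
    int wlen, llen;
    wchar_t *wide = NULL;
    char *local = NULL;
    UINT cp = GetConsoleOutputCP();   /* code page the console displays */

    if (cp == 0)
        cp = GetACP();                /* no console: use the ANSI code page */

    /* Step 1: UTF-8 -> UTF-16. A first call with no buffer returns the size. */
    wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (wlen == 0)
        return EOF;
    wide = malloc((size_t)wlen * sizeof *wide);
    if (wide == NULL)
        return EOF;

    if (MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, wlen) != 0) {
        /* Step 2: UTF-16 -> console code page. Characters the code page
         * cannot represent are replaced by its default character. */
        llen = WideCharToMultiByte(cp, 0, wide, -1, NULL, 0, NULL, NULL);
        if (llen > 0 && (local = malloc((size_t)llen)) != NULL &&
            WideCharToMultiByte(cp, 0, wide, -1, local, llen, NULL, NULL) != 0)
            rc = puts(local);
    }

    free(local);
    free(wide);
    return rc;
#else
    /* Elsewhere, leave the bytes exactly as given. */
    return puts(utf8);
#endif
}
```

Called as utf8_puts("caf\xc3\xa9"), this would print the word in whatever
code page the console currently uses, without changing the console's setting.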

> > > So, what is LC_CTYPE set to?  1: In Unix, it's whatever LC_CTYPE is
> > > set to.  Leave it alone.  2: In Windows with Unicode, it's UTF-8.  3:  In
> > > Windows without Unicode, it's the codepage ("CPnnn".)  #2 should be
> > > optional; programmers should be able to use #3 on NT systems if they
> > > don't want to deal with UTF-8 directly at all.  (This would result in
> > > less than ideal behavior for people like me, on English systems
> > > displaying Japanese, but that's going to be the case anyway.)
> >
> > #3 is what we do right now. See vorbis-tools/share/utf8.c, inside #ifdef
> > WIN32
> >
> > For option #3, on my Canadian English NT system, LC_CTYPE (or, at least,
> > the type of argv) is always CP_ACP, regardless of the code page of the
> > console. Unfortunately, printf() expects data in CP_<current set code page>.
> >
> > Option #2 is what I would like to do. I'd like to keep it as clean as
> > possible, though. Sounds like an fopen() wrapper will be necessary. I'm
> > not quite so certain about what to do with printf(), though. (Definitely
> > some sort of wrapper, but if I can get away with a puts() wrapper instead
> > of a printf() wrapper, that would be _much_ easier to deal with).
>
> I'm saying that it's important that we do *all three* of them.  All
> three can be done at once, cleanly and without a lot of code.  If you
> need more explanation, let me know and I'll give it.

I think that #3 will always only work in a subset of cases where #2 will
work. Where is the advantage in supporting a method that doesn't work?

> We need to make it easy for Japanese programmers to use vcedit.c as a
> library, without having to deal with encodings, and get correct behavior.
> If we don't make it happen transparently, they'll probably just put SJIS in
> tags.
>
> I'm willing to help in this directly, by the way.  I think the only
> point we're actively disagreeing on is whether the API should always
> take UTF-8 (you) or LC_CTYPE (me); read what I said about it and let's
> try to resolve that.

Maybe vcedit.c should respect LC_CTYPE on sane systems, and pretend it's
UTF8 on Windoze? Or should we have three interfaces on Windoze (UCS2,
default ANSI code page, current code page, those being the three things an
app is likely to have for its "LC_CTYPE")?

Actually, this is where it gets out of my realm. I'm happy to help, but I
don't think I'm up for deciding policy.

Would someone from xiph care to comment on this? Monty? Jack?

Peter Harris
