[vorbis] Re: UTF8, vorbiscomment, oggenc, and 'vcedit.c'

Glenn Maynard g_ogg at zewt.org
Thu Jan 10 13:04:45 PST 2002



On Thu, Jan 10, 2002 at 02:19:22PM -0500, Peter Harris wrote:
> > > but there is no 'Right Thing' for all windows in general.
> >
> > Not true, in 2K anyway.
> >
> > For example, take WinMX.  It only has a standard WM_CHAR message
> > handler.
> *snip rest of text where you seem to be violently agreeing with me*

Er.  You said that on NT the local character set is UCS2.  That's true
only at a low level; unless you actively use the *W functions, the local
character set is determined by the codepage.

> For sure. Note, however, that we are talking about a console app here. No
> WM_CHAR, or WM_anything messages of any sort.

I know.

> I think it will 'just work', right up until that person tries to play back
> via XMMS. "Why are all my tags garbage?"

Except that they're likely to write players (for Japanese users) that
*do* work for this.  Note that there are, in fact, a large number of
incorrectly coded MP3 tags in the wild.

> > Another, more general reason I disagree: If I'm a Unix programmer (hey
> > wait, I am!), all of my C library calls will respect the locale.  All of
> > my own functions will, too, unless I'm writing an internally-UTF8
> > program.  Everybody respects what LC_CTYPE says; if my program is
> > internally UTF8, I'll probably override LC_CTYPE to UTF-8.
> > Well-behaved, modern libraries should honor that locale setting.
> 
> Hmm. I'm not a multilanguage Unix guy. I'll forward this message to
> vorbis-dev to see if any of those people have a comment on this issue.
> 
> On Windoze, it seems that the 'right thing' to do is accept UTF-8 only.
> LC_CTYPE definitely isn't a standard, and I can't think of any obvious way
> to get the locale the user thinks they're working in.

LC_CTYPE isn't a standard?  It absolutely is.  Read locale(7); note
"CONFORMS TO".

(Suggestion: read up well on locales before doing any i18n-related work
in Unix.  Sorry, I don't have any references off-hand; I got most of my
info in passing, from code and manpages.
http://www.cl.cam.ac.uk/~mgk25/unicode.html is a good start, but it's
Unicode-centered.)

> Makes wprintf (which my docs claim is ANSI/ISO C, actually) pretty useless
> on NT, IMHO.

That's what I figured.  (If the output is just a FILE *, then all you're
*able* to output is codepage data, unless they're doing something really
evil under the hood.)

This is a limitation, but it makes things simpler.  Output is the locale
(*nix) or codepage (Win32), period; it's the only option.

> We _can_ change the console output code page if we want to, however I can't
> see any advantage to it. (How do we guess at the correct code page from the
> UTF8 source? What if the user wanted a different code page in order to see a
> different subset of the characters?)

I don't think that would help anything.  I don't know how you can do
that, but unless you change the font, too, you still can't display much.

Also, I don't think you can have different encodings for LC_CTYPE and
LC_MESSAGES.  (There's one function to find out the "character encoding
used in the selected locale", nl_langinfo(CODESET).)
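To make that concrete, here's a minimal sketch of how a locale-aware
program would find its charset on a POSIX system (the function names
here are mine, just for illustration):

```c
#include <locale.h>
#include <langinfo.h>
#include <stdio.h>

/* Report the character encoding of the current locale.  The result of
 * nl_langinfo(CODESET) tracks whatever setlocale() last selected for
 * LC_CTYPE; until setlocale() runs, that's the default "C" locale. */
static const char *current_codeset(void)
{
        return nl_langinfo(CODESET);
}

/* Typical startup sequence for a locale-aware program: */
static void adopt_user_locale(void)
{
        setlocale(LC_CTYPE, "");   /* honor LC_CTYPE/LANG from the environment */
        printf("locale charset: %s\n", current_codeset());
}
```

Before setlocale(), current_codeset() reports the "C" locale's charset
(usually plain ASCII); after adopt_user_locale() under something like
en_US.UTF-8, it reports UTF-8.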

> I think that #3 will always only work in a subset of cases where #2 will
> work. Where is the advantage in supporting a method that doesn't work?

The advantage is that we don't require the user to do any conversions
himself; he passes data in as a regular string.  If he really *does*
want to do that himself, and likes dealing with UTF-8, that's fine; he
can use #2.

#3 is the equivalent of #1 in Windows, so it shouldn't take much effort
to support if #1 is done first.

> > I'm willing to help in this directly, by the way.  I think the only
> > point we're actively disagreeing on is whether the API should always
> > take UTF-8 (you) or LC_CTYPE (me); read what I said about it and let's
> > try to resolve that.
> 
> Maybe vcedit.c should respect LC_CTYPE on sane systems, and pretend it's
> UTF8 on Windoze? Or should we have three interfaces on Windoze (UCS2,
> default ANSI code page, current code page, those being the three things an
> app is likely to have for its "LC_CTYPE")?

I don't think providing a UCS2 interface is worth it.  If someone using
the library wants Unicode, he's probably just as happy with UTF-8 as
UCS2, and we don't need any extra types for UTF-8.

I believe it's as simple as this (example code is more concise than
trying to explain it in English; it needs <stdbool.h>, <stdio.h>, and
<string.h>, plus <windows.h> on Win32 or <langinfo.h> elsewhere):

/* Whether the API expects UTF-8 instead of the locale or active codepage. */
static bool api_is_utf8 = false;

/* Name the encoding the API currently accepts, e.g. "CP1252" or "UTF-8". */
const char *api_encoding(void)
{
        static char buf[200];

        if(api_is_utf8) return "UTF-8";

#ifdef _WIN32
        /* GetACP() returns the active ANSI codepage number. */
        snprintf(buf, sizeof(buf), "CP%u", GetACP());
#else
        /* nl_langinfo(CODESET) names the charset of the current locale. */
        strncpy(buf, nl_langinfo(CODESET), sizeof(buf) - 1);
#endif
        buf[sizeof(buf) - 1] = 0;

        return buf;
}

On Unix systems, the API defaults to the locale.

On Windows systems, the API defaults to the codepage (which is the rough
equivalent, and what most users would expect.)

Those who want UTF-8 in the API simply set api_is_utf8 to true.  Output
stays the same (it has to), and API calls assume input is UTF-8.  (This
would work both on Windows and Unix, and be useful on both.)

Leaving UTF-8 off by default means the user doesn't have to worry about
portability between Windows and Unix (for this, anyway); he simply
passes in strings and the library handles all of the conversion.
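Internally, the library would then convert everything it's handed from
api_encoding() into UTF-8 before writing the comment.  A minimal
conversion sketch using POSIX iconv (the helper name is hypothetical;
vcedit.c would call something like it on every string the application
passes in):

```c
#include <iconv.h>
#include <string.h>
#include <stdlib.h>

/* Convert 'in' from the encoding named by 'from' into UTF-8.
 * Returns a malloc()ed string, or NULL on error. */
static char *to_utf8(const char *from, const char *in)
{
        iconv_t cd = iconv_open("UTF-8", from);
        if(cd == (iconv_t)-1) return NULL;

        size_t inleft = strlen(in);
        size_t outleft = 4 * inleft + 1;   /* worst case for UTF-8 output */
        char *out = malloc(outleft), *outp = out;
        char *inp = (char *)in;

        if(out && iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
                *outp = 0;                 /* NUL-terminate the result */
        else {
                free(out);
                out = NULL;
        }

        iconv_close(cd);
        return out;
}
```

For example, to_utf8("ISO-8859-1", tag) would give back the tag
recoded as UTF-8, ready to store in the comment packet.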

> Would someone from xiph care to comment on this? Monty? Jack?

I'd like to hear from them as well.


-- 
Glenn Maynard

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'vorbis-request at xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.