[vorbis] Re: UTF8, vorbiscomment, oggenc, and 'vcedit.c'

Glenn Maynard g_ogg at zewt.org
Thu Jan 10 14:53:26 PST 2002



On Thu, Jan 10, 2002 at 05:05:54PM -0500, Peter Harris wrote:
> #define UNICODE before you #include <windows.h>, and you get *W by default
> and you need to actively use *A if that's what you want.

But that means you have to use wide strings all over the place, too.
That's not a typical way people program in Windows, especially since
such binaries are incompatible with 9x.

Anyway, I think we agree where it matters.

> POSIX.2. Which Windows definitely does not conform to. Not even the NTs that
> supposedly have a POSIX subsystem.

Right; the calls to find out character sets need to have special cases
for Windows anyway.
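
To make the special case concrete, here is a rough sketch of what such a
lookup might look like. The helper name local_charset() and the "CP%u"
naming are made up for illustration; nothing like this exists in the
Vorbis tools today. It assumes GetACP() on Win32 and nl_langinfo(CODESET)
wherever POSIX.2 locale support is available.

```c
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <locale.h>
#include <langinfo.h>
#endif

/* Hypothetical helper (not part of any Vorbis library): writes a name for
 * the local character set into out and returns out. */
const char *local_charset(char *out, size_t outlen)
{
#ifdef _WIN32
    /* Windows special case: report the default ANSI code page, e.g. "CP1252". */
    snprintf(out, outlen, "CP%u", (unsigned)GetACP());
#else
    /* POSIX case: pick up the user's locale, then ask for its codeset.
     * Note that this changes the process-wide LC_CTYPE setting. */
    setlocale(LC_CTYPE, "");
    snprintf(out, outlen, "%s", nl_langinfo(CODESET));
#endif
    return out;
}
```

The returned name would then have to be mapped onto whatever the
conversion code understands, which is a separate problem.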

> Ah. I see where you were going now. I was talking specifically and only
> about vorbiscomment and oggenc. You were talking about libraries and all
> potential future apps written by other people.

That is, vcedit.c, not vcomment.c.

> On Windows systems, argv[] happens in the default ANSI code page.
> fgetc(stdin) happens in the current input code page (which is _never_ the
> default ANSI code page on my system, as far as I can tell). printf() happens
> in the current output code page (which is usually the same as the input code
> page, but doesn't need to be).

Yay, now *I'm* confused.  I've only ever seen Windows non-UCS2 strings
encoded in the code page that GetACP() reports.

> Wild idea (ie. I don't know yet if I like it or not): Putting the interface
> in UCS2 only (on Win32, of course. We can still be sane on *nix) forces the
> programmer to think about where the string is coming from. This sounds like
> a silly thing to do, but it might reduce the multi-language problems you
> mention above.

That means duplicating the whole codebase, or infesting the whole thing
with TCHAR, Windows' sometimes-char-sometimes-wchar data type.  Neither
is at all attractive.

> Err... in the context of a library, output should always contain as much
> information as possible. That means UTF-8. Forcing the characters into the
> local character set isn't very productive if you are going to turn around
> and use the information for (eg) tagging another file. It should be possible
> to do that sort of operation losslessly. The library doesn't know if you
> are going to dump the tag to the screen. And even if you are, the library
> doesn't know if you are an X app with 16-bit UNICODE fonts installed on your
> X server.

That's true.  However, something needs to be done to make sure we don't
get oddly encoded data files.  And making the library look at it and
reject it if it's not valid UTF-8 is no good; the programmer will just
become annoyed, say "just take my data, damn it!" and override it.
Especially for programmers who don't like UTF-8 and don't want to deal
with it at all, we need to make doing the right thing convenient.  (Even
if their dislike isn't justified; if they decide they don't like it
because their parrot told them not to, or because they don't like the
initials, the end result would be the same.)
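
For reference, the validity check itself is cheap to write. A minimal
sketch follows; is_valid_utf8() is a name invented here, not an existing
libvorbis function. It only answers "is this well-formed UTF-8?" and does
nothing about the convenience problem above.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is well-formed UTF-8,
 * 0 otherwise.  Rejects overlong forms, UTF-16 surrogates
 * (U+D800..U+DFFF) and code points above U+10FFFF. */
int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        size_t extra;       /* continuation bytes expected after the lead byte */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for this length */
        size_t k;

        if (c < 0x80) { i++; continue; }  /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;  /* stray continuation byte or invalid lead byte */

        if (len - i <= extra)
            return 0;   /* sequence is cut off at the end of the buffer */

        for (k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80)
                return 0;  /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;   /* overlong, out of range, or a surrogate */

        i += extra + 1;
    }
    return 1;
}
```

A Latin-1 "caf\xE9" fails this check while the UTF-8 "caf\xC3\xA9"
passes, which is exactly the "oddly encoded data" case; what the caller
is offered when the check fails is the open question.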


-- 
Glenn Maynard
