[vorbis] TAG Standard - ENSEMBLE/PERFORMER tags

Peter Harris peter.harris at hummingbird.com
Wed Jan 9 18:24:14 PST 2002



> > Right now, oggenc works entirely in the local character set, and only
> > converts to UTF8 at the last possible instant.
>
> That's probably the Right Thing.

On Unix and maybe 9x, yes. On NT, the 'local character set' is UCS2. I
suppose you could detect NT vs 9x and use the UNICODE file functions on NT,
but there is no 'Right Thing' for all windows in general.

Here I'm noticing that Solaris, Linux, *BSD, Irix, etc. are all more
compatible with each other than Windows is with itself.

That says something. I'm not quite sure what, but something.

> Here's the trick, though: we need to make sure that the libraries we end
> up with are capable of running in any encoding.  If someone's in a
> Shift-JIS encoding, and is writing an editor using this code, he'll
> expect it to be able to take Shift-JIS input to the functions.  Looking
> at the files, this means vcedit.c needs to be able to deal with
> arbitrary encodings; it's the file acting as the "library".

Actually, vcedit.c does all work in UTF8. The user of vcedit is required to
do locale conversions for both input and output. A good idea for a library,
in my opinion.
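
For the Unix side, the conversion the caller has to do before handing a string to vcedit could look roughly like the sketch below. The helper name to_utf8() is made up for illustration (it is not part of vcedit or vorbis-tools), and the 4x output buffer is just a generous guess that covers the common charsets.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of vcedit): convert a NUL-terminated string
 * from the named charset to UTF-8. Returns a malloc'd string the caller
 * must free, or NULL on failure. */
char *to_utf8(const char *from_charset, const char *in)
{
    assert(from_charset != NULL && in != NULL);

    iconv_t cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t in_left = strlen(in);
    /* Assumption: 4 output bytes per input byte is enough room. */
    size_t out_size = in_left * 4 + 1;
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    /* iconv() takes char ** but does not modify the input bytes. */
    char *in_ptr = (char *)in;
    char *out_ptr = out;
    size_t out_left = out_size - 1;

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        /* Invalid or unconvertible input: report failure, don't guess. */
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}
```

The charset name would normally come from the locale, e.g. nl_langinfo(CODESET) after calling setlocale(LC_CTYPE, ""). Output goes through the same thing in the other direction.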

> > produce identical output to the input on the command line, but I'm not sure
> > exactly how far I can trust MultiByteToWideChar() <-> WideCharToMultiByte()
> > (or a double ICONV for Unix people).
>
> In Unix, filenames should never be touched, not when printed and not
> when opening.

Okay. So all-UTF8 is out, then.

> In the case of wmain(), I'm not sure.  They've already been converted
> for us and there's nothing we can do about it.

Actually, it's the regular main() that's a problem on Windoze. On NT at
least, in the case of main(), the command line has already been converted for
us (losing some information, and possibly destroying the file name).

>  I think you're
> *supposed* to access them with wide file access functions, which means
> filenames need to be converted from UTF-8 to UCS2 and opened with wide
> functions (on NT systems.)  PITA, but doable.
*snip implementation details*

Agreed.
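
A rough sketch of what such an fopen() wrapper might look like, assuming the program keeps file names in UTF-8 internally on Windows. fopen_utf8() and utf8_to_wide() are hypothetical names, not existing vorbis-tools functions. On Unix the name is passed through byte-for-byte, as discussed above.

```c
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>

/* Convert a UTF-8 string to a malloc'd wide string, or NULL on failure. */
static wchar_t *utf8_to_wide(const char *utf8)
{
    /* First call only asks for the length, including the final NUL. */
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len <= 0)
        return NULL;
    wchar_t *wide = malloc((size_t)len * sizeof *wide);
    if (wide != NULL
        && MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len) <= 0) {
        free(wide);
        wide = NULL;
    }
    return wide;
}
#endif

/* Hypothetical wrapper: open a file whose name is held in UTF-8. */
FILE *fopen_utf8(const char *utf8_name, const char *mode)
{
    assert(utf8_name != NULL && mode != NULL);
#ifdef _WIN32
    FILE *fp = NULL;
    wchar_t *wname = utf8_to_wide(utf8_name);
    wchar_t *wmode = utf8_to_wide(mode);
    if (wname != NULL && wmode != NULL)
        fp = _wfopen(wname, wmode);   /* wide file access function */
    free(wname);
    free(wmode);
    return fp;
#else
    /* Unix: never touch the file name, just hand the bytes to fopen(). */
    return fopen(utf8_name, mode);
#endif
}
```

This only covers the NT case. Whether the wide file functions do anything useful on 9x is a separate question.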

> As for printing: like I said, on Unix systems, don't convert at all.
> Leave 'em exactly as the user gave them.

Doesn't work on NT. (I'm not quite so certain about 9x.)
argv[] is in CP_ACP. printf() requires CP_<current code page>. Translation is
required for reasonable output. Annoying but true.

*searches documentation*
Eww. It looks like 9x has the same problem as NT: argv[] is in CP_ACP,
filenames and printf() are in CP_<current code page>.

<quote source="SetFileApisToOEM">
When dealing with command lines, a console application should obtain the
command line in Unicode form and then convert it to OEM form using the
relevant character-to-OEM functions. Note also that the array in the argv
parameter of the command-line main function contains ANSI character set
strings in this case.
</quote>

No wonder projects like wine have trouble: They have to emulate all this
weirdness.

> On NT systems, we have to convert them back to the local codepage.
> Since we'll always convert them to UTF-8 (since they start out UCS2)
> it's useless to output them unconverted.  What happens if we're on an NT
> English system, and we display Japanese text this way?  I'm not sure; I
> suspect it'll print "?"s, since we just converted it to the English
> codepage.  To fix this, we'd need to wrap printf to do the same thing as
> fopen.  (If the fopen wrapper already exists, then this one's no big
> deal to add.)

I don't know what it will do to Japanese text on an English system. I do
know that converting from UNICODE to CP_<current code page> drops accents in
an attempt to 'best-fit' European characters on my English system.

> > How so? All of the tags are stored as UNICODE UTF8. How is translating the
> > command line from (whatever) to UTF8 sooner rather than later going to screw
> > CJK Win98 users any more than they already are?
>
> Win98 users can't use a Unicode version of the program, since the OS
> doesn't support it.  (It can handle codepage conversions, I *think*, but
> it can't deal with wmain(), for example.)  If the only way to get that text
> to work is to use a Unicode version, then the 9x Japanese users have no
> option at all.

According to my copy of the docs, Win9x supports GetCommandLineW(). I'm
guessing it will convert the command line from the current code page to
UCS2.

(On NT, the current command line is stored in UCS2, and only displayed via
the current code page. Which means you can pass in characters that aren't in
the current code page via copy-and-paste. Or, at least, you can on my
system.)
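
Once the command line is in hand as 16-bit units, getting it into UTF-8 for the tags is mechanical. Here is a minimal hand-rolled sketch that doesn't depend on any OS conversion call. utf16_to_utf8() is a made-up name, and replacing unpaired surrogates with U+FFFD is my assumption about what a sane fallback would be. (Strict UCS2 has no surrogates at all, so that branch only matters for UTF-16 input.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: encode n UTF-16 code units as a NUL-terminated
 * UTF-8 string. Returns a malloc'd buffer, or NULL if malloc fails.
 * Unpaired surrogates are replaced with U+FFFD. */
char *utf16_to_utf8(const uint16_t *in, size_t n)
{
    assert(in != NULL || n == 0);

    /* One code unit needs at most 3 bytes; a surrogate pair is
     * 2 units -> 4 bytes, so 3 bytes per unit is always enough. */
    unsigned char *out = malloc(n * 3 + 1);
    if (out == NULL)
        return NULL;

    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i + 1 < n && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                i++;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;   /* stray low surrogate */
        }

        if (cp < 0x80) {
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return (char *)out;
}
```

On Windows, where wchar_t is 16 bits, the buffer returned by GetCommandLineW() and its wcslen() could be passed in after a cast.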

> As long as main() behaves like I believe it does, this isn't the case,
> however.  We just have to make sure vcedit.c knows that LC_CTYPE is
> effectively the codepage, so it knows to convert buffers properly.
>
> So, what is LC_CTYPE set to?  1: In Unix, it's whatever LC_CTYPE is set
> to.  Leave it alone.  2: In Windows with Unicode, it's UTF-8.  3:  In
> Windows without Unicode, it's the codepage ("CPnnn".)  #2 should be
> optional; programmers should be able to use #3 on NT systems if they
> don't want to deal with UTF-8 directly at all.  (This would result in
> less than ideal behavior for people like me, on English systems
> displaying Japanese, but that's going to be the case anyway.)

#3 is what we do right now. See vorbis-tools/share/utf8.c, inside #ifdef WIN32.

For option #3, on my Canadian English NT system, LC_CTYPE (or, at least, the
type of argv) is always CP_ACP, regardless of the code page of the console.
Unfortunately, printf() expects data in CP_<current code page>.

Bizarre, eh?

So to get a printout of the command line that looks right, you need to do
the conversion regardless. Since that conversion has to go through UCS2, we
might as well just work in UNICODE. As long as we're working in UNICODE, we
might as well just take the command line in UNICODE in the first place,
putting us back at option #2.

Option #2 is what I would like to do. I'd like to keep it as clean as
possible, though. Sounds like an fopen() wrapper will be necessary. I'm not
quite so certain about what to do with printf(), though. (Definitely some
sort of wrapper, but if I can get away with a puts() wrapper instead of a
printf() wrapper, that would be _much_ easier to deal with).
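
To make the puts()-style idea concrete, here is a minimal sketch, assuming strings are UTF-8 internally and that the console's output code page is the right target on Windows. fputs_utf8() is a hypothetical name. Characters that don't exist in the target code page get whatever substitution WideCharToMultiByte() chooses.

```c
#include <assert.h>
#include <stdio.h>

#ifdef _WIN32
#include <stdlib.h>
#include <windows.h>
#endif

/* Hypothetical wrapper: write a UTF-8 string to out, converting to the
 * console's output code page on Windows. Returns EOF on failure. */
int fputs_utf8(const char *utf8, FILE *out)
{
    assert(utf8 != NULL && out != NULL);
#ifdef _WIN32
    int rc = EOF;
    wchar_t *wide = NULL;
    char *local = NULL;

    /* Step 1: UTF-8 -> wide characters. */
    int wlen = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (wlen > 0
        && (wide = malloc((size_t)wlen * sizeof *wide)) != NULL
        && MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, wlen) > 0) {
        /* Step 2: wide characters -> console output code page. */
        UINT cp = GetConsoleOutputCP();
        int llen = WideCharToMultiByte(cp, 0, wide, -1,
                                       NULL, 0, NULL, NULL);
        if (llen > 0
            && (local = malloc((size_t)llen)) != NULL
            && WideCharToMultiByte(cp, 0, wide, -1,
                                   local, llen, NULL, NULL) > 0)
            rc = fputs(local, out);
    }
    free(local);
    free(wide);
    return rc;
#else
    /* Unix: print the bytes exactly as given. */
    return fputs(utf8, out);
#endif
}
```

Formatted output could still be built with snprintf() into a UTF-8 buffer and then handed to this, which avoids having to wrap the printf() varargs machinery itself.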

Peter Harris
