[vorbis-dev] UTF-8 & Hebrew

Beni Cherniavsky cben at techunix.technion.ac.il
Thu Mar 7 00:55:26 PST 2002



On 2002-03-07, SyP wrote:

> Hello Ross,
>
> You wrote:
>
> > I understood that UTF-8 utilizes as many 8-bit characters as it needs to
> > store the required extended character.  I think that can be up to 4
> > bytes (32 bits).  The old 16-bit unicode standard is apparently on its
> > way out.
>
It's 6 IIRC.  Unicode itself is 32 bits (UCS-4, of which 31 bits are used).
To potentially represent all of it in a variable-length encoding where some
characters are shorter than 4 bytes, where every byte of every character
except the 128 ASCII ones has the top bit set, and where character
boundaries are easy to detect (which takes up more bits) - it seems pretty
good that it always fits into 6 bytes...
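
To make the bit budget concrete, here is a minimal Python sketch of the
original 1-to-6-byte UTF-8 layout described above.  The function names are
made up for illustration; this is not code from libvorbis or any plugin.

```python
def utf8_encode_original(cp: int) -> bytes:
    """Encode one code point with the original 1-to-6-byte UTF-8 layout."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("code point outside the 31-bit UCS-4 range")
    if cp < 0x80:
        return bytes([cp])  # the 128 ASCII characters: one byte, top bit clear
    # (total bytes, exclusive upper limit, marker bits of the lead byte)
    layouts = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for nbytes, limit, lead in layouts:
        if cp < limit:
            break
    tail = []
    for _ in range(nbytes - 1):
        tail.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    tail.append(lead | cp)  # lead byte carries the remaining high bits
    return bytes(reversed(tail))


def is_continuation_byte(b: int) -> bool:
    """True for 10xxxxxx bytes, which never start a character."""
    return (b & 0xC0) == 0x80


print(utf8_encode_original(0x00D8).hex())      # O slash
print(utf8_encode_original(0x05D0).hex())      # Hebrew alef
print(len(utf8_encode_original(0x7FFFFFFF)))   # largest 31-bit value
```

O slash (U+00D8) comes out as the two bytes C3 98, and a Hebrew letter such
as alef (U+05D0) is likewise two bytes, so at the encoding level the two
cases are handled identically.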

> Unicode isn't 16-bit, explicitly not, since at least Unicode 2.0.
> The confusion is partly caused by the fact that, for a long time, there
> weren't any characters in Unicode outside of the 0000-FFFF range.
>
> Now, understand that UTF-8 is not Unicode, UTF-8 is a *representation*
> of Unicode, using byte sequences of varying length.
>
> > Currently if I (or you) enter an Arial #216 character (O slash, Ø) into a
> > comment, it is saved as 2 bytes (C398) and displayed correctly as one
> > character in my app & the Winamp comment editor, so I don't understand
> > why this works but the Hebrew characters do not.  Is it that UTF-8 is
> > not fully supported in Windows?
>
> Windows 95/98/ME's multibyte support, Hebrew support, and CJK support are
> hacks compared to the all-Unicodeness which started with NT and is fully
> supported by 2K and XP. And I think displaying Hebrew text isn't as
> simple as displaying, say, Cyrillic: it's right-to-left, it has a
> complicated system of annotation dots, etc. So I don't think that a
> Cyrillic Win98 will ever display the Hebrew comments correctly, but I
> may be wrong.
>
Sure.  I myself never tried a Cyrillic Win98, but English ones don't have
right-to-left support, and I never heard of any but Hebrew ones having it.
Annotation dots are very seldom used (I write them once in a couple of
months; of all my song titles I had only one pair that needed the dots to
disambiguate...) so the undoubted lack of support for them in non-Hebrew
Windozes is not a problem.

However, the mentioned user had WinXP, which is truly Unicode, isn't it?
On my Win2k I can see all text (Hebrew & Russian; Hebrew is my default
codepage) in Peter's plugin when I double-click the tag to edit it.
However, in the list view of tags, only text in the current codepage is
displayed!  So this is a bug in Peter's plugin (I reported it once but it
was ignored).  IIRC the standard tags in non-advanced mode show up all
right; only the list view is problematic.
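
I don't know the plugin's internals, so the following is only a guess at the
kind of failure involved: if the list view converts the UTF-8 comment to the
current 8-bit codepage before drawing it, everything outside that codepage is
lost.  A small Python sketch of that effect (the helper name is made up;
cp1255 and cp1251 are the Windows Hebrew and Cyrillic codepages):

```python
def show_in_codepage(text: str, codepage: str) -> str:
    """Round-trip text through a legacy codepage; unmappable chars become '?'."""
    return text.encode(codepage, errors="replace").decode(codepage)


hebrew = "\u05e9\u05dc\u05d5\u05dd"                # "shalom" in Hebrew letters
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"   # "Privet" in Cyrillic letters

# With Hebrew as the current codepage, Hebrew survives and Russian does not.
print(show_in_codepage(hebrew, "cp1255") == hebrew)
print(show_in_codepage(russian, "cp1255"))

# With a Cyrillic codepage it is the other way round.
print(show_in_codepage(russian, "cp1251") == russian)
print(show_in_codepage(hebrew, "cp1251"))
```

An edit box that takes the text as Unicode would not have this problem,
which would match what I see when double-clicking a tag.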


-- 
Beni Cherniavsky <cben at tx.technion.ac.il>
                 (also scben at t2 in Technion)
Common Lisp is better than Common Source and
Open Source is better than Open Collector (YMMV).

--- >8 ----
List archives:  http://www.xiph.org/archives/
Ogg project homepage: http://www.xiph.org/ogg/
To unsubscribe from this list, send a message to 'vorbis-dev-request at xiph.org'
containing only the word 'unsubscribe' in the body.  No subject is needed.
Unsubscribe messages sent to the list will be ignored/filtered.