[vorbis-dev] UTF-8 stuff
Edmund GRIMLEY EVANS
edmundo at rano.org
Sun Sep 30 12:46:26 PDT 2001
Here's a proposed heavy-duty solution for your UTF-8 problems.
I'm including a patch in this message, but I'll put the new files on
my web site at http://rano.org/tmp/xiph_files.tar.gz
I've tested this by running vorbiscomment with and without
-DHAVE_ICONV=1 in vorbis-tools/share/Makefile. It seems to work.
Changed files:
acinclude.m4: Add a test for nl_langinfo(CODESET). This is the
function that lets you discover the charset of the user's locale
without forcing them to use a command-line argument.
configure.in: Use AM_LANGINFO_CODESET.
utf8.h, utf8.c: These files are totally rewritten, apart from the
Windows part. Instead of utf8_encode() and utf8_decode() there's
convert_to_utf8() and convert_from_utf8(), which no longer have an
"encoding" argument.
oggenc.c, vcomment.c: Call setlocale(), so that nl_langinfo() will
work. Remove the "encoding" option and "encoding" arguments. Instead
there's either nl_langinfo(CODESET) or the environment variable
CHARSET. Call convert_{to|from}_utf8() instead of utf8_{en|de}code().
Makefile.am: Different set of source files.
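
For illustration, the charset lookup order described above (explicit
name, then nl_langinfo(CODESET), then the CHARSET environment variable,
then US-ASCII) might be sketched as follows. This is a minimal sketch,
not the code in the patch: detect_charset() is a hypothetical name, it
assumes <langinfo.h> is available (the patch guards that with
HAVE_LANGINFO_CODESET), and it treats an empty string as "not set".

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of the patch: choose a charset name.
 * The returned pointer may refer to static storage owned by the C
 * library, so a caller that keeps it should copy it (the patch uses
 * strdup() for this).
 */
const char *detect_charset(const char *explicit_charset)
{
    const char *cs = explicit_charset;

    /* An explicit, non-empty name always wins. */
    if (cs && *cs)
        return cs;

    /* nl_langinfo() only reflects the user's locale after setlocale(). */
    setlocale(LC_ALL, "");
    cs = nl_langinfo(CODESET);
    if (cs && *cs)
        return cs;

    /* Fall back to the CHARSET environment variable, then US-ASCII. */
    cs = getenv("CHARSET");
    if (cs && *cs)
        return cs;
    return "US-ASCII";
}
```

Without the setlocale() call the program stays in the "C" locale, and
nl_langinfo(CODESET) reports that locale's charset rather than the
user's, which is why oggenc.c and vcomment.c now call setlocale().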
New files:
charset.c, charset.h: My functions for converting between encodings,
with header file.
charmaps.h: Mapping tables included by charset.c.
makemap.c: A program for generating a mapping table using iconv().
charset_test.c: A test suite for charset.c. Just compile it and link
it with charset.o.
iconvert.c: A function for converting a string using iconv().
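
To give an idea of what converting a string with iconv() involves, here
is a rough sketch. It is not the iconvert() from the tarball:
sketch_iconv_string() is a hypothetical name, it simply fails on
invalid or unconvertible input instead of substituting '#' or '?', and
it assumes a target encoding in which a single zero byte terminates the
string (so not UTF-16 or UCS-4).

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical sketch: convert the NUL-terminated string `from` and
 * store a malloc'd, NUL-terminated result in *to.
 * Returns 0 on success and -1 on any failure.
 */
int sketch_iconv_string(const char *fromcode, const char *tocode,
                        const char *from, char **to)
{
    iconv_t cd;
    char *out, *outp, *inp, *tmp;
    size_t inleft, outleft, cap, used, r;
    int flushing = 0;

    cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)(-1))
        return -1;                  /* conversion not supported */

    inleft = strlen(from);
    cap = inleft + 16;              /* initial guess, grown on E2BIG */
    out = malloc(cap + 1);          /* +1 for the terminating NUL */
    if (!out) {
        iconv_close(cd);
        return -1;
    }
    inp = (char *)from;             /* iconv() takes char **; some old
                                       systems declare const char ** */
    outp = out;
    outleft = cap;

    for (;;) {
        /* First convert the input, then flush any shift state. */
        r = flushing ? iconv(cd, NULL, NULL, &outp, &outleft)
                     : iconv(cd, &inp, &inleft, &outp, &outleft);
        if (r != (size_t)(-1)) {
            if (flushing)
                break;
            flushing = 1;
            continue;
        }
        if (errno != E2BIG)
            goto fail;              /* EILSEQ or EINVAL: bad input */

        /* Output buffer full: double it and carry on. */
        used = (size_t)(outp - out);
        cap *= 2;
        tmp = realloc(out, cap + 1);
        if (!tmp)
            goto fail;
        out = tmp;
        outp = out + used;
        outleft = cap - used;
    }

    *outp = '\0';
    *to = out;
    iconv_close(cd);
    return 0;

fail:
    free(out);
    iconv_close(cd);
    return -1;
}
```

Growing the buffer on E2BIG avoids the fixed 256-byte limit of the old
utf8_encode(). For example, converting "caf\xe9" from ISO-8859-1 to
UTF-8 should give "caf\xc3\xa9" on a system whose iconv supports both
encodings.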
Old files:
8859-1.map, 8859-2.map, make_code_map.pl: These are no longer used and
can be removed.
Things to do:
Maybe modify acinclude.m4 to check that iconv() works properly.
Experience with Mutt tells us that there are some severely broken
versions of iconv around.
At present exactly one of iconvert.c and charset.c is included in the
executable, but it's possible some people might want to have both. So
there should probably be a USE_CHARSET_CONVERT macro that by default
is defined iff HAVE_ICONV is not defined, but whose value can be
altered by an argument to configure.
Implement the Windows version of convert_from_utf8().
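
The USE_CHARSET_CONVERT idea could look roughly like this. It is only a
sketch of one possible arrangement: it gives the macro a 0/1 value
rather than testing whether it is defined, and enabled_converters() is
a made-up helper that exists purely to show which branches a given
build would take.

```c
/*
 * Hypothetical defaulting logic: a configure option would predefine
 * USE_CHARSET_CONVERT as 0 or 1; otherwise it defaults to 1 exactly
 * when HAVE_ICONV is not defined.
 */
#ifndef USE_CHARSET_CONVERT
# ifdef HAVE_ICONV
#  define USE_CHARSET_CONVERT 0
# else
#  define USE_CHARSET_CONVERT 1
# endif
#endif

/* Report which converters this build would try, in order. */
const char *enabled_converters(void)
{
#if defined(HAVE_ICONV) && USE_CHARSET_CONVERT
    return "iconv, charset";
#elif defined(HAVE_ICONV)
    return "iconv";
#elif USE_CHARSET_CONVERT
    return "charset";
#else
    return "none";
#endif
}
```

With this in place, convert_buffer() would test "#if USE_CHARSET_CONVERT"
instead of "#ifndef HAVE_ICONV", and building with both -DHAVE_ICONV=1
and -DUSE_CHARSET_CONVERT=1 would include both converters.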
Edmund
diff -ru xiph.orig/vorbis-tools/acinclude.m4 xiph/vorbis-tools/acinclude.m4
--- xiph.orig/vorbis-tools/acinclude.m4 Tue Aug 21 15:05:09 2001
+++ xiph/vorbis-tools/acinclude.m4 Sun Sep 30 19:14:02 2001
@@ -430,3 +430,19 @@
fi
AC_SUBST(LIBICONV)
])
+
+dnl From Bruno Haible.
+dnl
+AC_DEFUN([AM_LANGINFO_CODESET],
+[
+ AC_CACHE_CHECK([for nl_langinfo and CODESET], am_cv_langinfo_codeset,
+ [AC_TRY_LINK([#include <langinfo.h>],
+ [char* cs = nl_langinfo(CODESET);],
+ am_cv_langinfo_codeset=yes,
+ am_cv_langinfo_codeset=no)
+ ])
+ if test $am_cv_langinfo_codeset = yes; then
+ AC_DEFINE(HAVE_LANGINFO_CODESET, 1,
+ [Define if you have <langinfo.h> and nl_langinfo(CODESET).])
+ fi
+])
diff -ru xiph.orig/vorbis-tools/configure.in xiph/vorbis-tools/configure.in
--- xiph.orig/vorbis-tools/configure.in Mon Sep 24 23:42:12 2001
+++ xiph/vorbis-tools/configure.in Sun Sep 30 19:14:19 2001
@@ -111,6 +111,7 @@
AM_ICONV
AC_FUNC_SMMAP
+AM_LANGINFO_CODESET
dnl --------------------------------------------------
dnl Work around FHS stupidity
diff -ru xiph.orig/vorbis-tools/include/utf8.h xiph/vorbis-tools/include/utf8.h
--- xiph.orig/vorbis-tools/include/utf8.h Sat Sep 22 23:49:49 2001
+++ xiph/vorbis-tools/include/utf8.h Sun Sep 30 18:05:14 2001
@@ -1,18 +1,23 @@
-/* OggEnc
+
+/*
+ * Convert a string between UTF-8 and the locale's charset.
+ * Invalid bytes are replaced by '#', and characters that are
+ * not available in the target encoding are replaced by '?'.
+ *
+ * If the locale's charset is not set explicitly then it is
+ * obtained using nl_langinfo(CODESET), where available, the
+ * environment variable CHARSET, or assumed to be US-ASCII.
*
- * This program is distributed under the GNU General Public License, version 2.
- * A copy of this license is included with this source.
+ * Return value of conversion functions:
*
- * Copyright © 2001, Daniel Resare <noa at metamatrix.se>
+ * -1 : memory allocation failed
+ * 0 : data was converted exactly
+ * 1 : valid data was converted approximately (using '?')
+ * 2 : input was invalid (but still converted, using '#')
+ * 3 : unknown encoding (but still converted, using '?')
*/
-typedef struct
-{
- char* name;
- int mapping[256];
-} charset_map;
+void convert_set_charset(const char *charset);
-charset_map *get_map(const char *encoding);
-char *make_utf8_string(const unsigned short *unicode);
-int simple_utf8_encode(const char *from, char **to, const char *encoding);
-int utf8_encode(char *from, char **to, const char *encoding);
+int convert_to_utf8(const char *from, char **to);
+int convert_from_utf8(const char *from, char **to);
diff -ru xiph.orig/vorbis-tools/oggenc/oggenc.c xiph/vorbis-tools/oggenc/oggenc.c
--- xiph.orig/vorbis-tools/oggenc/oggenc.c Sun Sep 30 17:28:43 2001
+++ xiph/vorbis-tools/oggenc/oggenc.c Sun Sep 30 19:04:03 2001
@@ -15,6 +15,7 @@
#include <getopt.h>
#include <string.h>
#include <time.h>
+#include <locale.h>
#include "platform.h"
#include "encode.h"
@@ -50,7 +51,6 @@
{"date",1,0,'d'},
{"tracknum",1,0,'N'},
{"serial",1,0,'s'},
- {"encoding",1,0,'e'},
{NULL,0,0,0}
};
@@ -75,6 +75,8 @@
int numfiles;
int errors=0;
+ setlocale(LC_ALL, "");
+
parse_options(argc, argv, &opt);
if(optind >= argc)
@@ -320,8 +322,6 @@
" -s, --serial Specify a serial number for the stream. If encoding\n"
" multiple files, this will be incremented for each\n"
" stream after the first.\n"
- " -e, --encoding Specify an encoding for the comments given (not\n"
- " supported on windows)\n"
"\n"
" Naming:\n"
" -o, --output=fn Write file to fn (only valid in single-file mode)\n"
@@ -477,7 +477,7 @@
int ret;
int option_index = 1;
- while((ret = getopt_long(argc, argv, "a:b:B:c:C:d:e:G:hl:m:M:n:N:o:P:q:QrR:s:t:vX:",
+ while((ret = getopt_long(argc, argv, "a:b:B:c:C:d:G:hl:m:M:n:N:o:P:q:QrR:s:t:vX:",
long_options, &option_index)) != -1)
{
switch(ret)
@@ -498,9 +498,6 @@
opt->dates = realloc(opt->dates, (++opt->date_count)*sizeof(char *));
opt->dates[opt->date_count - 1] = strdup(optarg);
break;
- case 'e':
- opt->encoding = strdup(optarg);
- break;
case 'G':
opt->genre = realloc(opt->genre, (++opt->genre_count)*sizeof(char *));
opt->genre[opt->genre_count - 1] = strdup(optarg);
@@ -646,7 +643,7 @@
static void add_tag(vorbis_comment *vc, oe_options *opt,char *name, char *value)
{
char *utf8;
- if(utf8_encode(value, &utf8, opt->encoding) == 0)
+ if(convert_to_utf8(value, &utf8) >= 0)
{
if(name == NULL)
vorbis_comment_add(vc, utf8);
@@ -655,7 +652,7 @@
free(utf8);
}
else
- fprintf(stderr, "Couldn't convert comment to UTF8, cannot add\n");
+ fprintf(stderr, "Couldn't convert comment to UTF-8, cannot add\n");
}
static void build_comments(vorbis_comment *vc, oe_options *opt, int filenum,
diff -ru xiph.orig/vorbis-tools/share/Makefile.am xiph/vorbis-tools/share/Makefile.am
--- xiph.orig/vorbis-tools/share/Makefile.am Sun Sep 23 00:13:50 2001
+++ xiph/vorbis-tools/share/Makefile.am Sun Sep 30 20:31:57 2001
@@ -6,12 +6,11 @@
noinst_LIBRARIES = libutf8.a libgetopt.a
-libutf8_a_SOURCES = utf8.c
-MAP_FILES = 8859-1.map 8859-2.map
+libutf8_a_SOURCES = charset.c iconvert.c utf8.c
libgetopt_a_SOURCES = getopt.c getopt1.c
-EXTRA_DIST = $(MAP_FILES) charsetmap.h make_code_map.pl
+EXTRA_DIST = charmaps.h makemap.c charset_test.c
debug:
$(MAKE) all CFLAGS="@DEBUG@"
diff -ru xiph.orig/vorbis-tools/share/utf8.c xiph/vorbis-tools/share/utf8.c
--- xiph.orig/vorbis-tools/share/utf8.c Wed Sep 26 20:28:05 2001
+++ xiph/vorbis-tools/share/utf8.c Sun Sep 30 20:23:21 2001
@@ -1,30 +1,40 @@
-/* OggEnc
- *
- * This program is distributed under the GNU General Public License, version 2.
- * A copy of this license is included with this source.
- *
- * (C) 2001 Michael Smith <msmith at labyrinth.net.au>
+/*
+ * Copyright (C) 2001 Peter Harris <peter.harris at hummingbird.com>
+ * Copyright (C) 2001 Edmund Grimley Evans <edmundo at rano.org>
*
- * UTF-8 Conversion routines
- * Copyright (C) 2001, Daniel Resare <noa at metamatrix.se>
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+/*
+ * Convert a string between UTF-8 and the locale's charset.
*/
-#include <stdio.h>
#include <stdlib.h>
#include <string.h>
+
#include "utf8.h"
#ifdef _WIN32
+#include <stdio.h>
#include <windows.h>
-int utf8_encode(char *from, char **to, const char *encoding)
+int convert_to_utf8(const char *from, char **to)
{
/* Thanks to Peter Harris <peter.harris at hummingbird.com> for this win32
* code.
- *
- * We ignore 'encoding' and assume that the input is in the 'code page'
- * of the console. Reasonable, since oggenc is a console app.
*/
unsigned short *unicode;
@@ -36,14 +46,14 @@
if(wchars == 0)
{
fprintf(stderr, "Unicode translation error %d\n", GetLastError());
- return 1;
+ return -1;
}
unicode = calloc(wchars + 1, sizeof(unsigned short));
if(unicode == NULL)
{
fprintf(stderr, "Out of memory processing string to UTF8\n");
- return 1;
+ return -1;
}
err = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, from,
@@ -52,7 +62,7 @@
{
free(unicode);
fprintf(stderr, "Unicode translation error %d\n", GetLastError());
- return 1;
+ return -1;
}
/* On NT-based windows systems, we could use WideCharToMultiByte(), but
@@ -64,234 +74,101 @@
return 0;
}
-int utf8_decode(char *from, char **to, const char *encoding)
+int convert_from_utf8(const char *from, char **to)
{
- return 1; /* Dummy stub */
+ return -1; /* Dummy stub */
}
#else /* End win32. Rest is for real operating systems */
-#ifdef HAVE_ICONV
-#include <iconv.h>
-#include <errno.h>
+
+#ifdef HAVE_LANGINFO_CODESET
+#include <langinfo.h>
#endif
-#include "charsetmap.h"
+static char *current_charset = 0; /* means "US-ASCII" */
-#define BUFSIZE 256
+void convert_set_charset(const char *charset)
+{
-/*
- Converts the string FROM from the encoding specified in ENCODING
- to UTF-8. The resulting string i pointed to by *TO.
+#ifdef HAVE_LANGINFO_CODESET
+ if (!charset)
+ charset = nl_langinfo(CODESET);
+#endif
- Return values:
- 0 indicates a successfully converted string.
- 1 indicates that the given encoding is not available.
- 2 indicates that the given string is bigger than BUFSIZE and can therefore
- not be encoded.
- 3 indicates that given string could not be parsed.
-*/
-int utf8_encode(char *from, char **to, const char *encoding)
+ if (!charset)
+ charset = getenv("CHARSET");
+
+ free(current_charset);
+ current_charset = 0;
+ if (charset && *charset)
+ current_charset = strdup(charset);
+}
+
+static int convert_buffer(const char *fromcode, const char *tocode,
+ const char *from, size_t fromlen,
+ char **to, size_t *tolen)
{
+ int ret = -1;
+
#ifdef HAVE_ICONV
- static unsigned char buffer[BUFSIZE];
- char *from_p, *to_p;
- size_t from_left, to_left;
- iconv_t cd;
+ ret = iconvert(fromcode, tocode, from, fromlen, to, tolen);
+ if (ret != -1)
+ return ret;
#endif
- if (!strcasecmp(encoding, "UTF-8")) {
- /* ideally some checking of the given string should be done */
- *to = malloc(strlen(from) + 1);
- strcpy(*to, from);
- return 0;
- }
-
-#ifdef HAVE_ICONV
- cd = iconv_open("UTF-8", encoding);
- if(cd == (iconv_t)(-1))
- {
- if(errno == EINVAL) {
- /* if iconv can't encode from this encoding, try
- * simple_utf8_encode()
- */
- return simple_utf8_encode(from, to, encoding);
- } else {
- perror("iconv_open");
- }
- }
-
- from_left = strlen(from);
- to_left = BUFSIZE;
- from_p = from;
- to_p = buffer;
-
- if(iconv(cd, (ICONV_CONST char **)(&from_p), &from_left, &to_p,
- &to_left) == (size_t)-1)
- {
- iconv_close(cd);
- switch(errno)
- {
- case E2BIG:
- /* if the buffer is too small, try simple_utf8_encode()
- */
- return simple_utf8_encode(from, to, encoding);
- case EILSEQ:
- case EINVAL:
- return 3;
- default:
- perror("iconv");
- }
- }
- else
- {
- iconv_close(cd);
- }
- *to = malloc(BUFSIZE - to_left + 1);
- buffer[BUFSIZE - to_left] = 0;
- strcpy(*to, buffer);
- return 0;
-#else
- return simple_utf8_encode(from, to, encoding);
+#ifndef HAVE_ICONV /* should be ifdef USE_CHARSET_CONVERT */
+ ret = charset_convert(fromcode, tocode, from, fromlen, to, tolen);
+ if (ret != -1)
+ return ret;
#endif
+
+ return ret;
}
-/*
- This implementation has the following limitations: The given charset must
- represent each glyph with exactly one (1) byte. No multi byte or variable
- width charsets are allowed. (An exception to this i UTF-8 that is passed
- right through.) The glyhps in the charsets must have a unicode value equal
- to or less than 0xFFFF (this inclues pretty much everything). For a complete,
- free conversion implementation please have a look at libiconv.
-*/
-int simple_utf8_encode(const char *from, char **to, const char *encoding)
+static int convert_string(const char *fromcode, const char *tocode,
+ const char *from, char **to, char replace)
{
- /* can you always know this will be 16 bit? */
- unsigned short *unicode;
- charset_map *map;
- int index = 0;
- unsigned char c;
-
- unicode = calloc((strlen(from) + 1), sizeof(short));
-
- map = get_map(encoding);
-
- if (map == NULL)
- return 1;
+ int ret;
+ size_t fromlen;
+ char *s;
- c = from[index];
- while(c)
- {
- unicode[index] = map->mapping[c];
- index++;
- c = from[index];
- }
+ fromlen = strlen(from);
+ ret = convert_buffer(fromcode, tocode, from, fromlen, to, 0);
+ if (ret == -2)
+ return -1;
+ if (ret != -1)
+ return ret;
- *to = make_utf8_string(unicode);
- free(unicode);
- return 0;
+ s = malloc(fromlen + 1);
+ if (!s)
+ return -1;
+ strcpy(s, from);
+ *to = s;
+ for (; *s; s++)
+ if (*s & ~0x7f)
+ *s = replace;
+ return 3;
}
-int utf8_decode(char *from, char **to, const char *encoding)
+int convert_to_utf8(const char *from, char **to)
{
-#ifdef HAVE_ICONV
- static unsigned char buffer[BUFSIZE];
- char *from_p, *to_p;
- size_t from_left, to_left;
- iconv_t cd;
- cd = iconv_open(encoding, "UTF-8");
- if(cd == (iconv_t)(-1))
- {
- perror("iconv_open");
- }
-
- from_left = strlen(from);
- to_left = BUFSIZE;
- from_p = from;
- to_p = buffer;
-
- if(iconv(cd, (ICONV_CONST char **)(&from_p), &from_left, &to_p,
- &to_left) == (size_t)-1)
- {
- iconv_close(cd);
- switch(errno)
- {
- case E2BIG:
- case EILSEQ:
- case EINVAL:
- return 3;
- default:
- perror("iconv");
- }
- }
- else
- {
- iconv_close(cd);
- }
- *to = malloc(BUFSIZE - to_left + 1);
- buffer[BUFSIZE - to_left] = 0;
- strcpy(*to, buffer);
- return 0;
-#else
- return 1; /* Dummy stub */
-#endif /* HAVE_ICONV */
-}
+ char *charset;
-charset_map *get_map(const char *encoding)
-{
- charset_map *map_p = maps;
- while(map_p->name != NULL)
- {
- if(!strcasecmp(map_p->name, encoding))
- {
- return map_p;
- }
- map_p++;
- }
- return NULL;
+ if (!current_charset)
+ convert_set_charset(0);
+ charset = current_charset ? current_charset : "US-ASCII";
+ return convert_string(charset, "UTF-8", from, to, '#');
}
-#endif /* The rest is used by everthing */
-
-char *make_utf8_string(const unsigned short *unicode)
+int convert_from_utf8(const char *from, char **to)
{
- int size = 0, index = 0, out_index = 0;
- unsigned char *out;
- unsigned short c;
-
- /* first calculate the size of the target string */
- c = unicode[index++];
- while(c) {
- if(c < 0x0080) {
- size += 1;
- } else if(c < 0x0800) {
- size += 2;
- } else {
- size += 3;
- }
- c = unicode[index++];
- }
-
- out = malloc(size + 1);
- index = 0;
-
- c = unicode[index++];
- while(c)
- {
- if(c < 0x080) {
- out[out_index++] = c;
- } else if(c < 0x800) {
- out[out_index++] = 0xc0 | (c >> 6);
- out[out_index++] = 0x80 | (c & 0x3f);
- } else {
- out[out_index++] = 0xe0 | (c >> 12);
- out[out_index++] = 0x80 | ((c >> 6) & 0x3f);
- out[out_index++] = 0x80 | (c & 0x3f);
- }
- c = unicode[index++];
- }
- out[out_index] = 0x00;
+ char *charset;
- return out;
+ if (!current_charset)
+ convert_set_charset(0);
+ charset = current_charset ? current_charset : "US-ASCII";
+ return convert_string("UTF-8", charset, from, to, '?');
}
+#endif
diff -ru xiph.orig/vorbis-tools/vorbiscomment/vcomment.c xiph/vorbis-tools/vorbiscomment/vcomment.c
--- xiph.orig/vorbis-tools/vorbiscomment/vcomment.c Wed Sep 26 20:28:05 2001
+++ xiph/vorbis-tools/vorbiscomment/vcomment.c Sun Sep 30 19:04:03 2001
@@ -12,6 +12,7 @@
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
+#include <locale.h>
#include "getopt.h"
#include "utf8.h"
@@ -24,7 +25,6 @@
{"help",0,0,'h'},
{"quiet",0,0,'q'},
{"commentfile",1,0,'c'},
- {"encoding", 1,0,'e'},
{NULL,0,0,0}
};
@@ -37,7 +37,6 @@
int commentcount;
char **comments;
int tempoutfile;
- char *encoding;
} param_t;
#define MODE_NONE 0
@@ -47,8 +46,8 @@
/* prototypes */
void usage(void);
-void print_comments(FILE *out, vorbis_comment *vc, char *encoding);
-int add_comment(char *line, vorbis_comment *vc, char *encoding);
+void print_comments(FILE *out, vorbis_comment *vc);
+int add_comment(char *line, vorbis_comment *vc);
param_t *new_param(void);
void parse_options(int argc, char *argv[], param_t *param);
@@ -98,7 +97,7 @@
/* extract and display the comments */
vc = vcedit_comments(state);
- print_comments(param->com, vc, param->encoding);
+ print_comments(param->com, vc);
/* done */
vcedit_clear(state);
@@ -128,7 +127,7 @@
for(i=0; i < param->commentcount; i++)
{
- if(add_comment(param->comments[i], vc, param->encoding) < 0)
+ if(add_comment(param->comments[i], vc) < 0)
fprintf(stderr, "Bad comment: \"%s\"\n", param->comments[i]);
}
@@ -139,7 +138,7 @@
char *buf = (char *)malloc(sizeof(char)*1024);
while (fgets(buf, 1024, param->com))
- if (add_comment(buf, vc, param->encoding) < 0) {
+ if (add_comment(buf, vc) < 0) {
fprintf(stderr,
"bad comment: \"%s\"\n",
buf);
@@ -177,14 +176,14 @@
***********/
-void print_comments(FILE *out, vorbis_comment *vc, char *encoding)
+void print_comments(FILE *out, vorbis_comment *vc)
{
int i;
char *decoded_value;
for (i = 0; i < vc->comments; i++)
{
- if (utf8_decode(vc->user_comments[i], &decoded_value, encoding) == 0)
+ if (convert_from_utf8(vc->user_comments[i], &decoded_value) >= 0)
{
fprintf(out, "%s\n", decoded_value);
free(decoded_value);
@@ -197,7 +196,7 @@
/**********
Take a line of the form "TAG=value string", parse it, convert the
- value to UTF-8 from the specified encoding, and add it to the
+ value to UTF-8, and add it to the
vorbis_comment structure. Error checking is performed.
Note that this assumes a null-terminated string, which may cause
@@ -205,7 +204,7 @@
***********/
-int add_comment(char *line, vorbis_comment *vc, char *encoding)
+int add_comment(char *line, vorbis_comment *vc)
{
char *mark, *value, *utf8_value;
@@ -234,7 +233,7 @@
value++;
/* convert the value from the native charset to UTF-8 */
- if (utf8_encode(value, &utf8_value, encoding) == 0) {
+ if (convert_to_utf8(value, &utf8_value) >= 0) {
/* append the comment and return */
vorbis_comment_add_tag(vc, line, utf8_value);
@@ -307,9 +306,6 @@
param->comments=NULL;
param->tempoutfile=0;
- /* character encoding */
- param->encoding = "ISO-8859-1";
-
return param;
}
@@ -327,7 +323,9 @@
int ret;
int option_index = 1;
- while ((ret = getopt_long(argc, argv, "ae:lwhqc:t:",
+ setlocale(LC_ALL, "");
+
+ while ((ret = getopt_long(argc, argv, "alwhqc:t:",
long_options, &option_index)) != -1) {
switch (ret) {
case 0:
@@ -342,9 +340,6 @@
break;
case 'a':
param->mode = MODE_APPEND;
- break;
- case 'e':
- param->encoding = strdup(optarg);
break;
case 'h':
usage();
--- >8 ----