Conversion from ISO-8859-15 (Latin-9) to UTF-8?
Question
I need to convert some strings encoded in the Latin-9 character set to UTF-8. I cannot use iconv because it is not included in my embedded system. Is there any code available for this?
Recommended answer
Code points 1 to 127 are the same in both Latin-9 (ISO-8859-15) and UTF-8.
Code point 164 in Latin-9 is U+20AC, \xe2\x82\xac = 226 130 172 in UTF-8.
Code point 166 in Latin-9 is U+0160, \xc5\xa0 = 197 160 in UTF-8.
Code point 168 in Latin-9 is U+0161, \xc5\xa1 = 197 161 in UTF-8.
Code point 180 in Latin-9 is U+017D, \xc5\xbd = 197 189 in UTF-8.
Code point 184 in Latin-9 is U+017E, \xc5\xbe = 197 190 in UTF-8.
Code point 188 in Latin-9 is U+0152, \xc5\x92 = 197 146 in UTF-8.
Code point 189 in Latin-9 is U+0153, \xc5\x93 = 197 147 in UTF-8.
Code point 190 in Latin-9 is U+0178, \xc5\xb8 = 197 184 in UTF-8.
Code points 128 .. 191 (except for those listed above) in Latin-9 all map to \xc2\x80 .. \xc2\xbf = 194 128 .. 194 191 in UTF-8.
Code points 192 .. 255 in Latin-9 all map to \xc3\x80 .. \xc3\xbf = 195 128 .. 195 191 in UTF-8.
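The byte values listed above are not arbitrary: they follow from looking up the Unicode code point for each Latin-9 byte and then applying the ordinary UTF-8 encoding rules. The following is a minimal sketch of that two-step view; the names latin9_to_codepoint() and utf8_encode() are made up for illustration and are not part of the answer's code.

```c
#include <stddef.h>

/* Map one Latin-9 byte to its Unicode code point.
 * Only eight positions differ from Latin-1, where the byte value
 * equals the code point. (Hypothetical helper name.) */
static unsigned int latin9_to_codepoint(unsigned char c)
{
    switch (c) {
    case 164: return 0x20AC; /* euro sign */
    case 166: return 0x0160;
    case 168: return 0x0161;
    case 180: return 0x017D;
    case 184: return 0x017E;
    case 188: return 0x0152;
    case 189: return 0x0153;
    case 190: return 0x0178;
    default:  return c;
    }
}

/* Encode a code point below U+10000 as UTF-8 into out[].
 * Returns the number of bytes written (1, 2 or 3).
 * (Hypothetical helper name; enough for every Latin-9 character.) */
static size_t utf8_encode(unsigned int cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

For code points 128 .. 255 the two-byte branch yields a lead byte of 194 or 195, which is where the fixed prefixes in the ranges above come from.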
This means that Latin-9 code points 1..127 are one byte long in UTF-8, code point 164 is three bytes long, and the rest (128..163 and 165..255) are two bytes long.
If you first scan the Latin-9 input string, you can determine the length of the resulting UTF-8 string. If you want or need to -- you're working on an embedded system, after all -- you can then do the conversion in-place, by working backwards from the end towards the start.
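The in-place variant is only described in words here, so the following is a rough sketch of how it might look. It assumes the caller passes a writable buffer together with its total capacity; the name latin9_to_utf8_inplace() and the (size_t)-1 error convention are assumptions made for this example.

```c
#include <stddef.h>

/* Convert the NUL-terminated Latin-9 string in buf to UTF-8 in place.
 * capacity is the total size of buf in bytes.
 * Returns the UTF-8 length (excluding the NUL), or (size_t)-1 if buf
 * is too small, in which case buf is left unchanged.
 * (Hypothetical function; not part of the original answer.) */
static size_t latin9_to_utf8_inplace(char *buf, size_t capacity)
{
    unsigned char *const b = (unsigned char *)buf;
    size_t len, need = 0, src, dst;

    /* Pass 1: find the Latin-9 length and the required UTF-8 length. */
    for (len = 0; b[len]; len++)
        need += (b[len] < 128) ? 1 : (b[len] == 164) ? 3 : 2;

    if (need + 1 > capacity)
        return (size_t)-1;

    /* Pass 2: fill from the end towards the start. Each character grows
     * to at least one byte, so dst never drops below src and unread
     * input is never overwritten. */
    src = len;
    dst = need;
    b[dst] = 0;
    while (src > 0) {
        const unsigned char c = b[--src];
        if (c < 128) {
            b[--dst] = c;
        } else if (c == 164) {
            b[--dst] = 172; b[--dst] = 130; b[--dst] = 226;
        } else {
            unsigned char lead = 197, tail;
            switch (c) {
            case 166: tail = 160; break;
            case 168: tail = 161; break;
            case 180: tail = 189; break;
            case 184: tail = 190; break;
            case 188: tail = 146; break;
            case 189: tail = 147; break;
            case 190: tail = 184; break;
            default:
                lead = (c < 192) ? 194 : 195;
                tail = (c < 192) ? c : (unsigned char)(c - 64);
                break;
            }
            b[--dst] = tail;
            b[--dst] = lead;
        }
    }
    return need;
}
```

The trailing bytes are written in reverse order because the destination index moves backwards. Checking the capacity before touching the buffer keeps the failure case harmless.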
Here are two functions you can use for the conversion in either direction. Each returns a dynamically allocated copy that you need to free() after use. They only return NULL when an error occurs (out of memory, errno == ENOMEM). If given a NULL or empty string to convert from, the functions return an empty dynamically allocated string.
In other words, you should always call free() on the pointer returned by these functions when you are done with it. (free(NULL) is allowed and does nothing.)
The latin9_to_utf8() function has been verified to produce exactly the same output as iconv if the input contains no zero bytes. The function uses standard C strings, i.e. a zero byte indicates the end of the string.
The utf8_to_latin9() function has been verified to produce exactly the same output as iconv if the input contains only Unicode code points that also exist in ISO-8859-15, and no zero bytes. When given random UTF-8 strings, the function maps the eight Latin-1 code points that Latin-9 replaced to the Latin-9 characters at the same positions, e.g. the currency sign to the euro sign; iconv either ignores them or treats them as errors.
This utf8_to_latin9() behaviour means that the functions are suitable for both Latin-1 -> UTF-8 -> Latin-1 and Latin-9 -> UTF-8 -> Latin-9 round trips.
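To make the difference between the lenient behaviour and a strict, iconv-like one concrete, here is a small sketch that works on a single decoded code point. The function name codepoint_to_latin9() and its lenient flag are invented for this illustration; the answer's code works on UTF-8 bytes directly and always behaves leniently.

```c
/* Map a Unicode code point to a Latin-9 byte value (0..255),
 * or return -1 if it cannot be represented.
 * With lenient != 0, the eight Latin-1 characters that Latin-9 dropped
 * keep their byte value, so U+00A4 (currency sign) ends up as byte 164,
 * which Latin-9 displays as the euro sign.
 * (Hypothetical helper; not part of the original answer.) */
static int codepoint_to_latin9(unsigned int cp, int lenient)
{
    /* The eight characters that Latin-9 added. */
    switch (cp) {
    case 0x20AC: return 164;
    case 0x0160: return 166;
    case 0x0161: return 168;
    case 0x017D: return 180;
    case 0x017E: return 184;
    case 0x0152: return 188;
    case 0x0153: return 189;
    case 0x0178: return 190;
    }

    if (cp >= 256)
        return -1;

    /* The eight Latin-1 characters that Latin-9 replaced. */
    switch (cp) {
    case 164: case 166: case 168: case 180:
    case 184: case 188: case 189: case 190:
        return lenient ? (int)cp : -1;
    }

    /* Everything else is shared between Latin-1 and Latin-9. */
    return (int)cp;
}
```

In lenient mode a Latin-1 byte survives a trip through UTF-8 and back unchanged, which is what makes the Latin-1 round trip possible.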
#include <stdlib.h> /* for malloc(), realloc() and free() */
#include <string.h> /* for memset() */
#include <errno.h>  /* for errno */

/* Create a dynamically allocated copy of string,
 * changing the encoding from ISO-8859-15 to UTF-8.
 */
char *latin9_to_utf8(const char *const string)
{
    char   *result;
    size_t  n = 0;

    /* First pass: count the bytes needed for the UTF-8 output. */
    if (string) {
        const unsigned char *s = (const unsigned char *)string;

        while (*s)
            if (*s < 128) {
                s++;
                n += 1;
            } else
            if (*s == 164) {
                s++;
                n += 3;
            } else {
                s++;
                n += 2;
            }
    }

    /* Allocate n+1 (to n+7) bytes for the converted string. */
    result = malloc((n | 7) + 1);
    if (!result) {
        errno = ENOMEM;
        return NULL;
    }

    /* Clear the tail of the string, setting the trailing NUL. */
    memset(result + (n | 7) - 7, 0, 8);

    /* Second pass: write the UTF-8 bytes. */
    if (n) {
        const unsigned char *s = (const unsigned char *)string;
        unsigned char       *d = (unsigned char *)result;

        while (*s)
            if (*s < 128) {
                *(d++) = *(s++);
            } else
            if (*s < 192) switch (*s) {
                case 164: *(d++) = 226; *(d++) = 130; *(d++) = 172; s++; break;
                case 166: *(d++) = 197; *(d++) = 160; s++; break;
                case 168: *(d++) = 197; *(d++) = 161; s++; break;
                case 180: *(d++) = 197; *(d++) = 189; s++; break;
                case 184: *(d++) = 197; *(d++) = 190; s++; break;
                case 188: *(d++) = 197; *(d++) = 146; s++; break;
                case 189: *(d++) = 197; *(d++) = 147; s++; break;
                case 190: *(d++) = 197; *(d++) = 184; s++; break;
                default:  *(d++) = 194; *(d++) = *(s++); break;
            } else {
                *(d++) = 195;
                *(d++) = *(s++) - 64;
            }
    }

    /* Done. Remember to free() the resulting string when no longer needed. */
    return result;
}

/* Create a dynamically allocated copy of string,
 * changing the encoding from UTF-8 to ISO-8859-15.
 * Unsupported code points are ignored.
 */
char *utf8_to_latin9(const char *const string)
{
    size_t         size = 0;
    size_t         used = 0;
    unsigned char *result = NULL;

    if (string) {
        const unsigned char *s = (const unsigned char *)string;

        while (*s) {

            if (used >= size) {
                void *const old = result;

                size = (used | 255) + 257;
                result = realloc(result, size);
                if (!result) {
                    if (old)
                        free(old);
                    errno = ENOMEM;
                    return NULL;
                }
            }

            if (*s < 128) {
                result[used++] = *(s++);
                continue;

            } else
            if (s[0] == 226 && s[1] == 130 && s[2] == 172) {
                result[used++] = 164;
                s += 3;
                continue;

            } else
            if (s[0] == 194 && s[1] >= 128 && s[1] <= 191) {
                result[used++] = s[1];
                s += 2;
                continue;

            } else
            if (s[0] == 195 && s[1] >= 128 && s[1] <= 191) {
                result[used++] = s[1] + 64;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 160) {
                result[used++] = 166;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 161) {
                result[used++] = 168;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 189) {
                result[used++] = 180;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 190) {
                result[used++] = 184;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 146) {
                result[used++] = 188;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 147) {
                result[used++] = 189;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 184) {
                result[used++] = 190;
                s += 2;
                continue;
            }

            /* Skip well-formed multibyte sequences that have no Latin-9 equivalent. */
            if (s[0] >= 192 && s[0] < 224 &&
                s[1] >= 128 && s[1] < 192) {
                s += 2;
                continue;
            } else
            if (s[0] >= 224 && s[0] < 240 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192) {
                s += 3;
                continue;
            } else
            if (s[0] >= 240 && s[0] < 248 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192) {
                s += 4;
                continue;
            } else
            if (s[0] >= 248 && s[0] < 252 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192 &&
                s[4] >= 128 && s[4] < 192) {
                s += 5;
                continue;
            } else
            if (s[0] >= 252 && s[0] < 254 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192 &&
                s[4] >= 128 && s[4] < 192 &&
                s[5] >= 128 && s[5] < 192) {
                s += 6;
                continue;
            }

            /* Invalid byte: skip it. */
            s++;
        }
    }

    /* Trim the buffer and clear the tail, setting the trailing NUL. */
    {
        void *const old = result;

        size = (used | 7) + 1;

        result = realloc(result, size);
        if (!result) {
            if (old)
                free(old);
            errno = ENOMEM;
            return NULL;
        }

        memset(result + used, 0, size - used);
    }

    return (char *)result;
}

While iconv() is the correct solution for character set conversions in general, the two functions above are certainly useful in an embedded or otherwise constrained environment.