Conversion from ISO-8859-15 (Latin-9) to UTF-8?


Question

I need to convert some strings encoded in the Latin-9 character set to UTF-8. I cannot use iconv because it is not included on my embedded system. Do you know of any available code for this?

Recommended answer

Code points 1 to 127 are encoded identically, as a single byte, in both Latin-9 (ISO-8859-15) and UTF-8.

Code point 164 in Latin-9 is U+20AC, \xe2\x82\xac = 226 130 172 in UTF-8.
Code point 166 in Latin-9 is U+0160, \xc5\xa0 = 197 160 in UTF-8.
Code point 168 in Latin-9 is U+0161, \xc5\xa1 = 197 161 in UTF-8.
Code point 180 in Latin-9 is U+017D, \xc5\xbd = 197 189 in UTF-8.
Code point 184 in Latin-9 is U+017E, \xc5\xbe = 197 190 in UTF-8.
Code point 188 in Latin-9 is U+0152, \xc5\x92 = 197 146 in UTF-8.
Code point 189 in Latin-9 is U+0153, \xc5\x93 = 197 147 in UTF-8.
Code point 190 in Latin-9 is U+0178, \xc5\xb8 = 197 184 in UTF-8.

Code points 128 .. 191 (except for those listed above) in Latin-9 all map to \xc2\x80 .. \xc2\xbf = 194 128 .. 194 191 in UTF-8.

Code points 192 .. 255 in Latin-9 all map to \xc3\x80 .. \xc3\xbf = 195 128 .. 195 191 in UTF-8.
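
The byte values above do not have to be memorized; they follow from the Unicode code point and the ordinary UTF-8 bit layout. As a rough sketch (the names latin9_to_codepoint and utf8_encode_bmp are made up for this illustration and are not part of the answer's code), the mapping can be derived like this:

```c
#include <stddef.h>

/* Hypothetical helper: map one Latin-9 byte to its Unicode code point.
 * Only eight byte values differ from Latin-1 / Unicode. */
unsigned int latin9_to_codepoint(unsigned char c)
{
    switch (c) {
    case 164: return 0x20AC;  /* euro sign */
    case 166: return 0x0160;
    case 168: return 0x0161;
    case 180: return 0x017D;
    case 184: return 0x017E;
    case 188: return 0x0152;
    case 189: return 0x0153;
    case 190: return 0x0178;
    default:  return c;       /* same value as the Unicode code point */
    }
}

/* Hypothetical helper: encode a code point below U+10000 as UTF-8.
 * Writes 1 to 3 bytes to out and returns how many were written. */
size_t utf8_encode_bmp(unsigned int cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Feeding byte 164 through both helpers gives the three bytes 226 130 172, and any byte in 192 .. 255 comes out as 195 followed by the byte minus 64, which is the shortcut the hand-written mapping relies on.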

This means that Latin-9 code points 1..127 are one byte long in UTF-8, code point 164 is three bytes long, and the rest (128..163 and 165..255) are two bytes long.

If you first scan the Latin-9 input string, you can determine the length of the resulting UTF-8 string. If you want or need to (you are working on an embedded system, after all), you can then do the conversion in place by working backwards from the end towards the start, provided the buffer has room for the longer result.
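
A minimal sketch of that in-place approach might look like the following. The function name latin9_to_utf8_inplace, its capacity parameter, and the (size_t)-1 error convention are assumptions made for this illustration; the answer itself only provides the allocating versions.

```c
#include <stddef.h>

/* Hypothetical sketch: convert the NUL-terminated Latin-9 string in buf
 * to UTF-8 in place. capacity is the total size of buf in bytes.
 * Returns the UTF-8 length (excluding the NUL), or (size_t)-1 if buf is
 * NULL or too small; in that case buf is left unchanged. */
size_t latin9_to_utf8_inplace(char *buf, size_t capacity)
{
    unsigned char *const b = (unsigned char *)buf;
    size_t len, out = 0, i, o;

    if (!b)
        return (size_t)-1;

    /* Pass 1: scan forwards to find the length of the UTF-8 result. */
    for (len = 0; b[len]; len++)
        out += (b[len] < 128) ? 1 : (b[len] == 164) ? 3 : 2;

    if (out + 1 > capacity)
        return (size_t)-1;

    /* Pass 2: walk backwards. The write index never drops below the
     * read index, so bytes not yet read are never overwritten. */
    i = len;
    o = out;
    b[o] = 0;
    while (i > 0) {
        const unsigned char c = b[--i];
        unsigned int cp;

        if (c < 128) {
            b[--o] = c;
            continue;
        }
        switch (c) {
        case 164: cp = 0x20AC; break;
        case 166: cp = 0x0160; break;
        case 168: cp = 0x0161; break;
        case 180: cp = 0x017D; break;
        case 184: cp = 0x017E; break;
        case 188: cp = 0x0152; break;
        case 189: cp = 0x0153; break;
        case 190: cp = 0x0178; break;
        default:  cp = c;      break;
        }
        /* Trailing bytes are written first because we move backwards. */
        b[--o] = (unsigned char)(0x80 | (cp & 0x3F));
        if (cp < 0x800) {
            b[--o] = (unsigned char)(0xC0 | (cp >> 6));
        } else {
            b[--o] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            b[--o] = (unsigned char)(0xE0 | (cp >> 12));
        }
    }
    return out;
}
```

The trade-off is that the caller has to size the buffer for the worst case up front (three bytes per input byte plus the terminator if euro signs can appear everywhere), in exchange for needing no heap allocation at all.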

Here are two functions you can use for conversion in either direction. Each returns a dynamically allocated copy that you need to free() after use. They return NULL only when an error occurs (out of memory, errno == ENOMEM). If given a NULL or empty string to convert, the functions return an empty dynamically allocated string.

In other words, you should always call free() on the pointer returned by these functions when you are done with it. (free(NULL) is allowed and does nothing.)

latin9_to_utf8() has been verified to produce exactly the same output as iconv if the input contains no zero bytes. The function uses standard C strings, i.e. a zero byte indicates the end of the string.

utf8_to_latin9() has been verified to produce exactly the same output as iconv if the input contains only Unicode code points that also exist in ISO-8859-15, and no zero bytes. When given arbitrary UTF-8 strings, the function maps the eight Latin-1 code points that Latin-9 replaced to the Latin-9 character at the same byte value, e.g. the currency sign becomes the euro sign; iconv either ignores them or treats them as errors.

This utf8_to_latin9() behaviour means that the functions are suitable for both Latin-1 -> UTF-8 -> Latin-1 and Latin-9 -> UTF-8 -> Latin-9 round trips.

#include <stdlib.h>     /* for malloc(), realloc() and free() */
#include <string.h>     /* for memset() */
#include <errno.h>      /* for errno */

/* Create a dynamically allocated copy of string,
 * changing the encoding from ISO-8859-15 to UTF-8.
*/
char *latin9_to_utf8(const char *const string)
{
    char   *result;
    size_t  n = 0;

    if (string) {
        const unsigned char  *s = (const unsigned char *)string;

        while (*s)
            if (*s < 128) {
                s++;
                n += 1;
            } else
            if (*s == 164) {
                s++;
                n += 3;
            } else {
                s++;
                n += 2;
            }
    }

    /* Allocate n+1 (to n+7) bytes for the converted string. */
    result = malloc((n | 7) + 1);
    if (!result) {
        errno = ENOMEM;
        return NULL;
    }

    /* Clear the tail of the string, setting the trailing NUL. */
    memset(result + (n | 7) - 7, 0, 8);

    if (n) {
        const unsigned char  *s = (const unsigned char *)string;
        unsigned char        *d = (unsigned char *)result;

        while (*s)
            if (*s < 128) {
                *(d++) = *(s++);
            } else
            if (*s < 192) switch (*s) {
                case 164: *(d++) = 226; *(d++) = 130; *(d++) = 172; s++; break;
                case 166: *(d++) = 197; *(d++) = 160; s++; break;
                case 168: *(d++) = 197; *(d++) = 161; s++; break;
                case 180: *(d++) = 197; *(d++) = 189; s++; break;
                case 184: *(d++) = 197; *(d++) = 190; s++; break;
                case 188: *(d++) = 197; *(d++) = 146; s++; break;
                case 189: *(d++) = 197; *(d++) = 147; s++; break;
                case 190: *(d++) = 197; *(d++) = 184; s++; break;
                default:  *(d++) = 194; *(d++) = *(s++); break;
            } else {
                *(d++) = 195;
                *(d++) = *(s++) - 64;
            }
    }

    /* Done. Remember to free() the resulting string when no longer needed. */
    return result;
}

/* Create a dynamically allocated copy of string,
 * changing the encoding from UTF-8 to ISO-8859-15.
 * Unsupported code points are ignored.
*/
char *utf8_to_latin9(const char *const string)
{
    size_t         size = 0;
    size_t         used = 0;
    unsigned char *result = NULL;

    if (string) {
        const unsigned char  *s = (const unsigned char *)string;

        while (*s) {

            if (used >= size) {
                void *const old = result;

                size = (used | 255) + 257;
                result = realloc(result, size);
                if (!result) {
                    if (old)
                        free(old);
                    errno = ENOMEM;
                    return NULL;
                }
            }

            if (*s < 128) {
                result[used++] = *(s++);
                continue;

            } else
            if (s[0] == 226 && s[1] == 130 && s[2] == 172) {
                result[used++] = 164;
                s += 3;
                continue;

            } else
            if (s[0] == 194 && s[1] >= 128 && s[1] <= 191) {
                result[used++] = s[1];
                s += 2;
                continue;

            } else
            if (s[0] == 195 && s[1] >= 128 && s[1] <= 191) {
                result[used++] = s[1] + 64;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 160) {
                result[used++] = 166;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 161) {
                result[used++] = 168;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 189) {
                result[used++] = 180;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 190) {
                result[used++] = 184;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 146) {
                result[used++] = 188;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 147) {
                result[used++] = 189;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 184) {
                result[used++] = 190;
                s += 2;
                continue;

            }

            if (s[0] >= 192 && s[0] < 224 &&
                s[1] >= 128 && s[1] < 192) {
                s += 2;
                continue;
            } else
            if (s[0] >= 224 && s[0] < 240 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192) {
                s += 3;
                continue;
            } else
            if (s[0] >= 240 && s[0] < 248 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192) {
                s += 4;
                continue;
            } else
            if (s[0] >= 248 && s[0] < 252 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192 &&
                s[4] >= 128 && s[4] < 192) {
                s += 5;
                continue;
            } else
            if (s[0] >= 252 && s[0] < 254 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192 &&
                s[4] >= 128 && s[4] < 192 &&
                s[5] >= 128 && s[5] < 192) {
                s += 6;
                continue;
            }

            s++;
        }
    }

    {
        void *const old = result;

        size = (used | 7) + 1;

        result = realloc(result, size);
        if (!result) {
            if (old)
                free(old);
            errno = ENOMEM;
            return NULL;
        }

        memset(result + used, 0, size - used);
    }

    return (char *)result;
}

While iconv() is the correct solution for character set conversions in general, the two functions above are certainly useful in an embedded or otherwise constrained environment.
