Simple UTF8->UTF16 string conversion with iconv

Question

I want to write a function to convert a UTF8 string to UTF16 (little-endian). The problem is, the iconv function does not seem to let you know in advance how many bytes you'll need to store the output string.

My solution is to start by allocating 2*strlen(utf8), and then run iconv in a loop, increasing the size of that buffer with realloc if necessary:

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int utf8_to_utf16le(char *utf8, char **utf16, int *utf16_len)
{
    iconv_t cd;
    char *inbuf, *outbuf;
    size_t inbytesleft, outbytesleft, nchars, utf16_buf_len;

    cd = iconv_open("UTF-16LE", "UTF-8");       // hyphenated names are the most widely accepted spelling
    if (cd == (iconv_t)-1) {
        printf("!%s: iconv_open failed: %d\n", __func__, errno);
        return -1;
    }

    inbytesleft = strlen(utf8);
    if (inbytesleft == 0) {
        printf("!%s: empty string\n", __func__);
        iconv_close(cd);
        return -1;
    }
    inbuf = utf8;
    utf16_buf_len = 2 * inbytesleft;            // sufficient in many cases, i.e. if the input string is ASCII
    *utf16 = malloc(utf16_buf_len);
    if (!*utf16) {
        printf("!%s: malloc failed\n", __func__);
        iconv_close(cd);
        return -1;
    }
    outbytesleft = utf16_buf_len;
    outbuf = *utf16;

    nchars = iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
    while (nchars == (size_t)-1 && errno == E2BIG) {
        char *ptr;
        size_t increase = 10;                   // increase length a bit
        size_t len = outbuf - *utf16;           // bytes written so far; taken before realloc can move the buffer
        utf16_buf_len += increase;
        outbytesleft += increase;
        ptr = realloc(*utf16, utf16_buf_len);
        if (!ptr) {
            printf("!%s: realloc failed\n", __func__);
            free(*utf16);                       // realloc failure leaves the old block valid
            iconv_close(cd);
            return -1;
        }
        *utf16 = ptr;
        outbuf = *utf16 + len;                  // re-point outbuf into the (possibly moved) buffer
        nchars = iconv(cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
    }
    if (nchars == (size_t)-1) {
        printf("!%s: iconv failed: %d\n", __func__, errno);
        free(*utf16);
        iconv_close(cd);
        return -1;
    }

    iconv_close(cd);
    *utf16_len = utf16_buf_len - outbytesleft;

    return 0;
}

Is this really the best way to do it? Repeated reallocs seem wasteful, but without knowing what character sequences could be in the UTF-8 input, and what they would result in in UTF-16, I don't know whether I can make a better guess for the initial buffer size than 2*strlen(utf8).

Recommended answer

This is the correct way to use iconv.

Remember that iconv is designed to recode from an arbitrary character encoding to another arbitrary character encoding. It supports any combination. Given this, there are fundamentally only two ways to know how much space you need for the output:

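  1. Make a guess. Do the conversion, and increase your guess if necessary.
  2. Do the conversion twice. The first time, just count, discarding the output. Allocate the total amount of space you counted, then do the conversion again.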

The first is what you do. The second one obviously has the disadvantage that you have to do the work twice. (By the way, you could do it the second way with iconv by using a scratchpad buffer in a local variable as the output buffer for the first pass.)
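
As a rough illustration of the second approach, the sketch below runs iconv once into a small local scratch buffer purely to count output bytes, then allocates exactly that much and converts again. The names count_output_bytes and utf8_to_utf16le_two_pass are hypothetical, not taken from the question. The sketch assumes a stateless target encoding such as UTF-16LE (so no final flush call is made) and that the iconv implementation accepts the names "UTF-16LE" and "UTF-8".

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Pass 1 (hypothetical helper): convert into a scratch buffer repeatedly,
 * keeping only the running byte count. Returns 0 on success, -1 on error. */
static int count_output_bytes(iconv_t cd, const char *in, size_t inlen, size_t *total)
{
    char scratch[256];
    char *inbuf = (char *)in;               /* iconv() does not modify the input bytes */
    size_t inleft = inlen;

    *total = 0;
    iconv(cd, NULL, NULL, NULL, NULL);      /* reset the conversion state */
    while (inleft > 0) {
        char *outbuf = scratch;
        size_t outleft = sizeof scratch;
        size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
        *total += sizeof scratch - outleft; /* count the output, then discard it */
        if (rc == (size_t)-1 && errno != E2BIG)
            return -1;                      /* EILSEQ or EINVAL: bad input */
    }
    return 0;
}

/* Pass 2 (hypothetical wrapper): allocate the counted size and convert again.
 * Returns a malloc'd buffer and stores its length in *outlen, or NULL on error. */
static char *utf8_to_utf16le_two_pass(const char *utf8, size_t *outlen)
{
    size_t inlen = strlen(utf8), need = 0;
    char *result = NULL;
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");

    if (cd == (iconv_t)-1)
        return NULL;
    if (count_output_bytes(cd, utf8, inlen, &need) == 0 &&
        (result = malloc(need ? need : 1)) != NULL) {
        char *inbuf = (char *)utf8, *outbuf = result;
        size_t inleft = inlen, outleft = need;

        iconv(cd, NULL, NULL, NULL, NULL);  /* reset again before the real pass */
        if (iconv(cd, &inbuf, &inleft, &outbuf, &outleft) == (size_t)-1) {
            free(result);
            result = NULL;
        } else {
            *outlen = need - outleft;
        }
    }
    iconv_close(cd);
    return result;
}
```

The trade-off is the one described above: no realloc is ever needed, but every input byte goes through iconv twice.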

There's really no other way. Either you know in advance how many characters (not bytes) there are in the input and how many of them are/aren't in the BMP; or you don't and you have to count them.

In this case you happen to know what the input and output encodings will be ahead of time. You could do a better job of guessing the amount of output buffer space you need if you do some UTF-8 gymnastics on the input string yourself before starting. This is a bit like the second option above, but more optimized because the necessary UTF-8 gymnastics are not as expensive as full-blown iconv.
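
Purely to show the shape of such a pre-pass, and under the strong assumption that the input has already been validated as well-formed UTF-8, a size estimate might look like the sketch below. The name utf16le_size_estimate is hypothetical. It only inspects lead bytes: every 1-, 2- or 3-byte sequence becomes one 16-bit code unit, and every 4-byte sequence becomes a surrogate pair. It performs no validation at all, so the E2BIG retry loop would still be needed as a safety net.

```c
#include <stddef.h>

/* Hypothetical helper: estimate the UTF-16LE size in bytes (no BOM, no
 * terminator) of a NUL-terminated string ASSUMED to be well-formed UTF-8. */
static size_t utf16le_size_estimate(const char *utf8)
{
    size_t units = 0;

    for (const unsigned char *p = (const unsigned char *)utf8; *p; p++) {
        if ((*p & 0xC0) == 0x80)
            continue;                   /* continuation byte: adds nothing */
        units += (*p >= 0xF0) ? 2       /* 4-byte sequence: surrogate pair */
                              : 1;      /* 1- to 3-byte sequence: one code unit */
    }
    return units * 2;                   /* two bytes per 16-bit code unit */
}
```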

Let me recommend that you don't do that, though. You'd still be making two passes on the input string so you wouldn't be saving that much, it would be a lot more code for you to write, and it introduces the possibility of a bug where the buffer could be undersized if the gymnastics aren't quite right.

I'm not even going to describe the gymnastics because what it really amounts to more or less is implementing a UTF-8 decoder, and, though the core of it is just a few simple cases of bit masking and shifting, there are details related to rejecting invalid sequences that are easy to get wrong in a way that has security implications. So don't do it.
