UTF-8 to UCS-2 conversion with the ICU library


Question

I'm hitting an issue converting a UTF-8 string to a UCS-2 string with the ICU library. The library offers several ways to do this, but so far none of them seems to work. Given how widely used the library is, I assume I'm doing something wrong.

First, the common code. In all cases I create a string and pass it along on an object; nothing manipulates it before it reaches the conversion steps.

The UTF-8 string currently being used is simply "ĩ".

For the sake of simplicity, the string is represented as uniString in this code:

UErrorCode resultCode = U_ZERO_ERROR;

UConverter* m_pConv = ucnv_open("ISO-8859-1", &resultCode);

// Change the callback to error out instead of the default            
const void* oldContext;
UConverterFromUCallback oldFromAction;
UConverterToUCallback oldToAction;
ucnv_setFromUCallBack(m_pConv, UCNV_FROM_U_CALLBACK_STOP, NULL, &oldFromAction, &oldContext, &resultCode);
ucnv_setToUCallBack(m_pConv, UCNV_TO_U_CALLBACK_STOP, NULL, &oldToAction, &oldContext, &resultCode);

int32_t outputLength = 0;
int bodySize = uniString.length();
int targetSize = bodySize * 4;
char* target = new char[targetSize];                       

printf("Body: %s\n", uniString.c_str());
if (U_SUCCESS(resultCode))
{
    // outputLength = ucnv_convert("ISO-8859-1", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
    outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
        uniString.length(), &resultCode);
    ucnv_close(m_pConv);
}
printf("ISO-8859-1 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(), 
    outputLength ? target : "invalid_char", resultCode, outputLength);

if (resultCode == U_INVALID_CHAR_FOUND || resultCode == U_ILLEGAL_CHAR_FOUND || resultCode == U_TRUNCATED_CHAR_FOUND)
{
    if (resultCode == U_INVALID_CHAR_FOUND)
    {
        printf("Unmapped input character, cannot be converted to Latin1");                    

        m_pConv = ucnv_open("UCS-2", &resultCode);
        if (U_SUCCESS(resultCode))
        {
            // outputLength = ucnv_convert("UCS-2", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
            outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
                uniString.length(), &resultCode);
            ucnv_close(m_pConv);
        }

        printf("UCS-2 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(), 
            outputLength ? target : "invalid_char", resultCode, outputLength);

        if (U_SUCCESS(resultCode))
        {
            pdus = SegmentText(target, pText, SEGMENT_SIZE_UNICODE_MAX, true);
        }
    }
    else
    {
        printf("DecodeText(): Text contents does not appear to be valid UTF-8");
    }
}
else
{
    printf("DecodeText(): Text successfully converted to Latin1");
    std::string newBody(target, outputLength);
    pdus = SegmentText(newBody, pPdu, SEGMENT_SIZE_MAX);
}

The problem is that ucnv_fromAlgorithmic reports the error U_INVALID_CHAR_FOUND for the UCS-2 conversion. That makes sense for the ISO-8859-1 attempt, but not for UCS-2.
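
As a cross-check on that expectation, the sketch below decodes the UTF-8 bytes of "ĩ" by hand and tests the resulting code point against the ISO-8859-1 and UCS-2 ranges. The helper names firstCodePoint, fitsLatin1 and fitsUcs2 are hypothetical and not part of ICU; the decoder is deliberately minimal (1- to 3-byte sequences only, no rejection of overlong forms).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode the first code point of a UTF-8 string. Handles only 1- to 3-byte
// sequences (enough for the Basic Multilingual Plane) and returns 0xFFFFFFFF
// on anything else. Hypothetical helper for illustration; not an ICU function.
std::uint32_t firstCodePoint(const std::string& s) {
    const std::uint32_t kInvalid = 0xFFFFFFFFu;
    if (s.empty()) return kInvalid;
    auto b = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };
    auto isCont = [&](std::size_t i) {
        return i < s.size() && (b(i) & 0xC0) == 0x80;
    };
    if (b(0) < 0x80) return b(0);                       // 1 byte: ASCII
    if ((b(0) & 0xE0) == 0xC0 && isCont(1))             // 2 bytes
        return ((b(0) & 0x1Fu) << 6) | (b(1) & 0x3Fu);
    if ((b(0) & 0xF0) == 0xE0 && isCont(1) && isCont(2)) // 3 bytes
        return ((b(0) & 0x0Fu) << 12) | ((b(1) & 0x3Fu) << 6) | (b(2) & 0x3Fu);
    return kInvalid;
}

// ISO-8859-1 maps exactly the code points U+0000..U+00FF.
bool fitsLatin1(std::uint32_t cp) { return cp <= 0xFFu; }

// UCS-2 holds one 16-bit unit per character, so it covers the BMP
// (U+0000..U+FFFF) minus the surrogate range.
bool fitsUcs2(std::uint32_t cp) {
    return cp <= 0xFFFFu && !(cp >= 0xD800u && cp <= 0xDFFFu);
}
```

"ĩ" is U+0129, encoded in UTF-8 as the bytes C4 A9, so firstCodePoint("\xC4\xA9") yields 0x0129. That value fails fitsLatin1 and passes fitsUcs2, which matches the expectation that only the ISO-8859-1 attempt should report an unmappable character.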

The other attempt was to use ucnv_convert, which you can see commented out. That function attempted the conversion, but it did not fail on the ISO-8859-1 attempt as it should have.

So the question is: does anyone with experience of these functions see something incorrect here, or is my assumption about how this character should convert wrong?

Recommended answer

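You need to reset resultCode to U_ZERO_ERROR before calling ucnv_open for the UCS-2 converter. Otherwise resultCode still holds the U_INVALID_CHAR_FOUND left by the ISO-8859-1 attempt, so the UCS-2 calls return without doing anything and the stale error is what gets printed. Quoting the manual: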

"ICU functions that take a reference (C++) or a pointer (C) to a UErrorCode first test if(U_FAILURE(errorCode)) { return immediately; } so that in a chain of such functions the first one that sets an error code causes the following ones to not perform any operation"

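To make the effect of that convention concrete, here is a small model of it that needs no ICU installation. Everything in it (Status, failed, Converter, openConverter, convert, convertWithFallback) is a hypothetical stand-in and not ICU API; it only mimics the "return immediately if the error code already holds a failure" rule and the Latin-1-then-UCS-2 fallback from the question.

```cpp
#include <string>

// Stand-ins that only model the error-code convention quoted above.
// None of these names are ICU API.
enum Status { kOk = 0, kInvalidChar = 1 };

inline bool failed(Status s) { return s != kOk; }

struct Converter { std::string name; };

// Modeled on the convention: do nothing if *status already holds a failure.
Converter* openConverter(const std::string& name, Status* status) {
    if (failed(*status)) return nullptr;  // stale error: return immediately
    return new Converter{name};
}

// Pretend conversion: the Latin-1 converter cannot map code points above 0xFF.
// Returns the number of bytes "written", or 0 if nothing was done.
int convert(Converter* conv, unsigned codePoint, Status* status) {
    if (failed(*status) || conv == nullptr) return 0;
    if (conv->name == "ISO-8859-1") {
        if (codePoint > 0xFF) { *status = kInvalidChar; return 0; }
        return 1;
    }
    return 2;  // one 16-bit unit for the UCS-2 stand-in
}

// Try Latin-1 first and fall back to UCS-2 on an unmappable character.
// resetBeforeFallback switches the answer's fix on or off.
Status convertWithFallback(unsigned codePoint, bool resetBeforeFallback) {
    Status status = kOk;
    Converter* conv = openConverter("ISO-8859-1", &status);
    convert(conv, codePoint, &status);
    delete conv;
    if (status != kInvalidChar) return status;

    if (resetBeforeFallback) status = kOk;  // the reset the answer asks for
    conv = openConverter("UCS-2", &status);  // no-op if status is still a failure
    convert(conv, codePoint, &status);
    delete conv;                             // deleting nullptr is harmless
    return status;
}
```

In this model, convertWithFallback(0x0129, false) still ends with kInvalidChar because the second open never runs, while convertWithFallback(0x0129, true) ends with kOk. In the question's code the corresponding change would be assigning resultCode = U_ZERO_ERROR; immediately before ucnv_open("UCS-2", &resultCode).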