UTF-8 to UCS-2 conversion with icu library

Views: 527 | Published: 2020/7/13 5:29:37 | Tags: c++ unicode utf-8 icu ucs2


Problem description

I'm currently hitting an issue converting a UTF-8 string to a UCS-2 string with the ICU library. The library offers several ways to do this, but so far none of them seems to work. Given how popular the library is, I assume I'm doing something wrong.

First, the common code. In all cases I create a string and pass it in on an object, and nothing manipulates it before it reaches the conversion steps.

The UTF-8 string currently being used is simply "ĩ".
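For reference, "ĩ" is U+0129 (LATIN SMALL LETTER I WITH TILDE). It is encoded in UTF-8 as the two bytes 0xC4 0xA9, it lies outside the ISO-8859-1 range (U+0000 to U+00FF), and it lies inside the Basic Multilingual Plane, so a single UCS-2 code unit 0x0129 can represent it. The sketch below checks this by hand without ICU. The helpers `utf8ToUcs2` and `fitsLatin1` are hypothetical names made up for this illustration, and the decoder is deliberately simplified.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of ICU): decodes UTF-8 into UCS-2 code units.
// Returns false on malformed input or on code points above U+FFFF, which
// UCS-2 cannot represent. Simplified: overlong encodings and surrogate code
// points are not rejected.
bool utf8ToUcs2(const std::string& in, std::vector<std::uint16_t>& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F;
            extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F;
            extra = 2;
        } else if ((lead & 0xF8) == 0xF0) {
            cp = lead & 0x07;
            extra = 3;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (in.size() - i <= extra) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0xFFFF) return false;  // outside the BMP, not valid UCS-2
        out.push_back(static_cast<std::uint16_t>(cp));
        i += extra + 1;
    }
    return true;
}

// Hypothetical helper: true if every code unit is representable in ISO-8859-1.
bool fitsLatin1(const std::vector<std::uint16_t>& units) {
    for (std::uint16_t u : units) {
        if (u > 0xFF) return false;
    }
    return true;
}
```

Calling `utf8ToUcs2("\xC4\xA9", units)` yields one code unit, 0x0129, and `fitsLatin1(units)` is false. That matches the expectation in this question: the Latin-1 conversion should fail for this character, and the UCS-2 conversion should succeed.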

For the sake of simplicity, the string is represented as uniString in this code:

UErrorCode resultCode = U_ZERO_ERROR;

UConverter* m_pConv = ucnv_open("ISO-8859-1", &resultCode);

// Change the callback to error out instead of the default            
const void* oldContext;
UConverterFromUCallback oldFromAction;
UConverterToUCallback oldToAction;
ucnv_setFromUCallBack(m_pConv, UCNV_FROM_U_CALLBACK_STOP, NULL, &oldFromAction, &oldContext, &resultCode);
ucnv_setToUCallBack(m_pConv, UCNV_TO_U_CALLBACK_STOP, NULL, &oldToAction, &oldContext, &resultCode);

int32_t outputLength = 0;
int bodySize = uniString.length();
int targetSize = bodySize * 4;
char* target = new char[targetSize];                       

printf("Body: %s\n", uniString.c_str());
if (U_SUCCESS(resultCode))
{
    // outputLength = ucnv_convert("ISO-8859-1", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
    outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
        uniString.length(), &resultCode);
    ucnv_close(m_pConv);
}
printf("ISO-8859-1 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(), 
    outputLength ? target : "invalid_char", resultCode, outputLength);

if (resultCode == U_INVALID_CHAR_FOUND || resultCode == U_ILLEGAL_CHAR_FOUND || resultCode == U_TRUNCATED_CHAR_FOUND)
{
    if (resultCode == U_INVALID_CHAR_FOUND)
    {
        printf("Unmapped input character, cannot be converted to Latin1");                    

        m_pConv = ucnv_open("UCS-2", &resultCode);
        if (U_SUCCESS(resultCode))
        {
            // outputLength = ucnv_convert("UCS-2", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
            outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
                uniString.length(), &resultCode);
            ucnv_close(m_pConv);
        }

        printf("UCS-2 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(), 
            outputLength ? target : "invalid_char", resultCode, outputLength);

        if (U_SUCCESS(resultCode))
        {
            pdus = SegmentText(target, pText, SEGMENT_SIZE_UNICODE_MAX, true);
        }
    }
    else
    {
        printf("DecodeText(): Text contents does not appear to be valid UTF-8");
    }
}
else
{
    printf("DecodeText(): Text successfully converted to Latin1");
    std::string newBody(target, outputLength);
    pdus = SegmentText(newBody, pPdu, SEGMENT_SIZE_MAX);
}

The problem is that the ucnv_fromAlgorithmic function returns the error U_INVALID_CHAR_FOUND for the UCS-2 conversion. That makes sense for the ISO-8859-1 attempt, but not for UCS-2.

The other attempt was to use ucnv_convert, which you can see commented out. That function did attempt the conversion, but it didn't fail on the ISO-8859-1 attempt as it should have.

So the question is: does anyone with experience of these functions see something incorrect here, or is my assumption about how this character should convert wrong?
