libxml2 xmlChar * to std::wstring

Question

libxml2 seems to store all its strings in UTF-8, as xmlChar *.

/**
 * xmlChar:
 *
 * This is a basic byte in an UTF-8 encoded string.
 * It's unsigned allowing to pinpoint case where char * are assigned
 * to xmlChar * (possibly making serialization back impossible).
 */
typedef unsigned char xmlChar;

As libxml2 is a C library, it provides no routine for getting a std::wstring out of an xmlChar *. I'm wondering whether the prudent way to convert an xmlChar * to a std::wstring in C++11 is to use the mbstowcs C function, via something like this (work in progress):

std::wstring xmlCharToWideString(const xmlChar *xmlString) {
    if(!xmlString){abort();} //provided string was null
    int charLength = xmlStrlen(xmlString); //excludes null terminator
    wchar_t *wideBuffer = new wchar_t[charLength];
    size_t wcharLength = mbstowcs(wideBuffer, (const char *)xmlString, charLength);
    if(wcharLength == (size_t)(-1)){abort();} //mbstowcs failed
    std::wstring wideString(wideBuffer, wcharLength);
    delete[] wideBuffer;
    return wideString;
}

Edit: just an FYI, I'm well aware of what xmlStrlen returns: the number of xmlChar units used to store the string, that is, the number of unsigned char rather than the number of characters. Naming the variable byteLength would have been less confusing, but I thought charLength read more clearly next to wcharLength. As for the correctness of the code, I believe wideBuffer will always be at least as large as needed to hold the converted string, since characters that require more space than a wchar_t will be truncated (I think).
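To make the byte-count versus character-count distinction concrete, here is a minimal sketch that counts both for the same string. It assumes well-formed UTF-8, and it redeclares xmlChar locally so it compiles without libxml2; utf8ByteLength and utf8CodePointCount are hypothetical helper names, not libxml2 functions.

```cpp
#include <cstddef>
#include <cstring>

// Stand-in for libxml2's typedef so the sketch compiles without the library.
typedef unsigned char xmlChar;

// Number of bytes before the terminator (the quantity the question stores
// in charLength).
std::size_t utf8ByteLength(const xmlChar *s)
{
    return std::strlen(reinterpret_cast<const char *>(s));
}

// Number of code points, assuming well-formed UTF-8: every byte that is not
// a continuation byte (bit pattern 10xxxxxx) starts a new code point.
std::size_t utf8CodePointCount(const xmlChar *s)
{
    std::size_t count = 0;
    for (; *s != 0; ++s)
    {
        if ((*s & 0xC0) != 0x80) { ++count; }
    }
    return count;
}
```

For the bytes 'a', 0xE2, 0x82, 0xAC, 'b' (the euro sign between two ASCII letters), the byte length is 5 and the code point count is 3. On platforms where wchar_t is 16 bits wide, code points above U+FFFF need two wchar_t units, so the code point count is still not always the wchar_t count there.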

Recommended answer

xmlStrlen() returns the number of UTF-8 code units (bytes) in the xmlChar* string. That is generally not the number of wchar_t code units needed once the data is converted, so do not treat the xmlStrlen() result as the length of your wchar_t string. Call std::mbstowcs() once with a null destination to get the correct length, allocate the memory, then call mbstowcs() again to fill it. Returning the required length for a null destination is a POSIX extension that common implementations support, not something the C standard guarantees. You will also have to use std::setlocale() to tell mbstowcs() to use UTF-8 (messing with the locale may not be a good idea, especially if multiple threads are involved). For example:

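#include <clocale>            // setlocale
#include <cstdlib>            // mbstowcs, abort
#include <string>
#include <libxml/xmlstring.h> // xmlChar, xmlStrlen

std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null

    std::wstring wideString;

    if (xmlStrlen(xmlString) > 0)
    {
        // setlocale() may reuse the buffer it returns, so copy the current name
        const char *current = setlocale(LC_CTYPE, NULL);
        std::string origLocale = current ? current : "C";

        // the locale name is platform-dependent; NULL means it is unavailable
        if (!setlocale(LC_CTYPE, "en_US.UTF-8")) { abort(); }

        // null destination: returns the required length (POSIX extension),
        // excluding the null terminator
        size_t wcharLength = mbstowcs(NULL, (const char*) xmlString, 0);
        if (wcharLength != (size_t)(-1))
        {
            wideString.resize(wcharLength);
            mbstowcs(&wideString[0], (const char*) xmlString, wcharLength);
        }

        setlocale(LC_CTYPE, origLocale.c_str());
        if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    }

    return wideString;
}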

A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert instead, so you do not have to deal with locales (note that both facilities were deprecated in C++17):

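#include <codecvt>            // std::codecvt_utf8
#include <cstdlib>            // abort
#include <locale>             // std::wstring_convert
#include <stdexcept>          // std::range_error
#include <string>
#include <libxml/xmlstring.h> // xmlChar

std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null
    try
    {
        // converts the null-terminated UTF-8 bytes without touching the locale
        std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
        return conv.from_bytes((const char*)xmlString);
    }
    catch(const std::range_error& e)
    {
        abort(); //wstring_convert failed
    }
}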
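If the byte length is already known (for example from xmlStrlen()), the two-pointer overload of from_bytes() can convert exactly that range instead of scanning for the terminator. The following is only a sketch under that assumption: utf8BytesToWide is a hypothetical helper name, it takes plain unsigned char so it compiles without libxml2, and it lets std::range_error propagate to the caller rather than calling abort().

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper (not part of libxml2): converts exactly byteLength
// UTF-8 bytes to a wide string. std::wstring_convert throws
// std::range_error if the conversion fails.
std::wstring utf8BytesToWide(const unsigned char *bytes, std::size_t byteLength)
{
    const char *first = reinterpret_cast<const char *>(bytes);
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv;
    return conv.from_bytes(first, first + byteLength);
}
```

A caller holding an xmlChar* could pass it directly, since xmlChar is a typedef for unsigned char, together with the length from xmlStrlen(). Propagating the exception instead of aborting is a design choice of this sketch, not something the answer prescribes.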

An alternative option is to use an actual Unicode library, such as ICU or iconv, to handle Unicode conversions.
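As a rough illustration of the iconv route, the sketch below converts a null-terminated UTF-8 string with the POSIX iconv() API. Several things here are assumptions rather than guarantees: the encoding name "WCHAR_T" is accepted by glibc and GNU libiconv but is not portable everywhere, the second parameter of iconv() is declared as char** (some older platforms use const char**), and utf8ToWideIconv is a hypothetical helper name.

```cpp
#include <iconv.h>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper: UTF-8 to std::wstring via iconv.
std::wstring utf8ToWideIconv(const unsigned char *utf8)
{
    if (!utf8) { throw std::invalid_argument("null input"); }

    std::size_t inLeft = std::strlen(reinterpret_cast<const char *>(utf8));
    if (inLeft == 0) { return std::wstring(); }

    // Assumption: the implementation knows the "WCHAR_T" encoding name.
    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t)(-1)) { throw std::runtime_error("iconv_open failed"); }

    // Each wide unit consumes at least one input byte, so one wchar_t per
    // byte is always enough room for the output.
    std::wstring out(inLeft, L'\0');
    char *inPtr = const_cast<char *>(reinterpret_cast<const char *>(utf8));
    char *outPtr = reinterpret_cast<char *>(&out[0]);
    std::size_t outLeft = out.size() * sizeof(wchar_t);

    std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (std::size_t)(-1))
    {
        throw std::runtime_error("invalid or incomplete UTF-8 input");
    }

    // Drop the unused tail of the buffer.
    out.resize(out.size() - outLeft / sizeof(wchar_t));
    return out;
}
```

Unlike the setlocale() approach, this keeps the conversion state in the iconv descriptor rather than in the process-wide locale. On systems where iconv lives in a separate library, linking with -liconv may be required.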
