How to convert UTF-8 to std::string?

Question

I am working on code that receives a cpprest SDK response. The response is JSON and contains a Base64-encoded payload. Here is my code snippet:

typedef std::wstring string_t; // defined in basic_types.h in the cpprest lib

void demo() {
    http_response response;
    // code to handle the response ...
    json::value output = response.extract_json().get(); // extract_json() returns a task, so wait for it
    string_t payload = output.at(L"payload").as_string();
    std::vector<unsigned char> base64_encoded_payload = conversions::from_base64(payload);
    std::string utf8_payload(base64_encoded_payload.begin(), base64_encoded_payload.end()); // in the debugger the Japanese chars look garbled
    string_t utf16_payload = utf8_to_utf16(utf8_payload); // in the debugger the Japanese chars look fine here
    // Next I need to process the payload, which is XML.
    // I have an API available to process the XML, and it takes a string.
    processXML(utf16_payload); // need to convert utf16_payload to a string here
}

I also tried this, and str still contains garbled characters:

#include <codecvt>  // for codecvt_utf8_utf16
#include <locale>   // for wstring_convert
#include <string>   // for string, wstring
void wstr2str(void) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conversion;
    std::wstring japanese = L"北島 美奈";
    std::string str = conversion.to_bytes(japanese); //str is garbled:(
}

My question is: can UTF-8 text containing Japanese characters be converted to std::string without being garbled?

Update: I gained access to the processXML() code, changed the input argument type to std::wstring, and it worked. It turned out that while the XML was being created, the code was converting the std::string to a wstring, and that conversion was not producing good results.

void processXML(std::wstring xmlStrBuf) { // changed xmlStrBuf to wstring and it worked
    // more code
    CComBSTR xmlBuff = xmlStrBuf.c_str();
    VARIANT_BOOL bSuccess = false;
    xmlDoc->loadXML(xmlBuff, &bSuccess);
    // more code
}

Thanks for the answers; the point that a string is only storage was especially helpful.

Recommended answer

You are confusing different concepts here.

Storage

This is how we save/store/hold our data. A std::string is a collection of chars, which are bytes. A std::wstring is a collection of wchar_ts, which are sometimes 2-byte-wide values (but this is not guaranteed: wchar_t is 16 bits on Windows and typically 32 bits on Linux and macOS).
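To make the "storage only" point concrete, here is a minimal sketch. The helper names storage_bytes and kitajima_utf8 are made up for illustration; the UTF-8 bytes are written as explicit hex escapes so the result does not depend on how the source file is saved.

```cpp
#include <cstddef>
#include <string>

// Number of bytes a string's characters occupy, regardless of what
// those bytes are supposed to mean.
template <typename CharT>
std::size_t storage_bytes(const std::basic_string<CharT>& s) {
    return s.size() * sizeof(CharT);
}

// The two characters "北島" spelled out as their six UTF-8 bytes.
// std::string neither knows nor checks that this is UTF-8.
inline std::string kitajima_utf8() {
    return std::string("\xE5\x8C\x97\xE5\xB3\xB6");
}
```

Here kitajima_utf8().size() is 6, not 2: size() counts chars (bytes), not characters. For a std::wstring the same helper reports size() times sizeof(wchar_t), which differs between platforms.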

Encoding

This is what the data means, and how it should be interpreted. A std::string, a collection of bytes, could hold UTF-8, or UTF-16, or UTF-32, or ASCII, or ShiftJIS, or Morse code, or a JPEG, or a movie, or my DNA (lucky string!).

There are some strong conventions in play in the world. For example, on Windows, a std::wstring is generally accepted to hold UTF-16 (because the two-byte storage is convenient for this, and also because that's how the Windows API does it).

Newer versions of C++ give us things like std::u16string and std::u32string as well, which still do not directly have any notion of encoding, but are intended to be used for UTF-16 and UTF-32 respectively; their names make that intention more obvious to readers of code. C++20 introduced std::u8string, which is intended to signify a UTF-8 encoded string (and is otherwise more or less like a std::string).
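As a small illustration of how these types count code units rather than characters, consider the sketch below. The function names are hypothetical; the text is U+5317 (北) followed by U+1F600, an emoji outside the Basic Multilingual Plane, written with universal character names.

```cpp
#include <string>

// Two code points: U+5317 and U+1F600.
// In UTF-16 the second one needs a surrogate pair (two code units).
inline std::u16string sample_utf16() { return u"\u5317\U0001F600"; }

// In UTF-32 every code point is exactly one code unit.
inline std::u32string sample_utf32() { return U"\u5317\U0001F600"; }
```

sample_utf16().size() is 3 and sample_utf32().size() is 2, even though both hold the same two characters. The container types themselves still enforce nothing; only the literal prefixes decide what goes in.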

But these are just conventions. Nothing about the type std::string says "UTF-8" or any other thing. It doesn't know about or care about or enforce any encoding. It just stores bytes.

So, your question about "converting UTF-8 to std::string" does not really make any sense; it's like asking how to convert a road into a car.

So what do I do?

Well, Base64 is also not an encoding. Well, actually, it totally is, but it's an encoding on top of the string encoding. It's a way of transmitting/escaping/sanitising the raw bytes, not a way of describing how to interpret them later. Asking cpprest to convert from Base64 just changes the way the raw bytes are provided. That's why it gives you a std::vector<unsigned char> rather than a std::string. Although (as discussed above) std::string doesn't care about encoding, we sometimes use a vector of bytes to really, properly, completely say that "this collection does not have any particular encoding, so please don't try to guess from convention what the encoding is in this use case; all it knows is that it is a bunch of bytes". This is down to opinion. Some people will still use a std::string for that; the authors of cpprest decided not to.
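To show that Base64 decoding is purely a byte-level step, here is a rough stand-alone sketch. decode_base64 and bytes_to_string are hypothetical helpers written for this illustration, not part of cpprest, and the decoder simply stops at padding or at any character outside the standard alphabet.

```cpp
#include <string>
#include <vector>

// Decode standard Base64 into raw bytes. No text encoding is involved:
// six bits go in per character, eight bits come out per byte.
std::vector<unsigned char> decode_base64(const std::string& in) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::vector<unsigned char> out;
    unsigned int buffer = 0;  // accumulates 6 bits per input character
    int bits = 0;             // number of not-yet-emitted bits in buffer
    for (char c : in) {
        const std::string::size_type pos = alphabet.find(c);
        if (pos == std::string::npos) break;  // '=' padding or invalid input
        buffer = (buffer << 6) | static_cast<unsigned int>(pos);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<unsigned char>((buffer >> bits) & 0xFF));
        }
    }
    return out;
}

// Copy the bytes into a std::string unchanged. This is a change of
// container, not a change of encoding.
std::string bytes_to_string(const std::vector<unsigned char>& bytes) {
    return std::string(bytes.begin(), bytes.end());
}
```

Decoding "5YyX5bO2" yields the six bytes E5 8C 97 E5 B3 B6. Whether those bytes are UTF-8 (in which case they spell 北島) is something only the producer of the payload can tell you.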

The point is that the use of the function from_base64 cannot tell us anything about the encoding of the text that you've retrieved. For that, we have to go back to the documentation for the text. We have no access to that, and you did not tell us anything about it. If it were just a JSON string, the encoding would be down to the cpprest JSON library and so you'd already be done. However, it's not: it's something packed into a Base64 representation by whoever created the JSON object. Again, that information is not something that you shared with us.

But, based on the variable names you've chosen, the data you're looking at is already UTF-8. You've then attempted to convert it to UTF-16, which is rather the opposite of what you've described you wanted to do.

(Similarly, in your second example, you've taken a std::wstring that [probably] already stores UTF-16 thanks to the L"wide string literal", and called to_bytes on a codecvt_utf8_utf16 converter, which converts UTF-16 to UTF-8. So str most likely does hold correct UTF-8 bytes; it only looks garbled because the debugger displays those bytes using some other, single-byte encoding.)
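One way to check what a UTF-16 to UTF-8 conversion really produced is to print the bytes in hex instead of trusting how a debugger renders them. The sketch below uses two hypothetical helpers, utf16_to_utf8 and to_hex, and takes a std::u16string so that it behaves the same on every platform; it assumes well-formed input and does no error reporting.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Encode UTF-16 code units as UTF-8 bytes (assumes well-formed input;
// an unpaired surrogate is passed through as a 3-byte sequence).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // Combine a high/low surrogate pair into one code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Render bytes as space-separated lowercase hex, with no assumption
// about what encoding they are in.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : bytes) {
        if (!out.empty()) out += ' ';
        out += digits[b >> 4];
        out += digits[b & 0x0F];
    }
    return out;
}
```

For example, to_hex(utf16_to_utf8(u"\u5317\u5cf6")) gives "e5 8c 97 e5 b3 b6", which is the correct UTF-8 for 北島 no matter how a particular debugger chooses to draw those six bytes.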

Instead, why not literally just processXML(utf8_payload);?
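If the XML API really does insist on UTF-16 (as the update to the question suggests), the conversion can happen once, right at that boundary, with the data staying UTF-8 everywhere else. Below is a rough portable sketch of such a conversion. The name utf8_to_utf16_sketch is made up, it returns std::u16string rather than std::wstring, and it only checks the lead/continuation byte structure (it does not reject overlong forms or values above U+10FFFF), so treat it as an illustration rather than a production decoder.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units.
std::u16string utf8_to_utf16_sketch(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        // The lead byte tells us how long the sequence is.
        if (lead < 0x80)              { cp = lead;        len = 1; }
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; len = 2; }
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; len = 3; }
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Split into a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

In the question's code the equivalent step is the existing utf8_to_utf16 call; the point is only that it belongs immediately before the call that needs UTF-16, not scattered through the pipeline.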

General advice

Encoding can be quite complex, although it's significantly easier to deal with once you've wrapped your mind around the basic concepts of all these layers of abstraction. For the future (and for this question, if you wish to clarify it), make sure you are absolutely clear about what encoding your data should have at each stage of its "pipeline": as it gets transmitted from place A to place B, converted from type C to type D, and whatever else. If you want to change the encoding at one of those steps, then do so (though this should be rare!). But before you write any code, make sure you know exactly what you need; otherwise you'll get yourself into a massive tangle.

Eventually you'll start to detect patterns that can help, though. For example, if you were expecting some delicious non-ASCII output and instead see strange text with lots of "Å" characters in it, that's probably UTF-8 that's being interpreted as ASCII by mistake. That's because the sequence that represents a Unicode code point needing more than one byte in UTF-8 often starts with a byte whose numerical value is the same as that of the letter "Å" in ASCII (well, ISO/IEC 8859, but close enough).
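That pattern can be reproduced deliberately. The sketch below (latin1_to_utf8 is a hypothetical helper) treats every input byte as one ISO/IEC 8859-1 character and re-encodes it as UTF-8, which is the kind of thing a program does when it misreads UTF-8 as a single-byte encoding.

```cpp
#include <string>

// Interpret each byte as one ISO/IEC 8859-1 character and encode that
// character as UTF-8. Feeding UTF-8 into this produces mojibake.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    for (unsigned char b : latin1) {
        if (b < 0x80) {
            out += static_cast<char>(b);               // ASCII is unchanged
        } else {
            out += static_cast<char>(0xC0 | (b >> 6)); // two-byte sequence
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```

For instance, the letter "ő" (U+0151) is the two bytes C5 91 in UTF-8. Passing those bytes through latin1_to_utf8 gives C3 85 C2 91: the first pair is "Å" (byte C5 read as a character), followed by an invisible control character.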

Similarly, if you get Japanese and didn't expect it, in my experience that's usually because you've given the computer some bytes and told it they are a UTF-16 string when they were actually UTF-8. You get better at recognising these patterns as you work more, and that helps you fix bugs faster.
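This effect is easy to reproduce as well. The sketch below (reinterpret_as_utf16le is a made-up name) pairs up bytes the way a little-endian UTF-16 reader would, dropping a trailing odd byte.

```cpp
#include <cstddef>
#include <string>

// Read a byte buffer as if it were UTF-16LE: every two bytes become
// one 16-bit code unit, low byte first.
std::u16string reinterpret_as_utf16le(const std::string& bytes) {
    std::u16string out;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned int lo = static_cast<unsigned char>(bytes[i]);
        const unsigned int hi = static_cast<unsigned char>(bytes[i + 1]);
        out += static_cast<char16_t>(lo | (hi << 8));
    }
    return out;
}
```

The four ASCII bytes of "<xml" become the code units U+783C and U+6C6D, both of which fall in the CJK Unified Ideographs block (U+4E00 to U+9FFF). Any pair whose second byte is a lowercase ASCII letter lands somewhere in U+6100 to U+7AFF, which is why misread UTF-8 or ASCII markup so often shows up as a wall of East Asian characters.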

Just last week that last pattern saved me quite a bit of time: I knew immediately that my source data must have been UTF-8, and was therefore able to quickly decide to remove the byte-copy into a std::wstring that I'd been attempting. Examining the bytes in an encoding-agnostic way revealed the "Å" pattern as well, and then that was that. This was important because I had no documentation for the data source and thus no way to just look up what the encoding was supposed to be. I had to guess/deduce it. Hopefully that won't be the case for you here.
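When you do have to guess, one cheap test is whether the bytes are even structurally valid UTF-8. Here is a heuristic sketch; looks_like_utf8 is a hypothetical name, and the function checks only lead and continuation byte patterns, not overlong forms or surrogates. A true result is a hint, not proof, since pure ASCII also passes.

```cpp
#include <cstddef>
#include <string>

// Returns true if the buffer follows the UTF-8 lead/continuation byte
// structure all the way through.
bool looks_like_utf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        if (lead < 0x80)                len = 1;  // ASCII
        else if ((lead & 0xE0) == 0xC0) len = 2;  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx
        else return false;                        // stray continuation or invalid byte
        if (i + len > bytes.size()) return false; // sequence cut short
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false; // must be 10xxxxxx
        }
        i += len;
    }
    return true;
}
```

The UTF-8 bytes E5 8C 97 E5 B3 B6 pass, while the UTF-16LE bytes for the same two characters (17 53 F6 5C) fail, because F6 announces a four-byte sequence that never arrives.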
