Why mask a char with 0xFF when converting narrow string to wide string?
Problem description
Consider this function for converting a narrow string to a wide string:
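#include <codecvt>
#include <locale>
#include <stdexcept>
#include <string>

std::wstring convert(const std::string& input)
{
    try
    {
        // First attempt: treat the input as UTF-8.
        std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
        return converter.from_bytes(input);
    }
    catch(std::range_error& e)
    {
        // Fallback: the input is not valid UTF-8, so copy it byte by byte.
        std::size_t length = input.length();
        std::wstring result;
        result.reserve(length);
        for(std::size_t i = 0; i < length; i++)
        {
            result.push_back(input[i] & 0xFF);
        }
        return result;
    }
}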
I am having difficulty understanding the need for this expression in the fallback path:
result.push_back(input[i] & 0xFF);
Why is each character in the string being masked with 0xFF (0b11111111)?
Recommended answer
Masking with 0xFF maps any negative values into the range 0-255.
This is reasonable if, for example, your platform's char is an 8-bit signed type representing ISO-8859-1 characters, and your wchar_t represents UCS-2, UTF-16 or UCS-4.
Without this correction (or something similar, such as casting to unsigned char, or going through std::byte and std::to_integer), you would find that characters are sign-extended when promoted to the wider type.
Example: 0xa9 (© in Unicode and Latin-1, -87 in signed 8-bit) would become \uffa9 (with a 16-bit wchar_t) instead of \u00a9.
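The sign extension can be checked at compile time with a minimal sketch. The helper names here are hypothetical, and `signed char` and `std::uint16_t` stand in for a signed `char` and a 16-bit `wchar_t` so that the result does not depend on the platform. An 8-bit two's-complement char is assumed.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only.

// Widen without any correction: -87 converts modulo 2^16 and becomes 0xFFA9.
constexpr std::uint16_t widen_unmasked(signed char c)
{
    return static_cast<std::uint16_t>(c);
}

// Widen with the mask from the question: c is promoted to int,
// then only the low 8 bits are kept.
constexpr std::uint16_t widen_masked(signed char c)
{
    return static_cast<std::uint16_t>(c & 0xFF);
}

// Widen via unsigned char: the value is reduced modulo 256 first.
constexpr std::uint16_t widen_cast(signed char c)
{
    return static_cast<unsigned char>(c);
}

constexpr signed char copyright_latin1 = -87;  // byte 0xA9 seen as a signed 8-bit value

static_assert(widen_unmasked(copyright_latin1) == 0xFFA9, "sign-extended");
static_assert(widen_masked(copyright_latin1) == 0x00A9, "mask restores the byte value");
static_assert(widen_cast(copyright_latin1) == 0x00A9, "cast restores the byte value");
```

Both corrections give the same code unit; only the uncorrected conversion produces the spurious high bits.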
I think it's clearer to convert the char to an unsigned char: that works for any size of char and conveys the intent better. You can change that expression directly, or create a codecvt subclass that gives a name to what you're doing.
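Changing the expression directly might look like the following sketch of the fallback loop on its own. The function name `widen_latin1` is hypothetical and is not part of the original code.

```cpp
#include <string>

// Hypothetical helper: widen each byte as if it were an ISO-8859-1 character.
std::wstring widen_latin1(const std::string& input)
{
    std::wstring result;
    result.reserve(input.size());
    for (char c : input) {
        // Convert to unsigned char first so bytes >= 0x80 are not sign-extended.
        result.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    }
    return result;
}
```

The catch block in the question could then simply return `widen_latin1(input)`.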
Here's how to write and use a minimal codecvt (for narrow → wide conversion only):
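#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

// Note: std::wstring_convert and <codecvt> are deprecated since C++17.

// Minimal Latin-1 facet: each input byte maps to the code point of the same value.
class codecvt_latin1 : public std::codecvt<wchar_t, char, std::mbstate_t>
{
protected:
    result do_in(std::mbstate_t&,
                 const char* from,
                 const char* from_end,
                 const char*& from_next,
                 wchar_t* to,
                 wchar_t* to_end,
                 wchar_t*& to_next) const override
    {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        while (from != from_end && to != to_end)
            *to++ = static_cast<unsigned char>(*from++);
        from_next = from;
        to_next = to;
        // Report partial if the output buffer filled before the input ran out.
        return from == from_end ? result::ok : result::partial;
    }
};

std::wstring convert(const std::string& input)
{
    using codecvt_utf8 = std::codecvt_utf8<wchar_t>;
    try {
        // First attempt: treat the input as UTF-8.
        return std::wstring_convert<codecvt_utf8>().from_bytes(input);
    } catch (std::range_error&) {
        // Not valid UTF-8: fall back to Latin-1.
        return std::wstring_convert<codecvt_latin1>{}.from_bytes(input);
    }
}

int main()
{
    std::locale::global(std::locale{""});

    // UTF-8: £© おはよう
    // (Before C++20 a u8 literal is an array of char; in C++20 it is char8_t,
    // so this call would need a plain literal instead.)
    std::wcout << convert(u8"\xc2\xa3\xc2\xa9 おはよう") << std::endl;
    // Latin-1: Â£© (the trailing \xa9 is not valid UTF-8, so the fallback runs)
    std::wcout << convert("\xc2\xa3\xa9") << std::endl;
}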
Output:
£© おはよう
Â£©