How to convert a UTF-8 string into Unicode?
Question
I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.
For now, my implementation is the following:
public static string DecodeFromUtf8(this string utf8String)
{
    // read the string as UTF-8 bytes.
    byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);
    // convert them into unicode bytes.
    byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);
    // builds the converted string.
    return Encoding.Unicode.GetString(encodedBytes);
}
I am playing with the word "déjà". I have converted it into UTF-8 through this online tool, and so I started to test my method with the string "dÃ©jÃ ".
Unfortunately, with this implementation the string just remains the same.
Where am I wrong?
Recommended answer
So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.
public static string DecodeFromUtf8(this string utf8String)
{
    // copy the string as UTF-8 bytes.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i = 0; i < utf8String.Length; ++i) {
        //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }
    return Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length);
}
DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà
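The loop above leaves its range check commented out, so a char above 255 would be silently truncated by the cast. A minimal sketch of a stricter variant follows, assuming you would rather fail loudly; the class name Utf8Repair and the method name DecodeFromUtf8Strict are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text;

public static class Utf8Repair
{
    // Hypothetical helper: same idea as DecodeFromUtf8, but it rejects
    // any char that cannot be a single UTF-8 code unit (a byte).
    public static string DecodeFromUtf8Strict(string utf8String)
    {
        byte[] utf8Bytes = new byte[utf8String.Length];
        for (int i = 0; i < utf8String.Length; ++i)
        {
            char c = utf8String[i];
            if (c > 0xFF)
            {
                throw new ArgumentException(
                    $"Char U+{(int)c:X4} at index {i} does not fit in a byte.",
                    nameof(utf8String));
            }
            utf8Bytes[i] = (byte)c;
        }

        // throwOnInvalidBytes: true makes malformed UTF-8 raise an exception
        // instead of being replaced with U+FFFD.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        return strictUtf8.GetString(utf8Bytes);
    }
}
```

Calling Utf8Repair.DecodeFromUtf8Strict("d\u00C3\u00A9j\u00C3\u00A0") should give back "déjà", while a string containing, say, U+20AC would be rejected rather than quietly corrupted.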
This is easy, however it would be best to find the root cause: the location where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).
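To see how such a string comes about, here is a small sketch that reproduces the mistake on purpose. Encoding.Default varies by platform (the system ANSI code page on .NET Framework, UTF-8 on .NET Core and later), so this sketch assumes ISO-8859-1 as a stand-in for the wrong encoding; the names MojibakeDemo and Mangle are hypothetical.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: encodes text as UTF-8, then decodes the bytes
    // with the wrong (single-byte) encoding, as the buggy code path would.
    public static string Mangle(string original)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);

        // ISO-8859-1 maps every byte 0x00..0xFF to the char with the same
        // value, so each UTF-8 code unit ends up in its own 16-bit char.
        Encoding wrong = Encoding.GetEncoding("ISO-8859-1");
        return wrong.GetString(utf8Bytes);
    }
}
```

MojibakeDemo.Mangle("d\u00E9j\u00E0") yields "d\u00C3\u00A9j\u00C3\u00A0", which is exactly the input the repair method above expects.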
Alternatively, if you're sure you know the incorrect encoding which was used to produce the string, and that incorrect encoding transformation was lossless (usually the case if the incorrect encoding is a single byte encoding), then you can simply do the inverse encoding step to get the original UTF-8 data, and then you can do the correct conversion from UTF-8 bytes:
public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
{
    // the inverse of `mistake.GetString(originalBytes);`
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);
}
// On .NET Core / .NET 5+, code page 1252 is only available after calling
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding.GetEncoding(1252), Encoding.UTF8); // déjà
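Before relying on UndoEncodingMistake, it may help to check that the suspected wrong encoding can at least re-encode the mangled string without substitutions. The sketch below is one possible check, with the hypothetical names EncodingRoundTrip and IsReversible. It is a necessary condition only: it cannot detect bytes that were already replaced when the string was first mis-decoded.

```csharp
using System;
using System.Text;

public static class EncodingRoundTrip
{
    // Hypothetical helper: true when `mistake` turns the mangled string
    // into bytes and back without changing it, i.e. no char was replaced
    // by a fallback such as '?' during GetBytes.
    public static bool IsReversible(string mangledString, Encoding mistake)
    {
        byte[] bytes = mistake.GetBytes(mangledString);
        return mistake.GetString(bytes) == mangledString;
    }
}
```

For example, the mangled "d\u00C3\u00A9j\u00C3\u00A0" passes with ISO-8859-1 but fails with Encoding.ASCII, because ASCII cannot represent U+00C3 and substitutes it.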