How to convert a UTF-8 string into Unicode?
Question
I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.
Currently, my implementation is as follows:
public static string DecodeFromUtf8(this string utf8String)
{
    // read the string as UTF-8 bytes.
    byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);
    // convert them into unicode bytes.
    byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);
    // builds the converted string.
    return Encoding.Unicode.GetString(encodedBytes);
}
I am playing with the word "déjà". I converted it into UTF-8 through this online tool, and then started testing my method with the string "dÃ©jÃ " (that is, "d\u00C3\u00A9j\u00C3\u00A0").
Unfortunately, with this implementation the string just remains the same.
Where am I going wrong?
Recommended answer
So the issue is that the UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the resulting UTF-8 byte sequence into UTF-16.
public static string DecodeFromUtf8(this string utf8String)
{
    // copy each 16-bit code unit of the string into a byte.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i = 0; i < utf8String.Length; ++i)
    {
        //Debug.Assert(utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }
    // decode the recovered bytes as UTF-8 into a regular (UTF-16) string.
    return Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length);
}
DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà
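The "verify that each code unit is within the range of a byte" step can also be delegated to the framework. The following is a minimal sketch, not the answer's own method: it assumes the garbled string was produced by a one-to-one byte-to-char mapping (ISO-8859-1), and the names MojibakeRepair and DecodeFromUtf8Strict are hypothetical. It also shows that a plain UTF-8 encode/decode round trip returns its input unchanged, which is consistent with the string "remaining the same" in the question.

```csharp
using System;
using System.Text;

public static class MojibakeRepair
{
    // Hypothetical helper name; a stricter variant of the loop shown above.
    public static string DecodeFromUtf8Strict(string garbled)
    {
        // ISO-8859-1 maps U+0000..U+00FF one-to-one onto bytes 0x00..0xFF.
        // With the exception fallback, GetBytes throws EncoderFallbackException
        // for any char above U+00FF instead of silently substituting it.
        Encoding latin1 = Encoding.GetEncoding(
            "ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        byte[] utf8Bytes = latin1.GetBytes(garbled);

        // Decode the recovered bytes as UTF-8 into a regular .NET string.
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        string garbled = "d\u00C3\u00A9j\u00C3\u00A0";

        // The repaired string equals "déjà" (written with escapes here).
        Console.WriteLine(DecodeFromUtf8Strict(garbled) == "d\u00E9j\u00E0"); // True

        // Encoding to UTF-8 and decoding again is an identity operation,
        // so it cannot undo the garbling.
        string roundTrip = Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(garbled));
        Console.WriteLine(roundTrip == garbled); // True
    }
}
```

Throwing on out-of-range characters, rather than truncating with a (byte) cast, makes it obvious when the input was not garbled in the assumed way.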
This is easy; however, it would be best to find the root cause: the place where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is code that converts bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).
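To illustrate that root cause, here is a small sketch that reproduces the garbling by decoding UTF-8 bytes with a single-byte encoding. It uses ISO-8859-1 as a stand-in for the wrong encoding, which is an assumption: Encoding.Default depends on the runtime (on .NET Core and later it is UTF-8, so the exact call quoted above would only misbehave on .NET Framework with a single-byte system code page). The class name MojibakeDemo is hypothetical.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    public static void Main()
    {
        // "déjà" encoded as UTF-8 is six bytes: 64 C3 A9 6A C3 A0.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("d\u00E9j\u00E0");

        // Wrong: a single-byte encoding turns every byte into its own char.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
        string wrong = latin1.GetString(utf8Bytes);

        // Right: decode the bytes with the encoding they were written in.
        string right = Encoding.UTF8.GetString(utf8Bytes);

        Console.WriteLine(wrong == "d\u00C3\u00A9j\u00C3\u00A0"); // True
        Console.WriteLine(right == "d\u00E9j\u00E0");             // True
        Console.WriteLine(wrong.Length);                          // 6
        Console.WriteLine(right.Length);                          // 4
    }
}
```

Fixing the decode at the point where the bytes first become a string removes the need for a repair step later.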