How to convert a UTF-8 string into Unicode?

Question

I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.

Currently, my implementation is as follows:

public static string DecodeFromUtf8(this string utf8String)
{
    // read the string as UTF-8 bytes.
    byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);

    // convert them into unicode bytes.
    byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);

    // builds the converted string.
    return Encoding.Unicode.GetString(encodedBytes);
}

I am playing with the word "déjà". I have converted it into UTF-8 through this online tool, and so I started to test my method with the string "dÃ©jÃ " (that is, "d\u00C3\u00A9j\u00C3\u00A0", which ends in a non-breaking space).

Unfortunately, with this implementation the string just remains the same.

Where am I wrong?

Recommended answer

The issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.

public static string DecodeFromUtf8(this string utf8String)
{
    // copy the string as UTF-8 bytes.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i=0;i<utf8String.Length;++i) {
        //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }

    return Encoding.UTF8.GetString(utf8Bytes,0,utf8Bytes.Length);
}

DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà
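
The range check in the method above is only a commented-out assertion, so a char above 255 would be silently truncated by the cast. The following is a minimal sketch of a stricter variant, not part of the original answer: the class name Utf8Repair and the method TryDecodeFromUtf8 are made up for illustration. It assumes you would rather get a false return value than a partially garbled result, and it uses a UTF8Encoding constructed with throwOnInvalidBytes set to true so that malformed byte sequences are rejected instead of being replaced with U+FFFD.

```csharp
using System;
using System.Text;

public static class Utf8Repair
{
    // Hypothetical helper (not from the answer): reports failure instead of
    // truncating chars that do not fit in a byte or accepting malformed UTF-8.
    public static bool TryDecodeFromUtf8(string mangled, out string decoded)
    {
        decoded = null;
        byte[] utf8Bytes = new byte[mangled.Length];
        for (int i = 0; i < mangled.Length; ++i)
        {
            if (mangled[i] > 0xFF)
            {
                return false; // this code unit cannot be a single UTF-8 byte
            }
            utf8Bytes[i] = (byte)mangled[i];
        }

        try
        {
            // Second argument (throwOnInvalidBytes: true) makes the decoder
            // throw on invalid sequences rather than emit U+FFFD.
            var strictUtf8 = new UTF8Encoding(false, true);
            decoded = strictUtf8.GetString(utf8Bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

With this sketch, Utf8Repair.TryDecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0", out var fixedText) should succeed and yield "déjà", while an input containing a character such as "\u20AC" is rejected because it lies outside the byte range.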

This is easy; however, it would be best to find the root cause: the location where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).
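
To make the root cause concrete, here is a small sketch that reproduces the mangled string on purpose. It assumes the wrong encoding was ISO-8859-1 (Latin-1), which may not match what happened in your code; the class MojibakeDemo and the method Mangle are illustrative names only.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: reproduces the bug by decoding UTF-8 bytes
    // with a single-byte encoding (assumed here to be ISO-8859-1).
    public static string Mangle(string original)
    {
        // Correct step: the text is serialized as UTF-8 bytes.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);

        // Mistaken step: each byte is turned into one char on its own.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return latin1.GetString(utf8Bytes);
    }
}
```

Calling MojibakeDemo.Mangle("d\u00E9j\u00E0") should give back "d\u00C3\u00A9j\u00C3\u00A0", which is the same six-character string passed to DecodeFromUtf8 in the answer. Searching a codebase for a GetString call that takes bytes of this kind is usually a good starting point.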

Alternatively, if you're sure you know the incorrect encoding which was used to produce the string, and that incorrect encoding transformation was lossless (usually the case if the incorrect encoding is a single-byte encoding), then you can simply do the inverse encoding step to get the original UTF-8 data, and then do the correct conversion from UTF-8 bytes:

public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
{
    // the inverse of `mistake.GetString(originalBytes);`
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);
}

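// Code page 1252 is Windows-1252; on .NET Core / .NET 5+ call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding.GetEncoding(1252), Encoding.UTF8); // déjà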
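
The "lossless" condition can be checked before relying on the inverse step. The sketch below is an illustration rather than part of the answer: the class EncodingChecks and the method RoundTripsAllBytes are hypothetical names. It tests whether every byte value from 0 to 255 survives a bytes-to-string-to-bytes round trip through a candidate encoding, which is the property the inverse step depends on.

```csharp
using System;
using System.Text;

public static class EncodingChecks
{
    // Hypothetical helper: true when every byte value 0..255 survives
    // bytes -> string -> bytes through the given encoding.
    public static bool RoundTripsAllBytes(Encoding encoding)
    {
        byte[] allBytes = new byte[256];
        for (int i = 0; i < allBytes.Length; ++i)
        {
            allBytes[i] = (byte)i;
        }

        string text = encoding.GetString(allBytes);
        byte[] back = encoding.GetBytes(text);

        if (back.Length != allBytes.Length)
        {
            return false; // some byte expanded or collapsed during the trip
        }
        for (int i = 0; i < allBytes.Length; ++i)
        {
            if (back[i] != allBytes[i])
            {
                return false; // this byte value was altered
            }
        }
        return true;
    }
}
```

For example, EncodingChecks.RoundTripsAllBytes(Encoding.GetEncoding("iso-8859-1")) should return true, whereas Encoding.ASCII fails the check because bytes above 127 are replaced during decoding. If the suspected encoding fails, UndoEncodingMistake may not recover the original bytes for every input.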
