How to convert a UTF-8 string into Unicode?

Problem description

I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.

For now, my implementation is as follows:

public static string DecodeFromUtf8(this string utf8String)
{
    // read the string as UTF-8 bytes.
    byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);

    // convert them into unicode bytes.
    byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);

    // builds the converted string.
    return Encoding.Unicode.GetString(encodedBytes);
}

I am playing with the word "déjà". I converted it into UTF-8 through this online tool, and then started to test my method with the string "dÃ©jÃ " (the last character is a no-break space, U+00A0).

Unfortunately, with this implementation the string just remains the same.

Where am I wrong?

Recommended answer

So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the resulting UTF-8 byte sequence into UTF-16.

public static string DecodeFromUtf8(this string utf8String)
{
    // copy the string as UTF-8 bytes.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i=0;i<utf8String.Length;++i) {
        //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }

    return Encoding.UTF8.GetString(utf8Bytes,0,utf8Bytes.Length);
}

DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà
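The range check is only present as a commented-out assertion above. As a rough sketch of what an enforced check could look like, the variant below throws instead of silently truncating characters. The class name `Utf8Repair` and the method name `DecodeFromUtf8Strict` are made up for illustration, and using a `UTF8Encoding` with `throwOnInvalidBytes` set is my own assumption about what "strict" should mean here, not part of the answer.

```csharp
using System;
using System.Text;

public static class Utf8Repair
{
    // Hypothetical stricter variant of DecodeFromUtf8: rejects input that
    // cannot be a sequence of UTF-8 code units stored one per char.
    public static string DecodeFromUtf8Strict(string utf8String)
    {
        if (utf8String == null) throw new ArgumentNullException(nameof(utf8String));

        byte[] utf8Bytes = new byte[utf8String.Length];
        for (int i = 0; i < utf8String.Length; ++i)
        {
            char c = utf8String[i];
            // A char above 0xFF cannot have come from a single byte.
            if (c > 0xFF)
            {
                throw new ArgumentException(
                    $"Character U+{(int)c:X4} at index {i} does not fit in a byte.",
                    nameof(utf8String));
            }
            utf8Bytes[i] = (byte)c;
        }

        // throwOnInvalidBytes: true makes malformed UTF-8 raise an exception
        // instead of being replaced with U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        return strictUtf8.GetString(utf8Bytes);
    }
}
```

Calling `Utf8Repair.DecodeFromUtf8Strict("d\u00C3\u00A9j\u00C3\u00A0")` should give back "déjà", while a string that already contains a character above U+00FF is rejected rather than mangled further.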

This is easy, however it would be best to find the root cause: the location where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).
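To make the root cause concrete, here is a small sketch that reproduces the garbled string on purpose and then reverses it. It uses ISO-8859-1 as a stand-in for the "wrong" encoding, which is an assumption on my part: `Encoding.Default` is the system ANSI code page on .NET Framework and UTF-8 on .NET Core and later, so what it does depends on the runtime and machine. The names `MojibakeDemo`, `Misdecode` and `Repair` are invented for this example.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Assumed stand-in for the wrong encoding: ISO-8859-1 maps every byte
    // 0x00-0xFF to the char with the same numeric value.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Reproduces the bug: valid UTF-8 bytes decoded with the wrong encoding.
    public static string Misdecode(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Latin1.GetString(utf8Bytes);
    }

    // Reverses it: recover the original bytes, then decode them as UTF-8.
    // Chars above U+00FF cannot be recovered and become '?' bytes.
    public static string Repair(string garbled)
    {
        byte[] originalBytes = Latin1.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

`MojibakeDemo.Misdecode("déjà")` yields the same five-character-wide garbage used in the question ("d\u00C3\u00A9j\u00C3\u00A0"), and `Repair` turns it back into "déjà". The better fix is still to decode with `Encoding.UTF8` at the point where the bytes first become a string, so that no repair step is needed.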
