Encoding issue when handling a string that contains a "question mark" (�)

Problem description

I am parsing some web content in the response from an HttpWebRequest.

The web content uses the ISO-8859-1 charset. After parsing it and extracting the word I need from the response, I get a string containing a question mark, like ESPA�OL, and I want to know the right way to turn it back into a readable string.

So I tried converting the word's encoding to UTF-8, like this (I am wondering whether UTF-8 could solve my problem):

string word = "ESPA�OL";

Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf = Encoding.GetEncoding("UTF-8");

byte[] isoBytes = iso.GetBytes(word);
byte[] utfBytes = Encoding.Convert(iso, utf, isoBytes);

string utfWord = utf.GetString(utfBytes);

Console.WriteLine(utfWord);

However, the utfWord variable outputs ESPA?OL, which is still wrong. The correct output should be ESPAÑOL.

Can someone please point me in the right direction to solve this, if possible?

Recommended answer

The word in question is "ESPAÑOL". It can be encoded correctly in ISO-8859-1, since every character in the word is representable in ISO-8859-1.

You can see this for yourself using the following simple program:

using System;
using System.Diagnostics;
using System.Text;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Encoding enc = Encoding.GetEncoding("ISO-8859-1");
            string original = "ESPAÑOL";
            byte[] iso_8859_1 = enc.GetBytes(original);
            string roundTripped = enc.GetString(iso_8859_1);
            Debug.Assert(original == roundTripped);
            Console.WriteLine(roundTripped);
        }
    }
}

What this tells you is that you need to properly diagnose where the erroneous character comes from. By the time you have a � character, it is too late: the information has been lost. The presence of the � character indicates that, at some point, a conversion was performed into a character set that did not contain the character Ñ.
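
To make the "too late" point concrete, here is a minimal sketch of one way the damage can happen. It assumes the response bytes were ISO-8859-1 but were decoded as UTF-8 somewhere upstream; that cause is a guess, not something the question confirms. The class and method names are hypothetical, and the Ñ is written as \u00D1 so the example does not depend on how the source file itself is saved.

```csharp
using System;
using System.Text;

static class LossDemo
{
    // Hypothetical helper: decodes ISO-8859-1 bytes with the wrong encoding
    // (UTF-8), then tries to "convert back" the way the question does.
    public static string DecodeWrongThenConvert(byte[] latin1Bytes)
    {
        Encoding iso = Encoding.GetEncoding("ISO-8859-1");

        // Byte 0xD1 (Ñ in ISO-8859-1) is not a valid UTF-8 sequence on its
        // own, so the decoder substitutes U+FFFD here. The original byte
        // value is gone from this point on.
        string damaged = Encoding.UTF8.GetString(latin1Bytes);

        // U+FFFD has no ISO-8859-1 representation, so re-encoding can only
        // substitute a fallback character. It cannot bring the Ñ back.
        byte[] reEncoded = iso.GetBytes(damaged);
        return iso.GetString(reEncoded);
    }

    static void Main()
    {
        byte[] wire = Encoding.GetEncoding("ISO-8859-1").GetBytes("ESPA\u00D1OL");
        string result = DecodeWrongThenConvert(wire);

        // Compare in code instead of trusting how the console renders text.
        Console.WriteLine(result == "ESPA\u00D1OL");
    }
}
```

No later call to Encoding.Convert can repair the string, because the substitution happened in the first decode. The fix has to be applied at that step.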

A conversion from ISO-8859-1 to a Unicode encoding will handle "ESPAÑOL" correctly, because that word can be encoded in ISO-8859-1.

The most likely explanation is that somewhere along the way, the text "ESPAÑOL" is being converted to a character set that does not contain the letter Ñ.
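
If the faulty step turns out to be the place where the response bytes first become a string, one possible fix is to decode the response stream with the charset the server actually uses. The following is a sketch under that assumption, not the asker's actual code. ReadAllText is a hypothetical helper, and a MemoryStream stands in for the stream returned by HttpWebResponse.GetResponseStream() so the example runs without a network call.

```csharp
using System;
using System.IO;
using System.Text;

static class ResponseDecoding
{
    // Hypothetical helper: reads the whole stream as text, decoding the
    // bytes with the named charset instead of the StreamReader default.
    public static string ReadAllText(Stream body, string charset)
    {
        Encoding encoding = Encoding.GetEncoding(charset);
        using (var reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        // Stand-in for the response body: "ESPAÑOL" encoded as ISO-8859-1,
        // where Ñ is the single byte 0xD1.
        byte[] wire = { 0x45, 0x53, 0x50, 0x41, 0xD1, 0x4F, 0x4C };

        using (var body = new MemoryStream(wire))
        {
            string word = ReadAllText(body, "ISO-8859-1");

            // Compare in code instead of trusting how the console renders text.
            Console.WriteLine(word == "ESPA\u00D1OL"); // prints "True"
        }
    }
}
```

With a real response, the charset name could come from the HttpWebResponse.CharacterSet property rather than a hard-coded string, provided the server reports it correctly.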
