Encoding issue when handling a string that contains a "question mark" (�)
Question
I am parsing some web content in a response from an HttpWebRequest.
This web content uses the charset ISO-8859-1. After parsing it and extracting the word I need from the response, I get a string that contains a question mark character like this: �. I want to know the right way to transform it back into a readable string.
So I tried converting the word's current encoding to UTF-8 (I am wondering whether UTF-8 could solve my problem), like this:
string word = "ESPA�OL";
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf = Encoding.GetEncoding("UTF-8");
byte[] isoBytes = iso.GetBytes(word);
byte[] utfBytes = Encoding.Convert(iso, utf, isoBytes);
string utfWord = utf.GetString(utfBytes);
Console.WriteLine(utfWord);
However, the utfWord variable outputs ESPA?OL, which is still wrong. The correct output should be ESPAÑOL.
Can someone please point me in the right direction to solve this, if possible?
Recommended answer
The word in question is "ESPAÑOL". It can be encoded correctly in ISO-8859-1, since every character in the word is representable in ISO-8859-1.
You can see this for yourself using the following simple program:
using System;
using System.Diagnostics;
using System.Text;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Encoding enc = Encoding.GetEncoding("ISO-8859-1");
            string original = "ESPAÑOL";
            // Encode to ISO-8859-1 bytes, then decode those bytes back to a string.
            byte[] iso_8859_1 = enc.GetBytes(original);
            string roundTripped = enc.GetString(iso_8859_1);
            // The round trip is lossless because Ñ exists in ISO-8859-1.
            Debug.Assert(original == roundTripped);
            Console.WriteLine(roundTripped);
        }
    }
}
What this tells you is that you need to properly diagnose where the erroneous character comes from. By the time you have a � character, it is too late: the information has been lost. The presence of the � character indicates that, at some point, a conversion was performed into a character set that did not contain the character Ñ.
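To make the information loss concrete, here is a minimal sketch. It assumes the bytes on the wire were ISO-8859-1 and were decoded as UTF-8 somewhere upstream, which is one plausible cause and not something the question confirms. DecodingDemo and Decode are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class DecodingDemo
{
    // Hypothetical helper: decodes raw bytes with the named charset.
    public static string Decode(byte[] raw, string charset)
    {
        return Encoding.GetEncoding(charset).GetString(raw);
    }

    public static void Main()
    {
        // "ESPAÑOL" as ISO-8859-1 bytes; Ñ is the single byte 0xD1.
        byte[] raw = { 0x45, 0x53, 0x50, 0x41, 0xD1, 0x4F, 0x4C };

        // Assumed mistake: 0xD1 followed by 'O' is not valid UTF-8,
        // so the decoder substitutes U+FFFD and the original byte is gone.
        string wrong = Decode(raw, "UTF-8");
        Console.WriteLine(wrong == "ESPA\uFFFDOL");

        // Re-encoding the damaged string cannot bring Ñ back: U+FFFD has no
        // ISO-8859-1 equivalent, so the encoder falls back to '?'.
        byte[] reencoded = Encoding.GetEncoding("ISO-8859-1").GetBytes(wrong);
        Console.WriteLine(Encoding.ASCII.GetString(reencoded));

        // Decoding the original bytes with the right charset works.
        string right = Decode(raw, "ISO-8859-1");
        Console.WriteLine(right == "ESPA\u00D1OL");
    }
}
```

Under that assumption, the second line printed matches the ESPA?OL seen in the question, which is why converting the already-decoded string between encodings cannot help.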
A conversion from ISO-8859-1 to a Unicode encoding will handle "ESPAÑOL" correctly, because that word can be encoded in ISO-8859-1.
The most likely explanation is that somewhere along the way, the text "ESPAÑOL" is being converted to a character set that does not contain the letter Ñ.
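One place worth checking, as an assumption since the question does not show how the response is read, is the point where the response stream becomes a string. A StreamReader created without an explicit encoding defaults to UTF-8. The sketch below passes the charset explicitly; ResponseReader and ReadAll are hypothetical names, and a MemoryStream stands in for the real response stream.

```csharp
using System;
using System.IO;
using System.Text;

public static class ResponseReader
{
    // Hypothetical helper: reads a whole stream as text using the named charset.
    public static string ReadAll(Stream body, string charset)
    {
        Encoding encoding = Encoding.GetEncoding(charset);
        using (var reader = new StreamReader(body, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Stand-in for the response stream: "ESPAÑOL" in ISO-8859-1.
        byte[] raw = { 0x45, 0x53, 0x50, 0x41, 0xD1, 0x4F, 0x4C };
        using (var body = new MemoryStream(raw))
        {
            string text = ReadAll(body, "ISO-8859-1");
            Console.WriteLine(text == "ESPA\u00D1OL");
        }
    }
}
```

With a real request, the stream would come from the response's GetResponseStream(), and the charset could be taken from HttpWebResponse.CharacterSet when the server reports it correctly.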