Problems reading CDATA section with special chars (ISO-8859-1 encoding)


Problem description




I am trying to read an XML stream and load it into a collection.

This works, but I'm having difficulties reading special chars.

E.g. if my XML looks like this:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<persons>
  <person>
    <firstname>
      <![CDATA[ Sébastien ]]>
    </firstname>
    <lastname>
      <![CDATA[Ørvåk]]>
    </lastname>
  </person>
</persons>

I try to read the values using LINQ like this:

var persons = from p in doc.Elements("persons").Elements("person") select p;
foreach (var person in persons)
{
    string firstname = person.Element("firstname").Value;
    string lastname = person.Element("lastname").Value;
}

but the Ø and å in Ørvåk and the é in Sébastien come out as strange chars.

Does anyone know what's wrong? I guess it doesn't use the ISO-8859-1 encoding.

Thanks

Solution

To expand on an answer someone else gave:

There are two possibilities:

  1. The file is really encoded as UTF-8, but is being interpreted by your XML parser as ISO-8859-1.
  2. The file is really encoded as ISO-8859-1, but is being interpreted by your XML parser as UTF-8.

To determine which is which, look at what happens with the é in Sébastien. There are two possibilities I can imagine:

  1. "é" becomes two different characters - probably "Ã©"
  2. "é" becomes a single nonsense character or "?", and possibly the "b" is also missing from the name Sébastien.
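
Both symptoms can be reproduced in a few lines, which may help when matching what you see against one of the two cases. This is only a minimal sketch and not part of the original answer: the class name MojibakeDemo is made up for illustration, and the name is written with a \u00E9 escape so the result does not depend on how the source file itself is saved.

```csharp
using System;
using System.Text;

class MojibakeDemo
{
    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Encoding utf8 = Encoding.UTF8;
        const string name = "S\u00E9bastien"; // "Sébastien"

        // Case 1: UTF-8 bytes decoded as ISO-8859-1.
        // "é" is the two bytes C3 A9 in UTF-8, and each byte is read back
        // as its own character, giving "SÃ©bastien".
        string case1 = latin1.GetString(utf8.GetBytes(name));

        // Case 2: ISO-8859-1 bytes decoded as UTF-8.
        // "é" is the single byte E9, which is not a complete UTF-8 sequence,
        // so the decoder substitutes a replacement character for it.
        string case2 = utf8.GetString(latin1.GetBytes(name));

        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(case1);
        Console.WriteLine(case2);
    }
}
```

Exactly what the second case looks like depends on the decoder and on the console font, which is why the description above is hedged with "or" and "possibly".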

In the first case, your file is not what you think it is: it is reaching your program as UTF-8 data, but your program is trying to interpret it as ISO-8859-1. Look at the XML file with a hex editor or something else that can show you what the bytes on the disk are.

In the second case, I'd check how the HTTP server on localhost is serving this file: your program is getting bytes in ISO-8859-1 format, but is interpreting them as UTF-8. The easiest way to do that on Windows is to open up a cmd prompt and run the command: telnet localhost 80

When that pops up a window, type the following line (or copy and paste it from Stack Overflow) and press Enter twice. Warning: you won't be able to see what you're typing, and capitalization is important.
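
If no hex editor is at hand, a few lines of C# can stand in for one. This is a rough sketch under assumptions: the file is taken to be saved locally as person.xml (a hypothetical path; pass the real one as the first argument), and the class name HexPeek is made up. It prints every run of non-ASCII bytes with its offset.

```csharp
using System;
using System.IO;

class HexPeek
{
    static void Main(string[] args)
    {
        // Hypothetical default path; pass the real file as the first argument.
        string path = args.Length > 0 ? args[0] : "person.xml";
        byte[] bytes = File.ReadAllBytes(path);

        int i = 0;
        while (i < bytes.Length)
        {
            if (bytes[i] < 0x80) { i++; continue; } // skip plain ASCII

            // Collect the whole run of non-ASCII bytes and print it as hex.
            int start = i;
            while (i < bytes.Length && bytes[i] >= 0x80) i++;
            Console.WriteLine("offset {0}: {1}", start,
                BitConverter.ToString(bytes, start, i - start));
        }
    }
}
```

A lone E9 where the é should be means the file is ISO-8859-1; the pair C3-A9 means it is UTF-8.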

GET /Test/person.xml HTTP/1.0

In the response, look for a line beginning with Content-Type. That will tell you how the local web server is serving up the file and, if the header carries a charset parameter, which encoding it declares.
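
If telnet is not installed, the same header can be read from code. A small sketch, assuming the file lives at http://localhost/Test/person.xml (inferred from the request line above, so adjust it to your setup); the class name ContentTypeCheck is made up.

```csharp
using System;
using System.Net;

class ContentTypeCheck
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            // Assumed URL, pieced together from "telnet localhost 80"
            // and "GET /Test/person.xml".
            client.DownloadData("http://localhost/Test/person.xml");

            // ResponseHeaders is filled in once the request has completed.
            Console.WriteLine(client.ResponseHeaders["Content-Type"]);
        }
    }
}
```

If the printed value has no charset parameter, the server is not declaring an encoding at all and the client falls back to whatever default it uses.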

Update: Having looked at your file, it really is ISO-8859-1, so what I would suggest is setting the Encoding property of your WebClient instance like so before you tell it to download the file:

client.Encoding = System.Text.Encoding.GetEncoding("iso-8859-1");

Alternatively, you could use the DownloadData method instead of the DownloadString method, and then parse the bytes into an XML document. The problem currently is that by the time the XML parser gets the file contents, the bytes have already been interpreted as a string, so it's too late to change the encoding there.
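
The byte-based alternative could look roughly like this. WebClient's byte-returning download method is DownloadData, and handing the raw bytes to the parser lets it pick up the encoding from the XML declaration itself. This is a sketch under assumptions: the URL is inferred from the telnet example above, the class name LoadPersons is made up, and Trim() is there only because the CDATA in the question has spaces around the name.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Net;
using System.Xml;
using System.Xml.Linq;

class LoadPersons
{
    static void Main()
    {
        byte[] raw;
        using (var client = new WebClient())
        {
            // Assumed URL; no string decoding happens at this point.
            raw = client.DownloadData("http://localhost/Test/person.xml");
        }

        XDocument doc;
        using (var stream = new MemoryStream(raw))
        using (XmlReader reader = XmlReader.Create(stream))
        {
            // The reader sees the bytes and honors encoding="ISO-8859-1"
            // from the XML declaration.
            doc = XDocument.Load(reader);
        }

        foreach (XElement person in doc.Elements("persons").Elements("person"))
        {
            string firstname = person.Element("firstname").Value.Trim();
            string lastname = person.Element("lastname").Value.Trim();
            Console.WriteLine("{0} {1}", firstname, lastname);
        }
    }
}
```

Compared with setting client.Encoding, this keeps working if the server later switches the file to another encoding, as long as the XML declaration is updated to match.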
