Why does XmlTextReader automatically convert HTML-encoded UTF-8 characters to a UTF-8 string?

Question

I receive an XML file with the encoding "ISO-8859-1" (Latin-1).

Within the file (among other tags) I have <OtherText>Example &quot;content&quot; And &#9472;</OtherText>

Now, for some reason, when I load this into XmlTextReader and read "XmlReader.Value", it returns: "content" And ─

This obviously causes errors when the value is then written to a database that only accepts Latin-1 encoding.

I have tried the following:

  • Converting into bytes and using Encoding.Convert to change from UTF-8 into Latin-1 (which successfully gives me a bunch of "?" instead)
  • Using StreamReader(file, Encoding.whatever) to load the file into XmlTextReader

I have also tried several variations thereof, as well as different methods found on the internet and on StackOverflow itself.

I understand that .NET strings are UTF-16. What I don't understand is this: the XML file is fully Latin-1, with CORRECT markup wherever UTF-8 characters occur, which keeps it compatible with older databases AND the web (HTML markup etc.), yet the reader simply overrides that markup and outputs the UTF-8 encoded string ANYWAY.

Is there no way to get around this other than writing my own custom text parser?

Recommended answer

I do not believe this is a problem with the encoding. What you're seeing is the XML string being un-escaped.

The problem is that &quot; is an XML escape sequence, so XmlTextReader will un-escape it for you. The same applies to the numeric character reference &#9472;, which the reader resolves to the character ─.
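
The un-escaping can be reproduced in isolation. The following is a minimal sketch, assuming the element from the question is held in a string rather than read from the Latin-1 file; `ReadFirstText` is a hypothetical helper name, not part of .NET.

```csharp
using System;
using System.IO;
using System.Xml;

static class UnescapeDemo
{
    // Hypothetical helper: returns the value of the first text node in the XML.
    public static string ReadFirstText(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text)
                {
                    // By this point the parser has already resolved
                    // &quot; and &#9472; into the characters they stand for.
                    return reader.Value;
                }
            }
        }
        return null;
    }

    static void Main()
    {
        string xml = "<OtherText>Example &quot;content&quot; And &#9472;</OtherText>";
        string value = ReadFirstText(xml);

        Console.WriteLine(value.Contains("&quot;"));      // False
        Console.WriteLine((int)value[value.Length - 1]);  // 9472, i.e. U+2500
    }
}
```

The code point is printed instead of the character itself because the console may not be able to display ─. The file's declared encoding plays no part here: the references are resolved by the XML parser after the bytes have been decoded.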

If you change this:

<OtherText>Example &quot;content&quot; And &#9472;</OtherText>

to this:

<OtherText>Example &amp;quot;content&amp;quot; And &amp;#9472;</OtherText>

then

   XmlReader.Value = "&quot;content&quot; And &#9472;";

You'll need to wrap your value in a CDATA section so that its contents are not interpreted by the parser.
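
As a rough illustration of the CDATA route, the sketch below assumes the producer of the file can be changed to emit a CDATA section; `ReadElementText` is a hypothetical helper name. Inside CDATA the sequences &quot; and &#9472; are plain characters, so the reader hands them back untouched.

```csharp
using System;
using System.IO;
using System.Xml;

static class CDataDemo
{
    // Hypothetical helper: returns the text content of the root element.
    public static string ReadElementText(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent(); // position the reader on the root element
            // Concatenates text and CDATA content of the element.
            return reader.ReadElementContentAsString();
        }
    }

    static void Main()
    {
        string xml =
            "<OtherText><![CDATA[Example &quot;content&quot; And &#9472;]]></OtherText>";

        // prints: Example &quot;content&quot; And &#9472;
        Console.WriteLine(ReadElementText(xml));
    }
}
```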

Another option is to re-escape the string:

    using System.Security;
    // ...
    // Re-escapes the XML special characters (&, <, >, ", ') in the decoded value.
    string val = SecurityElement.Escape(xmlReader.Value);
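
Note that SecurityElement.Escape only handles the five XML special characters, so a character such as ─ (U+2500) would still reach the Latin-1 database unescaped. One possible way to close that gap is sketched below, assuming every character above U+00FF should become a decimal character reference; `EscapeForLatin1` is a hypothetical helper, not a .NET API.

```csharp
using System;
using System.Globalization;
using System.Security;
using System.Text;

static class Latin1Escaper
{
    // Hypothetical helper: escapes XML special characters, then replaces every
    // character outside Latin-1 (above U+00FF) with a numeric character reference.
    public static string EscapeForLatin1(string value)
    {
        if (string.IsNullOrEmpty(value))
        {
            return value;
        }

        // Handles &, <, >, " and ' only.
        string escaped = SecurityElement.Escape(value);
        var sb = new StringBuilder(escaped.Length);

        for (int i = 0; i < escaped.Length; i++)
        {
            char c = escaped[i];
            int codePoint;

            if (char.IsHighSurrogate(c) && i + 1 < escaped.Length
                && char.IsLowSurrogate(escaped[i + 1]))
            {
                // A surrogate pair is one code point and gets one reference.
                codePoint = char.ConvertToUtf32(c, escaped[i + 1]);
                i++;
            }
            else if (c > 0xFF)
            {
                codePoint = c;
            }
            else
            {
                sb.Append(c); // representable in Latin-1, keep as is
                continue;
            }

            sb.Append("&#")
              .Append(codePoint.ToString(CultureInfo.InvariantCulture))
              .Append(';');
        }

        return sb.ToString();
    }

    static void Main()
    {
        // prints: &quot;content&quot; And &#9472;
        Console.WriteLine(EscapeForLatin1("\"content\" And \u2500"));
    }
}
```

The result contains only ASCII characters for the example from the question, so it can be stored in a Latin-1 column and still renders correctly wherever the references are interpreted, such as in HTML.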
