Why does XmlTextReader automatically convert HTML-encoded UTF-8 characters into a UTF-8 string?
Question
I receive an XML file with the encoding "ISO-8859-1" (Latin-1).
Within the file (among other tags) I have:
<OtherText>Example &quot;content&quot; And &#9472;</OtherText>
For some reason, when I load this into XmlTextReader and read XmlReader.Value, it returns: "content" And ─
That value then fails, as expected, when it reaches a database that only accepts Latin-1 encoding.
I have tried the following:
- Converting into bytes and using Encoding.Convert to change from UTF-8 into Latin-1 (which gives me a bunch of "?" instead)
- Using StreamReader(file, Encoding.whatever) to load the file into XmlTextReader
I have also tried several variations of these, along with other methods found on the internet and on StackOverflow itself.
I understand that .NET strings are UTF-16. What I don't understand is this: the XML file is fully Latin-1, and it uses correct markup wherever UTF-8 characters occur, which keeps it compatible with older databases and with the web (HTML markup etc.). Why does the reader override that markup and output the UTF-8 encoded string anyway?
Is there no way to get around this other than writing my own custom text parser?
Recommended answer
I do not believe this is a problem with the encoding. What you're seeing is the XML string being un-escaped.
The problem is that &quot; is an XML escape sequence, so XmlTextReader will un-escape it for you. The same applies to numeric character references such as &#9472;.
If you change this:
<OtherText>Example &quot;content&quot; And &#9472;</OtherText>
to this:
<OtherText>Example &amp;quot;content&amp;quot; And &amp;#9472;</OtherText>
then
XmlReader.Value = "Example &quot;content&quot; And &#9472;";
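The un-escaping described above can be reproduced with a short sketch. The class and method names (UnescapeDemo, ReadText) are made up for illustration, and the sketch uses XmlReader.Create with a StringReader rather than the file-based XmlTextReader from the question, on the assumption that both un-escape in the same way.

```csharp
using System;
using System.IO;
using System.Xml;

public static class UnescapeDemo
{
    // Hypothetical helper: returns the text content of the root element.
    public static string ReadText(string xml)
    {
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();                      // position on <OtherText>
            return reader.ReadElementContentAsString();  // entities are resolved here
        }
    }

    public static void Main()
    {
        // Escaped once: the reader resolves &quot; and &#9472; to the
        // characters " and U+2500.
        string once = "<OtherText>Example &quot;content&quot; And &#9472;</OtherText>";

        // Escaped twice: the reader resolves only &amp;, so the literal
        // text &quot; and &#9472; survives in the returned string.
        string twice = "<OtherText>Example &amp;quot;content&amp;quot; And &amp;#9472;</OtherText>";

        Console.WriteLine(ReadText(once));
        Console.WriteLine(ReadText(twice));
    }
}
```

Double escaping only helps if you control whatever produces the file. Otherwise the value has to be re-escaped after reading.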
You'll need to wrap your value in CDATA so that the parser leaves it untouched.
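As a rough illustration of the CDATA approach, the sketch below reads an element whose content is wrapped in a CDATA section. It assumes the producer of the file can be changed to emit CDATA, and the names CdataDemo and ReadText are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

public static class CdataDemo
{
    // Hypothetical helper: returns the text content of the root element.
    public static string ReadText(string xml)
    {
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            // CDATA content is returned verbatim; nothing inside it is
            // treated as an entity or character reference.
            return reader.ReadElementContentAsString();
        }
    }

    public static void Main()
    {
        string xml =
            "<OtherText><![CDATA[Example &quot;content&quot; And &#9472;]]></OtherText>";

        // The returned string still contains the literal text
        // &quot; and &#9472;, which is plain ASCII and so fits in Latin-1.
        Console.WriteLine(ReadText(xml));
    }
}
```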
Another option is to re-escape the string:
using System.Security;
....
// Escape re-escapes only <, >, &, " and '; characters such as ─ are left unchanged.
string val = SecurityElement.Escape(xmlReader.Value);
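SecurityElement.Escape covers only the five XML special characters, so a character like ─ (U+2500) would still not fit into a Latin-1 column. One possible extension, sketched below, also rewrites every character above U+00FF as a numeric character reference. Latin1Escaper and EscapeForLatin1 are hypothetical names, not part of .NET, and the sketch assumes the database is meant to store the escaped markup rather than the decoded characters.

```csharp
using System;
using System.Globalization;
using System.Security;
using System.Text;

public static class Latin1Escaper
{
    // Hypothetical helper: re-escapes XML special characters, then replaces
    // anything outside ISO-8859-1 (code points above U+00FF) with &#N;.
    public static string EscapeForLatin1(string value)
    {
        string escaped = SecurityElement.Escape(value) ?? string.Empty;
        var sb = new StringBuilder(escaped.Length);

        for (int i = 0; i < escaped.Length; i++)
        {
            char c = escaped[i];
            if (c <= '\u00FF')
            {
                sb.Append(c);   // representable in Latin-1 as-is
                continue;
            }

            int codePoint = c;
            // Combine a surrogate pair into one code point where present.
            if (char.IsHighSurrogate(c) && i + 1 < escaped.Length
                && char.IsLowSurrogate(escaped[i + 1]))
            {
                codePoint = char.ConvertToUtf32(c, escaped[i + 1]);
                i++;
            }

            sb.Append("&#")
              .Append(codePoint.ToString(CultureInfo.InvariantCulture))
              .Append(';');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Yields: &quot;content&quot; And &#9472;
        Console.WriteLine(EscapeForLatin1("\"content\" And \u2500"));
    }
}
```

With a helper along these lines, the value read from the reader could be passed through EscapeForLatin1 before it is written to the Latin-1 database.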