Parsing XML with ampersand

Question

I have a string that contains XML. I just want to parse it into an XElement, but it contains an ampersand. I still can't parse it, even after running it through HtmlDecode. Any suggestions?

string test = "<MyXML><SubXML><XmlEntry Element=\"test\" value=\"wow&\" /></SubXML></MyXML>";

XElement.Parse(HttpUtility.HtmlDecode(test));

I also added these calls to replace those characters, but I am still getting an XmlException.

string encodedXml = test.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;").Replace("'", "&apos;");
XElement myXML = XElement.Parse(encodedXml);

I even tried it with this:

string newContent = SecurityElement.Escape(test);
XElement myXML = XElement.Parse(newContent);

Solution

Ideally the XML is escaped properly prior to your code consuming it. If this is beyond your control you could write a regex. Do not use the String.Replace method unless you're absolutely sure the values do not contain other escaped items.

For example, "wow&amp;".Replace("&", "&amp;") results in wow&amp;amp; which is clearly undesirable.
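The pitfall can be checked directly. The sketch below is illustrative only: the class and method names are hypothetical and not part of the original answer. It shows the double-encoding described above, and also why escaping the whole document (as in the question's Replace chain and SecurityElement.Escape attempts) fails: the markup's own angle brackets get escaped too, so the parser no longer sees an element.

```csharp
using System;
using System.Security;
using System.Xml;
using System.Xml.Linq;

public static class NaiveEscapeDemo
{
    // Blindly replacing "&" re-escapes entities that were already valid.
    public static string NaiveReplace(string value)
    {
        return value.Replace("&", "&amp;");
    }

    // Escapes every special character, including the tags' own < and >.
    public static string EscapeWholeDocument(string xml)
    {
        return SecurityElement.Escape(xml);
    }

    // Returns true if XElement.Parse accepts the text, false on XmlException.
    public static bool CanParse(string xml)
    {
        try
        {
            XElement.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

With these helpers, NaiveReplace("wow&amp;") yields wow&amp;amp;, and CanParse(EscapeWholeDocument(test)) is false because the escaped text contains no root element.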

Regex.Replace gives you more control to avoid this scenario: the pattern can be written to match only "&" symbols that are not already the start of an entity such as &lt;, something like:

string result = Regex.Replace(test, "&(?!(amp|apos|quot|lt|gt);)", "&amp;");

The above works, but admittedly it doesn't cover the variety of other entities that start with an ampersand, such as &nbsp;, and the list can grow.
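One part of that gap can be closed without enumerating names: numeric character references such as &#38; or &#x26; are valid XML and can be skipped by pattern. The following is a minimal sketch under stated assumptions, not the original answer's code: the class AmpersandEscaper and its methods are hypothetical names, and named HTML entities such as &nbsp; are deliberately still treated as bare ampersands because XML does not predefine them.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AmpersandEscaper
{
    // Matches "&" unless it starts one of the five predefined XML entities
    // or a decimal/hexadecimal character reference (&#38; or &#x26;).
    private static readonly Regex BareAmpersand = new Regex(
        "&(?!(?:amp|apos|quot|lt|gt|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Escapes only the ampersands that would make the XML ill-formed.
    public static string EscapeBareAmpersands(string xml)
    {
        return BareAmpersand.Replace(xml, "&amp;");
    }

    // Convenience wrapper: sanitize first, then hand the text to XElement.Parse.
    public static XElement ParseLenient(string xml)
    {
        return XElement.Parse(EscapeBareAmpersands(xml));
    }
}
```

Calling AmpersandEscaper.ParseLenient(test) on the string from the question should return an element whose value attribute reads wow&. Anything else that is malformed in the input will still raise an XmlException.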

A more flexible approach would be to decode the content of the value attribute, then re-encode it. If you have value="&wow&amp;" the decode process would return "&wow&" then re-encoding it would return "&amp;wow&amp;", which is desirable. To pull this off you could use this:

string result = Regex.Replace(test, @"value=\""(.*?)\""", m => "value=\"" +
    HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups[1].Value)) +
    "\"");
var doc = XElement.Parse(result);

Bear in mind that the above regex only targets the contents of the value attribute. If there are other areas in the XML structure that suffer from the same issue then it can be tweaked to match them and replace their content in a similar fashion.
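As one possible way to widen the match, the same decode-then-encode step can be applied to every double-quoted attribute rather than only value. This is a sketch under assumptions: AttributeSanitizer is a hypothetical name, System.Net.WebUtility is swapped in for HttpUtility so that no System.Web reference is needed, single-quoted attributes are not handled, and the pattern would also rewrite any name="text" sequence that happens to appear in element text.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class AttributeSanitizer
{
    // name="value" pairs; the value is everything up to the next double quote.
    private static readonly Regex AttributePattern = new Regex(
        @"(?<name>[\w:.-]+)\s*=\s*""(?<value>[^""]*)""");

    // Decodes each attribute value, then re-encodes it, so bare and already
    // escaped ampersands both end up escaped exactly once.
    public static string SanitizeAttributes(string xml)
    {
        return AttributePattern.Replace(xml, m =>
            m.Groups["name"].Value + "=\"" +
            WebUtility.HtmlEncode(WebUtility.HtmlDecode(m.Groups["value"].Value)) +
            "\"");
    }

    // Sanitize the attributes, then parse the result.
    public static XElement Parse(string xml)
    {
        return XElement.Parse(SanitizeAttributes(xml));
    }
}
```

Because the value is decoded before it is encoded, an input of value="&wow&amp;" comes out as value="&amp;wow&amp;" rather than being double-encoded.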


EDIT: An updated solution that should handle content between tags as well as anything between double quotes. Be sure to test this thoroughly. Manipulating XML/HTML tags with regex is generally inadvisable, as it is error-prone and easily becomes over-complicated. Your case is somewhat special, since the input has to be sanitized before it can be parsed at all.

string pattern = "(?<start>>)(?<content>.+?(?<!>))(?<end><)|(?<start>\")(?<content>.+?)(?<end>\")";
string result = Regex.Replace(test, pattern, m =>
    m.Groups["start"].Value +
    HttpUtility.HtmlEncode(HttpUtility.HtmlDecode(m.Groups["content"].Value)) +
    m.Groups["end"].Value);
var doc = XElement.Parse(result);
