What is the best way of removing rogue ampersands in XML?

Views: 76 | Published: 2020/5/1 7:41:19 | Tags: c# regex xml linq-to-xml


Problem description


(TLDR at the bottom)

We have a legacy system that has implemented its own XML reader/writer. The problem is that it allows a literal "&" inside an attribute value.

<SB nae="Name" net="HV & DD"/>

When I am reading the data using XDocument.Parse() method, this fails of course. I am looking at ways of sanitizing the data.

I am attempting to use regex to identify cases where this is happening. To illustrate, consider this:

&(?!amp\;)

This identifies an ampersand, using a negative lookahead to ensure it isn't actually a correctly escaped ampersand. Once I have identified these cases, I can substitute a proper &amp;

Of course, the problem is that this will also match other escaped characters such as &gt;, &lt;, &quot;, etc., so I need to exclude those as well. Maybe by using a more general form, like a regex that excludes an ampersand followed by 2-4 characters and then a semicolon.
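To make the over-matching concrete, here is a minimal C# sketch that applies the first pattern from the question. The class and method names are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class NaiveAmpersandDemo
{
    // The question's first pattern: any '&' that is not followed by "amp;".
    public static string EscapeNaively(string xml) =>
        Regex.Replace(xml, @"&(?!amp\;)", "&amp;");
}
```

With the input `<SB net="HV & DD &gt; X"/>` the bare ampersand is fixed, but the already valid `&gt;` is escaped a second time and becomes `&amp;gt;`, which is the problem described above.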

But my worry is that there are other cases for ampersands that I am not thinking of and that are not represented in the few samples I have got. I am looking for a safe way that will not mess up proper xml.

TLDR: How do I identify ampersands that are not part of proper XML, but are cases of unescaped ampersands in attribute values?

Solution

You can replace every match of the following regex pattern with &amp;:

&(?!(?:#\d+|#x[0-9a-f]+|\w+);)

Demo: https://regex101.com/r/3MTLY9/2
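A minimal sketch of how this pattern might be applied in C# before calling XDocument.Parse(). The class and method names are hypothetical. `RegexOptions.IgnoreCase` is an assumption added here so that uppercase hex digits, as in `&#x3C;`, are also recognized; the pattern as written only lists `a-f`.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class AmpersandSanitizer
{
    // Matches '&' unless it starts a decimal reference (&#38;), a hex
    // reference (&#x26;), or a name-like entity (&gt;) terminated by ';'.
    // IgnoreCase is an assumption of this sketch, not part of the answer.
    private static readonly Regex RogueAmpersand = new Regex(
        @"&(?!(?:#\d+|#x[0-9a-f]+|\w+);)",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Escapes only the ampersands that do not look like references.
    public static string EscapeRogueAmpersands(string rawXml) =>
        RogueAmpersand.Replace(rawXml, "&amp;");

    // Sanitizes the raw text first, then hands it to the standard parser.
    public static XDocument ParseLenient(string rawXml) =>
        XDocument.Parse(EscapeRogueAmpersands(rawXml));
}
```

Calling `AmpersandSanitizer.ParseLenient("<SB nae=\"Name\" net=\"HV & DD\"/>")` should parse, with the `net` attribute reading back as `HV & DD`. Two caveats apply to any text-level fix like this one. Anything shaped like an entity is left alone, so an undeclared name such as `&nbsp;`, or text like `AT&T;`, will still make the parser throw. Ampersands inside CDATA sections or comments are legal as-is and would be altered by the replacement.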
