Java: Escape XML text content instead of entire text


Problem description

I want to send the XML request below. The text content should be escaped, but the tags should not.

I tried the following escaping logic:
String str = escapeXml11(req);

However, my whole request gets escaped, so it is no longer valid XML.

My original string:

String req =
"<request>\r\n" 
  + " <Products>\r\n" 
    + " <Product>\r\n" 
      + " <ProductName>H < M</ProductName>\r\n" 
      + " <quantity>1</quantity>\r\n" 
      + " <totalProductCost>17.03</totalProductCost>\r\n" 
    + " </Product>\r\n" 
  + " </Products>\r\n" 
+ "</request>"; 

After escaping:

&lt;request&gt;
    &lt;ProductName&gt;H &lt; M&lt;/ProductName&gt;
    &lt;quantity&gt;1&lt;/quantity&gt;
    &lt;totalProductCost&gt;17.03&lt;/totalProductCost&gt;
&lt;/request&gt;

Expected result:

<request>
    <ProductName>H &lt; M</ProductName>
    <quantity>1</quantity>
    <totalProductCost>17.03</totalProductCost>
</request>

How can I escape only the text content?

Recommended answer

The root of this problem is that the "XML" the third party is providing to you is not well-formed.

<request>
  <Products>
    <Product>
      <ProductName>H < M</ProductName>
      <quantity>1</quantity>
      <totalProductCost>17.03</totalProductCost>
    </Product>
  </Products> 
</request>

To correct this, you would need to convert "H < M" to "H &lt; M". That is easy for a human to do, although accuracy suffers if there is a lot of it to fix by hand. Automating it is difficult.

Obviously, simply calling an escape method won't work. An escape method can't determine what needs to be escaped without parsing the XML. (Methods like escapeXml11 only work if the entire string needs to be escaped.)

A normal XML parser would see the "< M" and try to treat it as the start of an element tag. Then it would see the next "<" and fail with an error. To proceed further, it would have to backtrack to the "< M" and treat the "<" as if it had been escaped.
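The rejection described above can be observed with the parser that ships with the JDK. The following is a minimal sketch, not part of the original answer: the class name WellFormedCheck and the helper isWellFormed are hypothetical names chosen for illustration, and the check uses the standard javax.xml.parsers DOM API.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class WellFormedCheck {

    /** Returns true if the JDK's DOM parser accepts the input as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Thrown when the document is not well-formed, e.g. a raw "<" in text.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<ProductName>H < M</ProductName>"));    // false
        System.out.println(isWellFormed("<ProductName>H &lt; M</ProductName>")); // true
    }
}
```

A check like this only tells you that the input is broken; it does not tell you how to repair it, which is why the repair has to happen before parsing.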

I am aware of one HTML / XML parser (JSoup) that can deal with misplaced "<" characters. However, if I understand things correctly, it deals with this problem the wrong way for your use case. Instead of treating the "< M" as data, it would turn it into a start tag:

<request>
  <Products>
    <Product>
      <ProductName>H <M></ProductName>
      <quantity>1</quantity>
      <totalProductCost>17.03</totalProductCost>
    </Product>
  </Products> 
</request>


That leaves you with two alternatives:

  • You could try to detect and fix the problem with some pattern matching. For example, if you know that the malformed data is in <ProductName>...</ProductName> elements, then you could use a regex to search for these elements, check and (if necessary) correct the content, and replace it.

  • You could write a custom parser for your XML with a context-sensitive lexer. When the parser sees a <ProductName>, it switches the lexer into a different mode that treats "<" as data unless it is the start of </ProductName>.
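The pattern-matching alternative could be sketched roughly as follows. This is an illustration under assumptions, not a general solution: it assumes the malformed data occurs only inside <ProductName> elements, that those elements contain text only (no child elements, no attributes on the tag), and that the text never contains the literal string </ProductName>. The class and method names (ProductNameFixer, fix, escapeContent) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class ProductNameFixer {

    // Matches one <ProductName>...</ProductName> element; group 1 is its raw content.
    private static final Pattern PRODUCT_NAME =
        Pattern.compile("<ProductName>(.*?)</ProductName>", Pattern.DOTALL);

    // An '&' that does not already start an entity or character reference.
    private static final Pattern BARE_AMP =
        Pattern.compile("&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /** Escapes '&', '<' and '>' in element text, leaving existing references alone. */
    static String escapeContent(String raw) {
        String s = BARE_AMP.matcher(raw).replaceAll("&amp;");
        return s.replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Rewrites the content of every <ProductName> element; everything else is copied as-is. */
    static String fix(String xml) {
        Matcher m = PRODUCT_NAME.matcher(xml);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(xml, last, m.start(1));     // text up to the element content
            out.append(escapeContent(m.group(1))); // repaired content
            last = m.end(1);
        }
        out.append(xml, last, xml.length());       // remainder after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        String req = "<Product><ProductName>H < M</ProductName><quantity>1</quantity></Product>";
        System.out.println(fix(req));
        // <Product><ProductName>H &lt; M</ProductName><quantity>1</quantity></Product>
    }
}
```

Because references such as &lt; that are already present are left untouched, running fix on input that has already been repaired should not double-escape it. The approach breaks down as soon as the bad data can appear in elements you did not anticipate, which is where the custom-lexer alternative becomes necessary.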

But before you go to the time and expense of writing a bunch of custom code to deal with this invalid XML:

  • Complain to the third party that is creating it. They should not be emitting rubbish like that. Their software or their data collection / sanitization is flawed. They should fix it.

  • Make sure that whoever is paying your software development and maintenance bills knows about this. For example, if you were contracted to write software that processes XML, this is not XML. If the customer didn't warn you that your software needed to cope with malformed XML, that is a change of requirements and could be (should be) a variation of the contract.

See also @Michael Kay's comment.
