Keep XML entities in output (jSoup)
I'm using jsoup to do some XML processing. The problem is that it replaces XML numeric entities, e.g. &#187;, with HTML named entities: &raquo;
How can I keep the original (XML) entities?
Groovy script:
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.jsoup.nodes.Entities
import org.jsoup.parser.Parser
String HTML_STRING = '''
<html>
<div></div>
<div>Some text &#187;</div>
</html>
'''
Document doc = Jsoup.parse(new ByteArrayInputStream(HTML_STRING.getBytes("UTF-8")), "UTF-8", "", Parser.xmlParser())
doc.outputSettings().charset("UTF-8")
doc.outputSettings().escapeMode(Entities.EscapeMode.base)
println doc.toString()
Result:
<html>
<div></div>
<div>
Some text &raquo;
</div>
</html>
If I use Entities.EscapeMode.xhtml, the result is:
<html>
<div></div>
<div>
Some text »
</div>
</html>
Thanks.
You want to use a combination of EscapeMode.xhtml (which is the default if you use the XML parser, not the HTML parser) and ascii as the output character set.
The default output charset is UTF-8, and jsoup prefers not to use entity escapes when the output charset supports the character directly (why waste CPU and bandwidth on unnecessary escapes?).
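The rule described above (escape a character only when the output charset cannot represent it) can be sketched with the Java standard library alone. This is an illustration of the idea, not jsoup's actual code: the class name CharsetAwareEscaper and the method escape are hypothetical, and the decimal form of the numeric reference is an assumption made for readability.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jsoup: escapes text for XML output and
// falls back to numeric character references only when needed.
class CharsetAwareEscaper {

    public static String escape(String text, Charset outputCharset) {
        CharsetEncoder encoder = outputCharset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            switch (codePoint) {
                // Markup-significant characters are always escaped.
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:
                    String ch = new String(Character.toChars(codePoint));
                    if (encoder.canEncode(ch)) {
                        // The charset can represent it: write it as-is.
                        out.append(ch);
                    } else {
                        // Otherwise emit a decimal numeric reference
                        // (assumed format for this sketch).
                        out.append("&#").append(codePoint).append(';');
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Some text \u00BB";
        // UTF-8 can encode U+00BB, so the character is written unescaped.
        System.out.println(escape(text, StandardCharsets.UTF_8));
        // US-ASCII cannot, so this prints: Some text &#187;
        System.out.println(escape(text, StandardCharsets.US_ASCII));
    }
}
```

With UTF-8 as the target nothing needs escaping, which matches the behaviour seen in the question; switching to an ASCII target forces the reference.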
If you change the output charset to ascii using Document.OutputSettings.charset("ascii"), you'll get the output you want.
You probably also want to set the output syntax to XML if you are working with HTML, as otherwise the HTML parser will try to make the output conform to HTML and can munge your XML DOM tree.
(Source: author of jsoup)
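Putting the advice together, a minimal Java sketch might look like the following. It assumes jsoup is on the classpath; the class name KeepNumericEntities and the method render are hypothetical names chosen for this example, and the input is passed as a String rather than a stream for brevity.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.parser.Parser;

// Hypothetical wrapper class for illustration only.
class KeepNumericEntities {

    public static String render(String xml) {
        // Parse with the XML parser so the tree is not normalised as HTML.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        doc.outputSettings()
           // ASCII cannot represent U+00BB, so jsoup has to escape it.
           .charset("ascii")
           // xhtml mode knows only a handful of named entities, so the
           // character cannot come out as a named HTML entity.
           .escapeMode(Entities.EscapeMode.xhtml)
           // Serialise using XML syntax rules.
           .syntax(Document.OutputSettings.Syntax.xml);
        return doc.outerHtml();
    }

    public static void main(String[] args) {
        String xml = "<html><div></div><div>Some text &#187;</div></html>";
        System.out.println(render(xml));
    }
}
```

The character should then appear as a numeric character reference rather than as &raquo; or a literal ». jsoup chooses the exact form of that reference, so it may be hexadecimal rather than the decimal &#187; of the input; both refer to the same character.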