Jsoup unescapes special characters

Views: 240 Published: 2017/8/28 22:37:35 Tags: html character-encoding escaping jsoup


Problem description

I'm using Jsoup to remove all the images from an HTML page. I receive the page through an HTTP response, which also specifies the content charset.

The problem is that Jsoup unescapes some special characters.

For example, for the input:

<html><head></head><body><p>isn&rsquo;t</p></body></html>

running

String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>";
Document doc = Jsoup.parse(check);
System.out.println(doc.outerHtml());

I get:

<html><head></head><body><p>isn’t</p></body></html>

I want to avoid changing the HTML in any way other than removing the images.

If I use the command:

doc.outputSettings().prettyPrint(false).charset("ASCII").escapeMode(EscapeMode.extended);

I do get the correct output, but I'm sure there are cases where that charset won't be suitable. I just want to use the charset specified in the HTTP header, and I'm afraid that forcing ASCII will change my document in ways I can't predict. Is there a cleaner way to remove the images without inadvertently changing anything else?

Thanks!

Recommended answer

Here is a workaround that does not involve any charset other than the one specified in the HTTP header.

// Mask every entity (&name;) as **name; so that Jsoup does not unescape it.
String check = "<html><head></head><body><p>isn&rsquo;t</p></body></html>".replaceAll("&([^;]+?);", "**$1;");
Document doc = Jsoup.parse(check);
doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);
// Restore the masked entities in the serialized output.
System.out.println(doc.outerHtml().replaceAll("\\*\\*([^;]+?);", "&$1;"));

OUTPUT

<html><head></head><body><p>isn&rsquo;t</p></body></html>
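The masking idea in this workaround can be isolated into two small helpers. The following is only a sketch using the Java standard library: the class name `EntityMask` and the method names are hypothetical, and the pattern is deliberately narrower than the answer's `&([^;]+?);` so that stray text such as `a & b; c` is left alone. It still assumes the document contains no literal text of the form `**name;`, which would be wrongly turned into an entity when unmasking.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: hides entities before parsing
// and restores them after serialization.
class EntityMask {
    // Named (&rsquo;), decimal (&#8217;) and hexadecimal (&#x2019;) references.
    private static final Pattern ENTITY = Pattern.compile("&(#?[A-Za-z0-9]+);");
    // The masked form produced by mask().
    private static final Pattern MASKED = Pattern.compile("\\*\\*(#?[A-Za-z0-9]+);");

    // Replace the leading '&' of each entity with "**" so a parser sees plain text.
    static String mask(String html) {
        return ENTITY.matcher(html).replaceAll("**$1;");
    }

    // Turn "**name;" back into "&name;".
    static String unmask(String html) {
        return MASKED.matcher(html).replaceAll("&$1;");
    }

    public static void main(String[] args) {
        String in = "<p>isn&rsquo;t &#8217; &#x2019;</p>";
        String masked = mask(in);
        System.out.println(masked);          // <p>isn**rsquo;t **#8217; **#x2019;</p>
        System.out.println(unmask(masked));  // <p>isn&rsquo;t &#8217; &#x2019;</p>
    }
}
```

In the workaround, `Jsoup.parse(...)`, the image removal, and `doc.outerHtml()` would sit between the `mask` and `unmask` calls.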

Discussion

I wish there was a solution in Jsoup's API - @dlv

Using Jsoup's API would require you to write a custom NodeVisitor. That would lead to (re)inventing code that already exists inside Jsoup. The custom NodeVisitor would write an HTML escape code back out instead of the Unicode character.

Another option would involve writing a custom character encoder. The default UTF-8 character encoder can encode the ’ character (&rsquo;) directly. This is why Jsoup doesn't preserve the original escape sequence in the final HTML code.
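The role of the encoder can be checked with the standard library alone. The sketch below assumes, as the answer implies, that the serializer escapes a character only when the output charset's encoder cannot represent it; the class name `EncoderCheck` is hypothetical. It shows why forcing ASCII in the question brings the escape back while UTF-8 does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: asks each charset whether it can encode U+2019 (’).
class EncoderCheck {
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char rsquo = '\u2019'; // right single quotation mark, i.e. &rsquo;
        System.out.println(canEncode(StandardCharsets.UTF_8, rsquo));      // true
        System.out.println(canEncode(StandardCharsets.US_ASCII, rsquo));   // false
        System.out.println(canEncode(StandardCharsets.ISO_8859_1, rsquo)); // false
    }
}
```

A side effect worth noting: with an ASCII output charset, every non-ASCII character would be escaped, including ones that were written literally in the source page.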

Either of the two options above represents a big coding effort. Ultimately, an enhancement could be added to Jsoup to let us choose how characters are generated in the final HTML code: hexadecimal escape (&#xAB;), decimal escape (&#151;), the original escape sequence (&rsquo;), or the encoded character itself (which is the case in your post).
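To make the hexadecimal and decimal choices concrete, here is a minimal standard-library sketch of what such an option might emit. It is not Jsoup code; the class `NumericEscape` and its `escape` method are hypothetical, and the sketch only rewrites non-ASCII code points, leaving `&`, `<` and `>` untouched on the assumption that the text is already valid HTML text content.

```java
// Hypothetical illustration of numeric character references.
class NumericEscape {
    // Rewrites every non-ASCII code point as &#xHEX; (hex = true) or &#DEC; (hex = false).
    static String escape(String text, boolean hex) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);            // plain ASCII passes through
            } else if (hex) {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("isn\u2019t", true));  // isn&#x2019;t
        System.out.println(escape("isn\u2019t", false)); // isn&#8217;t
    }
}
```

Neither form recovers the original `&rsquo;` spelling, since the parsed document no longer records how the character was written; that is why preserving the original sequence would need either the masking workaround or support inside the library.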
