Jsoup - How to clean HTML by escaping, not deleting, the unwanted HTML?

Question

Is there a way to get jsoup to clean a string containing HTML by escaping the unwanted HTML rather than removing it completely? My example:

String dirty = "This is <b>REALLY</b> dirty code from <a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
String clean = Jsoup.clean(dirty, new Whitelist().addTags("a").addAttributes("a", "href", "name", "rel", "target"));

This gives a "clean" string of:

This is    REALLY    dirty code from <a href="www.rubbish.url.zzzz">haxors-r-us</a>

What I want is for the "clean" string to be:

This is &lt;b&gt;REALLY&lt;/b&gt; dirty code from <a href="www.rubbish.url.zzzz">haxors-r-us</a>

Recommended answer

Assuming you are parsing strings rather than whole HTML documents (as in your question), this method will work:

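import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.safety.Whitelist;
import org.jsoup.select.Elements;

public String escapeHtml(String source) {
    Document doc = Jsoup.parseBodyFragment(source);
    Elements elements = doc.select("b");
    for (Element element : elements) {
        // A TextNode is escaped on output, so the tag's markup survives as literal text.
        element.replaceWith(new TextNode(element.toString(), ""));
    }
    return Jsoup.clean(doc.body().toString(), new Whitelist().addTags("a").addAttributes("a", "href", "name", "rel", "target"));
}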

You could replace the hard-coded "b" selector with a method parameter, so that callers can pass in the list of tags they want escaped.
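As a rough sketch of that idea, the variant below takes the tags to escape as a list. The class name HtmlEscaper and the method name escapeTags are hypothetical, and the sketch assumes a recent jsoup release, where Whitelist has been renamed to Safelist and TextNode has a single-argument constructor. It also assumes the tag names are plain element names that are safe to join into a CSS selector.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.safety.Safelist;

// Hypothetical helper: the class and method names are illustrative only.
class HtmlEscaper {

    static String escapeTags(String source, List<String> tagsToEscape) {
        Document doc = Jsoup.parseBodyFragment(source);

        if (!tagsToEscape.isEmpty()) {
            // Build a grouped selector such as "b, i" from the tag names.
            String selector = String.join(", ", tagsToEscape);
            for (Element element : doc.select(selector)) {
                // Swap the element for a text node holding its markup;
                // jsoup escapes text nodes when it serializes them.
                element.replaceWith(new TextNode(element.outerHtml()));
            }
        }

        // Same allowed tag and attributes as in the answer above.
        Safelist safelist = new Safelist()
                .addTags("a")
                .addAttributes("a", "href", "name", "rel", "target");

        return Jsoup.clean(doc.body().html(), safelist);
    }
}
```

Calling HtmlEscaper.escapeTags(dirty, List.of("b")) should then escape the "b" element while leaving the "a" element intact. Any tag that is neither escaped nor in the safelist is still removed by the clean step.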

The associated passing JUnit test:

@Test
public void testHtmlEscaping() throws Exception {
    String source = "This is <b>REALLY</b> dirty code from <a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
    String expected = "This is &lt;b&gt;REALLY&lt;/b&gt; dirty code from \n<a href=\"www.rubbish.url.zzzz\">haxors-r-us</a>";
    String transformed = transformer.escapeHtml(source);
    assertEquals(expected, transformed);
}

Note that I added a line break ("\n") before the "a" tag in the test's expected string because jsoup pretty-prints its output.
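If the extra line break is unwanted, one option is to switch off pretty-printing for the clean step. The sketch below is not part of the original answer: the class name CompactClean and the method name cleanCompact are hypothetical, and it assumes a recent jsoup release that provides Safelist and the Jsoup.clean overload accepting Document.OutputSettings.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

// Hypothetical helper: the class and method names are illustrative only.
class CompactClean {

    static String cleanCompact(String html) {
        Safelist safelist = new Safelist()
                .addTags("a")
                .addAttributes("a", "href", "name", "rel", "target");

        // Disable pretty-printing so jsoup does not add formatting newlines.
        Document.OutputSettings settings =
                new Document.OutputSettings().prettyPrint(false);

        // The empty string is the base URI; no link resolution is needed here.
        return Jsoup.clean(html, "", safelist, settings);
    }
}
```

With pretty-printing disabled, the expected string in a test would not need the added "\n".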
