Jsoup remove ONLY html tags

Question

What is the proper way to remove ONLY HTML tags (preserving all custom/unknown tags) with jsoup (NOT regex)?

Expected input:

<html>
  <customTag>
    <div> dsgfdgdgf </div>
  </customTag>
  <123456789/>
  <123>
  <html123/>
</html>

Expected output:

  <customTag>
     dsgfdgdgf
  </customTag>
  <123456789/>
  <123>
  <html123/>

I tried using Cleaner with Whitelist.none(), but it also removes the custom tags.

I also tried:

String str = Jsoup.parse(html).text();

But that removes the custom tags as well.

This answer doesn't work for me, because the number of possible custom tags is unbounded.

Recommended answer

You might want to try something like this:

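import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Tags to strip; any tag not listed here is left in place.
String[] tags = new String[]{"html", "div"};
Document thing = Jsoup.parse("<html><customTag><div>dsgfdgdgf</div></customTag><123456789/><123><html123/></html>");
for (String tag : tags) {
    for (Element elem : thing.getElementsByTag(tag)) {
        // Move the element's children up into its parent, then drop the emptied element.
        elem.parent().insertChildren(elem.siblingIndex(), elem.childNodes());
        elem.remove();
    }
}
// The HTML parser wraps the input in <body>, so print only the body's contents.
System.out.println(thing.getElementsByTag("body").html());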

Please note that <123456789/> and <123> don't conform to the XML standard (an element name cannot start with a digit), so they get escaped in the output. Another downside is that you have to explicitly list every tag you want removed (that is, all HTML tags), and it may be slow; I haven't measured how fast this runs.
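
One possible way around listing every tag by hand is to ask jsoup which names it already knows. The sketch below is only an illustration under a few assumptions: the class and method names (KnownTagStripper, stripKnownHtmlTags) are made up for this example, it assumes your jsoup version exposes the static Tag.isKnownTag(String), and it assumes that jsoup's built-in tag table is an acceptable definition of "HTML tag" for your purposes. It uses Parser.xmlParser() so that the case of customTag is preserved and no head/body wrapper is added, and Node.unwrap(), which does the same "move children up, then remove" step as the loop above.

```java
import java.util.Locale;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;

// Hypothetical helper class; the name is illustrative only.
public class KnownTagStripper {

    // Unwraps every element whose name jsoup recognises as a built-in HTML tag,
    // keeping its children in place. Unknown (custom) elements are left untouched.
    public static String stripKnownHtmlTags(String markup) {
        // Assumption: the XML parser is used so that tag-name case is preserved
        // and the input is not wrapped in <html><head><body>.
        Document doc = Jsoup.parse(markup, "", Parser.xmlParser());

        // getAllElements() returns a snapshot list, so unwrapping while iterating is safe.
        for (Element elem : doc.getAllElements()) {
            String name = elem.tagName().toLowerCase(Locale.ROOT);
            // Skip the document root (it has no parent) and any unknown tag.
            if (elem.parent() != null && Tag.isKnownTag(name)) {
                elem.unwrap(); // move children up into the parent, then remove elem
            }
        }
        return doc.outerHtml();
    }

    public static void main(String[] args) {
        String input = "<html><customTag><div>dsgfdgdgf</div></customTag><html123/></html>";
        System.out.println(stripKnownHtmlTags(input));
    }
}
```

Names that start with a digit, such as <123>, are still read as plain text by the parser, so they would come back escaped here as well.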

MFG MiSt
