Removing HTML tags except a few specific ones from a String in Java


Question

My input is a plain-text string, and the requirement is to remove all HTML tags except a few specific ones, such as:

<p>
<li>
<u>
<li>

If these specific tags carry attributes such as class or id, I want to remove those attributes.

A few examples:

<a href = "#">Link</a>            ->   Link

<p>paragraph</p>                  ->   <p>paragraph</p>

<p class="class1">paragraph</p>   ->   <p>paragraph</p>

I have gone through "Remove HTML tags from a String", but it does not completely answer my question.

Can this be handled by a set of regexes, or could I make use of some library?

Recommended answer

I tried JSoup, and it seems to be able to handle all such cases. Here is example code.

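import org.apache.commons.lang3.StringEscapeUtils; // assumed: Apache Commons Lang 3
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // note: newer jsoup releases rename Whitelist to Safelist

// Keeps only the listed tags; Whitelist.none() starts with no tags or attributes allowed.
public String clean(String unsafe) {
    Whitelist whitelist = Whitelist.none();
    whitelist.addTags("p", "br", "ul");

    // Jsoup.clean escapes leftover text (e.g. "<" becomes "&lt;"), so unescape it afterwards.
    String safe = Jsoup.clean(unsafe, whitelist);
    return StringEscapeUtils.unescapeXml(safe);
}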

For the input string

String unsafe = "<p class='p1'>paragraph</p>< this is not html > <a link='#'>Link</a> <![CDATA[<sender>John Smith</sender>]]>";

I get the following output, which is pretty much what I require.

<p>paragraph</p>< this is not html > Link <sender>John Smith</sender>
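
On the regex part of the question, here is a minimal sketch that uses only java.util.regex. It drops every tag except those on an allow-list and rebuilds the allowed ones without attributes. The class name TagFilter, the method keepAllowedTags, and the allow-list contents are hypothetical, chosen to match the examples in the question. It assumes simple tags with no ">" inside attribute values, and it does not handle comments, CDATA, or script content, so a parser such as JSoup is likely the safer choice for untrusted input.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagFilter {
    // Hypothetical allow-list based on the tags named in the question.
    private static final Set<String> ALLOWED = new HashSet<>(Arrays.asList("p", "li", "u"));

    // Matches an opening or closing tag: optional "/", a tag name, then anything up to ">".
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)\\b[^>]*>");

    static String keepAllowedTags(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            // Allowed tags are rebuilt without attributes; every other tag is dropped.
            String replacement = ALLOWED.contains(name) ? "<" + m.group(1) + name + ">" : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling TagFilter.keepAllowedTags on the three examples from the question gives Link, <p>paragraph</p>, and <p>paragraph</p> respectively. Text such as "< this is not html >" is left alone because a tag name must follow the "<" directly.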
