Removing HTML tags except a few specific ones from a String in Java
Question
My input is a plain-text string, and the requirement is to remove all HTML tags except a few specific tags such as:
<p>
<li>
<u>
<li>
If these specific tags have attributes like class or id, I want to remove these attributes.
A few examples:
<a href = "#">Link</a> -> Link
<p>paragraph</p> -> <p>paragraph</p>
<p class="class1">paragraph</p> -> <p>paragraph</p>
I have gone through "Remove HTML tags from a String", but it does not answer my question completely.
Can this be handled by a set of regexes, or could I make use of some library?
Recommended answer
I tried jsoup, and it seems to be able to handle all such cases. Here is example code.
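import org.apache.commons.lang3.StringEscapeUtils; // assumed: Apache Commons Lang 3
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // renamed Safelist in newer jsoup releases

public String clean(String unsafe) {
    // Allow only these tags; no attributes are whitelisted, so class/id are dropped.
    Whitelist whitelist = Whitelist.none();
    whitelist.addTags("p", "br", "ul");
    String safe = Jsoup.clean(unsafe, whitelist);
    // Jsoup.clean escapes stray < and > as entities; turn them back into characters.
    return StringEscapeUtils.unescapeXml(safe);
}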
For the input string
String unsafe = "<p class='p1'>paragraph</p>< this is not html > <a link='#'>Link</a> <![CDATA[<sender>John Smith</sender>]]>";
I get the following output, which is pretty much what I require.
<p>paragraph</p>< this is not html > Link <sender>John Smith</sender>
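On the regex part of the question: for simple, well-formed input, the same rules can be roughly approximated with java.util.regex alone. The following is only a sketch under stated assumptions, not a replacement for a real HTML parser. The class name TagFilter, the method filter, and the allow-list (p, li, u, taken from the tags named in the question) are hypothetical. It assumes Java 9 or later, and it will misbehave on attribute values that contain ">", on comments, and on script or style content.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagFilter {
    // Hypothetical allow-list, taken from the tags named in the question.
    private static final Set<String> ALLOWED = Set.of("p", "li", "u");

    // An opening or closing tag: "<", optional "/", a tag name, then anything up to ">".
    // No whitespace is accepted right after "<", so "< this is not html >" is left alone.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)\\b[^>]*>");

    public static String filter(String html) {
        Matcher m = TAG.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            // Allowed tags are re-emitted without attributes; every other tag is
            // dropped, while the text between tags is kept untouched.
            String replacement = ALLOWED.contains(name)
                    ? "<" + m.group(1) + name + ">"
                    : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("<a href = \"#\">Link</a>"));
        System.out.println(filter("<p>paragraph</p>"));
        System.out.println(filter("<p class=\"class1\">paragraph</p>"));
    }
}
```

Running main on the three examples from the question prints Link, <p>paragraph</p>, and <p>paragraph</p>. For untrusted input, a parser-based cleaner such as jsoup remains the safer choice, because a regex cannot reliably recognise every form of markup.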