Regex to strip HTML tags
Question
I have this HTML input:
<font size="5"><p>some text</p>
<p> another text</p></font>
I'd like to use a regex to remove the HTML tags so that the output is:
some text
another text
Can anyone suggest how to do this with a regex?
Recommended answer
Instead of a regex, you can use an HTML parser called Jericho HTML Parser.
You can download it from here: http://jericho.htmlparser.net/docs/index.html
Jericho HTML Parser is a Java library that allows analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognized or invalid HTML. It also provides high-level HTML form manipulation functions.
Badly formatted HTML does not interfere with the parsing.
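For input as simple as the sample in the question, a plain regex can also work. The following is only a minimal sketch using the standard `java.util.regex` package. The class name `StripTags` and the method `stripTags` are made up for illustration. It assumes the markup has no `>` inside attribute values, no comments, and no `<script>` or `<style>` blocks, and it does not decode entities such as `&amp;`. For anything beyond that, a parser is the safer choice.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Jericho or any other library.
class StripTags {
    // Naive tag pattern: '<', any run of characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String stripTags(String html) {
        // Remove everything that looks like a tag from the whole input.
        String text = TAG.matcher(html).replaceAll("");

        // Trim each remaining line and drop lines that are now empty.
        StringBuilder out = new StringBuilder();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(trimmed);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<font size=\"5\"><p>some text</p>\n<p> another text</p></font>";
        System.out.println(stripTags(html));
    }
}
```

With the input from the question, this should print the two lines `some text` and `another text`.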