Regular expression to remove HTML tags from a string


Question



Possible Duplicate:
Regular expression to remove HTML tags

Is there an expression which will get the value between two HTML tags?

Given this:

<td class="played">0</td>

I am looking for an expression which will return 0, stripping the <td> tags.

Solution

You should not attempt to parse HTML with regex. HTML is not a regular language, so any regex you come up with will likely fail on some esoteric edge case. Please refer to the seminal answer to this question for specifics. While mostly formatted as a joke, it makes a very good point.


The following examples are Java, but the regex will be similar -- if not identical -- for other languages.


String target = someString.replaceAll("<[^>]*>", "");

This assumes that the non-HTML text does not contain any < or > characters and that the input string is well-formed.
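To make that assumption concrete, here is a minimal sketch that runs the pattern on the string from the question and on one input that breaks it. The class and method names (`StripAllTags`, `stripAll`) are made up for illustration and are not part of the original answer.

```java
final class StripAllTags {
    // Removes anything that looks like a tag:
    // '<', then any run of characters other than '>', then '>'.
    static String stripAll(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // The input from the question: both tags are removed.
        System.out.println(stripAll("<td class=\"played\">0</td>"));   // prints: 0

        // A '>' inside an attribute value ends the match too early,
        // so part of the tag is left behind in the output.
        System.out.println(stripAll("<td title=\"a > b\">0</td>"));    // prints:  b">0
    }
}
```

The second call is the kind of edge case the warning above is about: the regex has no notion of quoted attribute values, so it treats the first `>` it sees as the end of the tag.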

If you know they're a specific tag -- for example you know the text contains only <td> tags, you could do something like this:

String target = someString.replaceAll("(?i)</?td\\b[^>]*>", "");
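A tag-specific pattern has to match the closing tag as well as the opening one, which means allowing an optional `/` after the `<`. The sketch below uses `</?td\b[^>]*>` for that; the `\b` is an extra guard so that a longer tag name beginning with "td" is not matched. `StripTdTags` and `stripTd` are hypothetical names used only for this example.

```java
final class StripTdTags {
    // (?i)   case-insensitive, so <TD> is matched too
    // </?td  an opening or closing td tag
    // \b     the tag name must end here
    // [^>]*> any attributes, up to the closing '>'
    static String stripTd(String html) {
        return html.replaceAll("(?i)</?td\\b[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripTd("<td class=\"played\">0</td>"));  // prints: 0
        System.out.println(stripTd("<TD>0</TD>"));                   // prints: 0
        // Tags other than td are left alone.
        System.out.println(stripTd("<td><b>0</b></td>"));            // prints: <b>0</b>
    }
}
```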

Edit: Ωmega brought up a good point in a comment on another post that this would result in multiple results all being squished together if there were multiple tags.

For example, if the input string were <td>Something</td><td>Another Thing</td>, then the above would result in SomethingAnother Thing.

In a situation where multiple tags are expected, we could do something like:

String target = someString.replaceAll("(?i)</?td\\b[^>]*>", " ").replaceAll("\\s+", " ").trim();

This replaces each tag with a single space, then collapses runs of whitespace into one space, and then trims any whitespace on the ends.
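As a rough illustration of the multi-tag case, the sketch below shows the replace-and-collapse chain next to an alternative that was not in the original answer: pulling each cell out with a capturing group, so the values stay separate instead of being joined into one string. The names `TdCells`, `joinWithSpaces` and `cells` are invented for this example. The capturing pattern assumes that `<td>` elements are not nested and that every cell has a closing tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TdCells {
    // Matches an opening or closing td tag, case-insensitively.
    private static final Pattern TD_TAG =
            Pattern.compile("(?i)</?td\\b[^>]*>");

    // Captures the text between <td ...> and </td>. The 's' flag lets '.'
    // match newlines; '.*?' is reluctant, so it stops at the first </td>.
    private static final Pattern TD_CELL =
            Pattern.compile("(?is)<td\\b[^>]*>(.*?)</td\\s*>");

    // Replace each tag with a space, collapse whitespace, trim the ends.
    static String joinWithSpaces(String html) {
        return TD_TAG.matcher(html).replaceAll(" ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Collect the content of each cell as a separate list entry.
    static List<String> cells(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = TD_CELL.matcher(html);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<td>Something</td><td>Another Thing</td>";
        System.out.println(joinWithSpaces(html)); // prints: Something Another Thing
        System.out.println(cells(html));          // prints: [Something, Another Thing]
    }
}
```

The list version keeps the boundary between cells, which the joined string loses when a cell itself contains spaces. Both are still subject to the caveat at the top of this answer.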
