Regular expression to extract text from HTML

Question

I would like to extract from a general HTML page, all the text (displayed or not).

I would like to remove

  • any HTML tags
  • Any javascript
  • Any CSS styles

Is there a regular expression (one or more) that will achieve that?

Solution

You can't really parse HTML with regular expressions. It's too complex. Regular expressions won't handle <![CDATA[ sections correctly at all. Further, some common HTML constructs, like &lt;text>, work in a browser as proper text but might baffle a naive regular expression.
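To make the failure modes concrete, here is a minimal sketch of the kind of naive pattern people usually try first. The pattern and the `naive_strip` helper are illustrative assumptions, not something from the question; they only show how quickly a tag-stripping regular expression goes wrong.

```python
import re

# Hypothetical naive pattern: treat everything between '<' and the next '>' as a tag.
NAIVE_TAG = re.compile(r"<[^>]*>")

def naive_strip(html: str) -> str:
    """Remove anything that looks like a tag; nothing else is understood."""
    return NAIVE_TAG.sub("", html)

# Only the <script> tags are removed, so the JavaScript itself leaks into the text.
script_sample = "<p>Hello</p><script>var x = 1;</script>"
print(naive_strip(script_sample))  # Hellovar x = 1;

# A '>' inside a quoted attribute value ends the match too early,
# so part of the tag is left behind as if it were text.
attr_sample = '<a title="a > b">link</a>'
print(naive_strip(attr_sample))    #  b">link
```

Each of these cases can be patched with a more elaborate pattern, but every patch tends to expose another case.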

You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.
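As a rough illustration of the parser-based approach, the sketch below uses `html.parser.HTMLParser` from the Python standard library rather than Beautiful Soup, so that it runs without installing anything. The `TextExtractor` class and `extract_text` function are hypothetical names introduced here, and joining text chunks with a single space is a simplifying assumption.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes while skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; before handle_data.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

sample = """<html><head><style>p { color: red; }</style>
<script>if (1 < 2) { alert("hi"); }</script></head>
<body><p>Hello <b>world</b></p><p title="a > b">1 &lt; 2</p></body></html>"""

print(extract_text(sample))  # Hello world 1 < 2
```

The parser deals with the quoted `>` in the attribute, the entity, and the script and style contents without any special-case patterns. Beautiful Soup offers a higher-level interface for the same job and is generally more forgiving of badly broken markup.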


Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.

You might be able to parse bad HTML with regular expressions. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
