Regex for links in HTML text


Question

I hope this question is not an RTFM one. I am trying to write a Python script that extracts links from a standard HTML webpage (the <link href... tags). I have searched the web for matching regexes and found many different patterns. Is there any agreed, standard regex to match links?

Adam

UPDATE: I am actually looking for two different answers:

  1. What's the library solution for parsing HTML links? Beautiful Soup seems to be a good one (thanks, Igal Serban and cletus!)
  2. Can a link be defined using a regex?

Answer

As others have suggested, if real-time-like performance isn't necessary, BeautifulSoup is a good solution:

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the page and hand it to BeautifulSoup.
# (Python 2 / BeautifulSoup 3 API; under Python 3 the equivalents are
# urllib.request and the bs4 package.)
html = urllib2.urlopen("http://www.google.com").read()
soup = BeautifulSoup(html)
all_links = soup.findAll("a")  # every <a> tag in the document
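If pulling in BeautifulSoup isn't an option, the standard library can do the same job; here is a minimal Python 3 sketch using html.parser (the LinkExtractor class is ours for illustration, not a library API):

```python
from html.parser import HTMLParser  # Python 3 standard library


class LinkExtractor(HTMLParser):
    """Collect href values from <a> and <link> start tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "link"):
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


parser = LinkExtractor()
parser.feed('<html><body><a href="http://example.com">hi</a></body></html>')
print(parser.links)  # ['http://example.com']
```

Like BeautifulSoup, html.parser is event-driven and fairly tolerant of sloppy markup, though it applies fewer repair heuristics.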

As for the second question, yes, HTML links ought to be well-defined, but the HTML you actually encounter is very unlikely to be standard. The beauty of BeautifulSoup is that it uses browser-like heuristics to try to parse the non-standard, malformed HTML that you are likely to actually come across.

If you are certain you will be working with standard XHTML, you can use a (much) faster XML parser such as expat.
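As a sketch of that approach, Python exposes expat in the standard library as xml.parsers.expat. Note the trade-off: this assumes the input really is well-formed, and expat raises ExpatError on the malformed HTML that BeautifulSoup would tolerate:

```python
import xml.parsers.expat

links = []


def start_element(name, attrs):
    # attrs is a dict mapping attribute name -> value
    if name in ("a", "link") and "href" in attrs:
        links.append(attrs["href"])


p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element
# Well-formed XHTML fragment; the final True marks end of input.
p.Parse('<html><body><a href="http://example.com">x</a></body></html>', True)
print(links)  # ['http://example.com']
```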

For the reasons above (a parser must maintain state, and a regex cannot do that), a regex will never be a general solution.
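To see the limitation concretely, here is a deliberately naive sketch (the pattern is ours, purely for illustration): it handles the simple case but happily "finds" a link inside an HTML comment, exactly the kind of context a stateful parser tracks and a regex cannot:

```python
import re

# Naive pattern: grab href="..." values from <a> and <link> tags.
# It cannot track comments, CDATA, or nesting, so it misfires on real HTML.
HREF_RE = re.compile(
    r'<(?:a|link)\b[^>]*\bhref\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)

html = '<a href="http://example.com">x</a> <!-- <a href="http://hidden"> -->'
print(HREF_RE.findall(html))
# -> ['http://example.com', 'http://hidden'] -- the commented-out link leaks in
```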
