Regex for links in HTML text


Problem description



I hope this question is not an RTFM one. I am trying to write a Python script that extracts links from a standard HTML webpage (the <link href... tags). I have searched the web for matching regexen and found many different patterns. Is there any agreed-upon, standard regex to match links?

Adam

UPDATE: I am actually looking for two different answers:

1. What's the library solution for parsing HTML links? Beautiful Soup seems to be a good solution (thanks, Igal Serban and cletus!)
2. Can a link be defined using a regex?

Solution

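As others have suggested, if real-time-like performance isn't necessary, BeautifulSoup is a good solution:

    # Requires Python 3 and the beautifulsoup4 package (imported as bs4)
    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    # Download the page and parse it with the lenient built-in HTML parser
    html = urlopen("http://www.google.com").read()
    soup = BeautifulSoup(html, "html.parser")

    # Every <a> tag on the page; read each URL with tag.get("href")
    all_links = soup.find_all("a")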
    

As for the second question, yes, HTML links ought to be well-defined, but the HTML you actually encounter is very unlikely to be standard. The beauty of BeautifulSoup is that it uses browser-like heuristics to try to parse the non-standard, malformed HTML that you are likely to actually come across.
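To make the point about malformed markup concrete, here is a minimal sketch that pulls `href` values out of deliberately sloppy HTML. It uses the standard library's `html.parser` rather than BeautifulSoup, so it does not apply BeautifulSoup's tree-repair heuristics; it only shows that an event-based parser can tolerate unclosed tags, unquoted attributes, and mixed case. The names `LinkCollector`, `extract_links`, and `messy_html` are made up for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> and <link> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser lowercases
        # tag and attribute names before calling this method.
        if tag in ("a", "link"):
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all href values found in the given HTML text, in order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Deliberately sloppy markup: unclosed <p>, <b> and <a>, unquoted
# attribute values, and an uppercase tag.
messy_html = """
<html><head><link rel=stylesheet href=style.css></head>
<body><p>First <b>bold <A HREF="http://example.com/a">one</a>
<p>Second <a href='/relative/b'>two
</body>
"""

print(extract_links(messy_html))
```

The collector never needs the document to be balanced, because it reacts to each start tag as it is seen instead of building a tree.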

If you are certain to be working on standard XHTML, you can use (much) faster XML parsers like expat.
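As a rough sketch of the XHTML route, the following uses Python's built-in `xml.parsers.expat` binding. It assumes the input really is well-formed XML; the function name `extract_links_xhtml` and the sample document are invented for this example, and no speed measurement is implied.

```python
import xml.parsers.expat


def extract_links_xhtml(xhtml):
    """Return href values of <a> and <link> elements in well-formed XHTML."""
    links = []

    def start_element(name, attrs):
        # expat reports names exactly as written; XHTML requires lowercase.
        if name in ("a", "link") and "href" in attrs:
            links.append(attrs["href"])

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xhtml, True)  # True marks this as the final chunk of input
    return links


xhtml_doc = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>t</title><link rel="stylesheet" href="style.css" /></head>
<body><p><a href="http://example.com/a">one</a></p></body>
</html>"""

print(extract_links_xhtml(xhtml_doc))

# Malformed input is rejected outright rather than repaired.
try:
    extract_links_xhtml("<p><a href='x'>unclosed</p>")
except xml.parsers.expat.ExpatError as err:
    print("not well-formed:", err)
```

The trade-off is visible in the second call: a strict XML parser stops at the first well-formedness error, which is why this approach only suits input you control or have validated.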

Regex will never be a general solution, for the reasons above: an HTML parser must maintain state (for example, whether it is currently inside a comment or a quoted attribute value), and a regex can't do that.
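A small comparison illustrates the state problem. The pattern below is one plausible naive regex, chosen for this illustration and not taken from the question; it is run against the same input as a standard-library `html.parser` subclass. `NAIVE_HREF`, `AnchorCollector`, and `sample` are hypothetical names.

```python
import re
from html.parser import HTMLParser

# A typical naive pattern (an illustration, not a recommended one).
NAIVE_HREF = re.compile(r'<a\s+[^>]*href="([^"]*)"', re.IGNORECASE)


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


sample = """
<!-- <a href="http://example.com/commented-out">old</a> -->
<a title="a > b" href="http://example.com/real">real</a>
<a href='http://example.com/single-quoted'>quoted</a>
"""

# The regex has no notion of "inside a comment" or "inside a quoted value".
regex_links = NAIVE_HREF.findall(sample)

# The parser tracks that state while it scans the input.
parser = AnchorCollector()
parser.feed(sample)
parser.close()

print("regex :", regex_links)
print("parser:", parser.links)
```

On this input the regex reports only the commented-out link and misses both real ones: the `>` inside the `title` value ends its `[^>]*` scan early, and the third link uses single quotes. The parser skips the comment and returns the two live links. Each of these cases can be patched in the pattern, but every patch handles one more piece of parser state by hand.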
