Getting the value of href attributes in all <a> tags on a html file with Python

Views: 115. Published: 2018/6/15 9:19:16. Tags: python, html, regex, parsing

Problem description

I'm building an app in Python, and I need to get the URLs of all links in a web page. I already have a function that uses urllib to download the HTML file from the web and turn it into a list of strings with readlines().

Currently I have this code that uses regex (I'm not very good at it) to search for links in every line:

    for line in lines:
        result = re.match('/href="(.*)"/iU', line)
        print result

This is not working: it only prints "None" for every line in the file, but I'm sure there are at least 3 links in the file I'm opening.

Can someone give me a hint on this?

Thanks in advance

Solution

Well, just for completeness, I will add here what I found to be the best answer. I found it in the book Dive Into Python, by Mark Pilgrim.

Here follows the code to list all URLs from a web page:
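
    # urllister.py (Python 2 only: sgmllib was removed in Python 3)
    from sgmllib import SGMLParser

    class URLLister(SGMLParser):
        def reset(self):
            # Called by SGMLParser.__init__ and whenever the parser is reset.
            SGMLParser.reset(self)
            self.urls = []

        def start_a(self, attrs):
            # Called for every <a> tag; attrs is a list of (name, value) pairs.
            href = [v for k, v in attrs if k == 'href']
            if href:
                self.urls.extend(href)

    # Usage, assuming the class above is saved as urllister.py
    import urllib, urllister

    usock = urllib.urlopen("http://diveintopython.net/")
    parser = urllister.URLLister()
    parser.feed(usock.read())
    usock.close()
    parser.close()
    for url in parser.urls:
        print url
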
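The listing above relies on sgmllib, which exists only in Python 2. For readers on Python 3, a rough equivalent can be sketched with the standard library's html.parser module. This is not the code from the book, only a minimal adaptation of the same idea. The helper name list_urls and the sample markup are made up for illustration, and the input is a hard-coded string so the example runs without network access.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; html.parser lowercases
        # tag and attribute names, and value is None for bare attributes.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v is not None)


def list_urls(html_text):
    # Hypothetical helper: parse a string of HTML and return all hrefs.
    parser = URLLister()
    parser.feed(html_text)
    parser.close()
    return parser.urls


# Illustrative markup only; a real page would be downloaded first.
sample = (
    '<p><a href="http://diveintopython.net/">Dive</a> '
    '<A HREF="/toc/index.html">TOC</A> '
    '<a name="x">no link</a></p>'
)

for url in list_urls(sample):
    print(url)
```

To run this against a live page, the downloaded bytes (for example from urllib.request.urlopen) would need to be decoded to str before being passed to feed().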
Thanks for all the replies.
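
As a footnote on the original regex attempt, two things probably explain the None results. First, re.match only matches at the very beginning of a string, so it fails on any line that does not start with the pattern. Second, the surrounding slashes and the trailing iU look like PHP-style delimiters and flags; in Python they are treated as literal characters of the pattern. The sketch below shows one way the line-by-line search could be written, assuming href values are always double-quoted. The sample lines are invented for illustration, and a real HTML parser remains the more robust choice.

```python
import re

# Case-insensitive, non-greedy match of a double-quoted href value.
HREF_RE = re.compile(r'href="(.*?)"', re.IGNORECASE)

# Invented sample input standing in for the result of readlines().
lines = [
    '<p>No links here</p>',
    '<a href="http://example.com/a">A</a> and <A HREF="/b">B</A>',
]

for line in lines:
    # findall scans the whole line and returns every captured group.
    print(HREF_RE.findall(line))
```

The non-greedy (.*?) stops at the first closing quote, which is what lets two links on the same line be captured separately.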
