Getting the value of href attributes in all <a> tags in an HTML file with Python
Problem description
I'm building an app in Python, and I need to get the URLs of all links in a web page. I already have a function that uses urllib to download the HTML file from the web and turn it into a list of strings with readlines().
Currently I have this code, which uses a regex (I'm not very good at them) to search for links in every line:
for line in lines:
    result = re.match('/href="(.*)"/iU', line)
    print result
This is not working: it only prints "None" for every line in the file, but I'm sure there are at least 3 links in the file I'm opening.
Can someone give me a hint on this?
Thanks in advance
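For what it's worth, the pattern above seems to return None for two reasons. First, re.match only matches at the start of the string, so an href in the middle of a line is never found. Second, the leading "/" and trailing "/iU" are PHP/Perl-style delimiters and flags; Python treats them as literal characters in the pattern. A minimal sketch of a corrected version follows, written with Python 3 print syntax. The sample lines and the names HREF_RE and find_hrefs are made up for illustration, and it only handles double-quoted href values, so it is a quick fix rather than a robust HTML parser.

```python
import re

# Hypothetical sample input standing in for the list returned by readlines().
lines = [
    '<p>No links here</p>\n',
    '<a href="http://example.com/one">one</a> <A HREF="/two">two</A>\n',
]

# (.*?) is non-greedy, so it stops at the first closing quote (the role the
# "U" flag plays in PHP); re.IGNORECASE replaces the "i" flag.
HREF_RE = re.compile(r'href="(.*?)"', re.IGNORECASE)


def find_hrefs(lines):
    """Return every double-quoted href value found in the given lines."""
    urls = []
    for line in lines:
        # findall scans the whole line and returns all captured groups.
        urls.extend(HREF_RE.findall(line))
    return urls


print(find_hrefs(lines))
```

re.findall is used instead of re.match because it scans the whole line and returns every match, so several links on one line are all picked up.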
Well, just for completeness, I will add here what I found to be the best answer. I found it in the book Dive Into Python, by Mark Pilgrim.
Here is the code to list all URLs from a web page:
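# urllister.py (Python 2 only: sgmllib was removed in Python 3)
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    # Called by SGMLParser for every opening <a> tag.
    def start_a(self, attrs):
        # attrs is a list of (name, value) pairs; keep only the href values.
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)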
import urllib, urllister

usock = urllib.urlopen("http://diveintopython.net/")
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
for url in parser.urls:
    print url
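Note that the code above is Python 2. On Python 3, where sgmllib no longer exists, roughly the same idea can be sketched with the standard library's html.parser module. This is not from the book; the names LinkCollector, extract_links, and sample are made up for illustration, and the sample HTML string stands in for a downloaded page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                # value is None for attributes written without a value.
                if name == "href" and value is not None:
                    self.urls.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.urls


# Hypothetical sample page used instead of a real download.
sample = (
    '<p><a href="http://example.com/">home</a> '
    '<A HREF="/about">about</A> <a name="x">anchor</a></p>'
)
print(extract_links(sample))
```

To run this against a real page, the text passed to extract_links would be the decoded body of the response, for example from urllib.request.urlopen in Python 3.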
Thanks for all the replies.