Getting the value of href attributes in all <a> tags in an HTML file with Python


Problem description

I'm building an app in Python, and I need to get the URLs of all links in a web page. I already have a function that uses urllib to download the HTML file from the web and converts it to a list of strings with readlines().

Currently I have this code that uses regex (I'm not very good at it) to search for links in every line:

for line in lines:
    result = re.match ('/href="(.*)"/iU', line)
    print result

This is not working: it prints "None" for every line in the file, even though I'm sure there are at least 3 links in the file I'm opening.

Can someone give me a hint on this?

Thanks in advance

Solution
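
As a side note on the regex in the question: in Python, re.match only tries to match at the very start of the string, and the leading "/" and trailing "/iU" are treated as literal characters rather than as delimiters and flags, so the pattern never matches a normal line of HTML. The sketch below shows one way the line-by-line search could be written in Python 3. The sample lines and the name HREF_RE are made up for illustration, and the pattern only handles double-quoted href values that sit on a single line.

```python
import re

# Hypothetical sample input standing in for the list returned by readlines().
lines = [
    '<p>No links here</p>\n',
    '<a href="http://example.com/a">A</a> <a HREF="/b">B</a>\n',
]

# Non-greedy group so each href stops at its own closing quote.
# Flags are passed as arguments, not written as /.../i inside the pattern.
HREF_RE = re.compile(r'href="(.*?)"', re.IGNORECASE)

urls = []
for line in lines:
    # findall scans the whole line; re.match would only test position 0.
    urls.extend(HREF_RE.findall(line))

print(urls)  # ['http://example.com/a', '/b']
```

A regex like this is fragile on real pages (single-quoted or unquoted attributes, tags split across lines), which is why a parser-based approach is usually preferred.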

Well, just for completeness I will add here what I found to be the best answer, which comes from the book Dive Into Python by Mark Pilgrim.

Here is the code to list all URLs from a web page:

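# urllister.py (Python 2 only: sgmllib was removed in Python 3)
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        # Called by SGMLParser.__init__, so urls exists before any feed().
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # Called for each <a> start tag; attrs is a list of (name, value) pairs.
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)

# Separate script, with the class above saved as urllister.py
import urllib, urllister
usock = urllib.urlopen("http://diveintopython.net/")
parser = urllister.URLLister()
parser.feed(usock.read())
usock.close()
parser.close()
for url in parser.urls: print url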
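
The code above depends on sgmllib, which no longer exists in Python 3. A rough Python 3 equivalent can be sketched with html.parser.HTMLParser from the standard library. This is an adaptation, not the book's code: the helper names list_urls and fetch_urls are hypothetical, and decoding the page as UTF-8 is an assumption that may not hold for every site.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class URLLister(HTMLParser):
    """Collect the href value of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v is not None)


def list_urls(html):
    """Return all href values found in <a> tags of an HTML string."""
    parser = URLLister()
    parser.feed(html)
    parser.close()
    return parser.urls


def fetch_urls(url):
    """Download a page and list its links (assumes the page is UTF-8)."""
    with urlopen(url) as usock:
        page = usock.read().decode("utf-8", errors="replace")
    return list_urls(page)


# Offline demonstration on a small, made-up HTML snippet.
sample = ('<p><a href="http://diveintopython.net/">Dive</a> '
          '<A HREF="/toc.html">TOC</A> <a name="x">no href</a></p>')
print(list_urls(sample))  # ['http://diveintopython.net/', '/toc.html']
```

Calling fetch_urls("http://diveintopython.net/") would mirror the original script, provided the site is reachable.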

Thanks for all the replies.
