Extract link from url using Beautifulsoup

Problem description

I am trying to get the web link of the following, using beautifulsoup

<div class="alignright single">
<a href="http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-women-should-lower-their-garments-to-cover-their-feet/" rel="next">Hadith on Clothing: Women should lower their garments to cover their feet</a> &raquo;    </div>
</div>

My code is as follows:

from bs4 import BeautifulSoup                                                                                                                                 
import urllib2                                                                                                
url1 = "http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-the-lower-garment-should-be-hallway-between-the-shins/"

content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1) 

nextlink = soup.findAll("div", {"class" : "alignright single"})
a = nextlink.find('a')
print a.get('href')

I get the following error, please help

a = nextlink.find('a')
AttributeError: 'ResultSet' object has no attribute 'find'
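
The error happens because findAll() returns a ResultSet, which behaves like a list of matching tags and has no .find() method of its own. A minimal sketch of one way around it, assuming the soup object from the code above, is to index into the result first:

# Sketch only: reuses the soup object built in the question's code.
divs = soup.findAll("div", {"class" : "alignright single"})
if divs:                       # the ResultSet may be empty
    a = divs[0].find("a")      # .find() works on an individual Tag
    if a is not None:
        print a.get("href")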

Recommended answer

Use .find() if you want to find just one match:

nextlink = soup.find("div", {"class" : "alignright single"})

or loop over all matches:

for nextlink in soup.findAll("div", {"class" : "alignright single"}):
    a = nextlink.find('a')
    print a.get('href')

The latter part can also be expressed as:

a = nextlink.find('a', href=True)
print a['href']

where the href=True part only matches elements that have a href attribute, which means that you won't have to use a.get() because the attribute will be there (alternatively, no <a href="..."> link is found and a will be None).
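
A minimal sketch of that guard (assuming the same soup object as above; this variant is not in the original answer):

nextlink = soup.find("div", {"class" : "alignright single"})
a = nextlink.find("a", href=True) if nextlink is not None else None
if a is not None:
    print a["href"]            # safe: href is guaranteed to exist here
else:
    print "no matching link found"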

For the given URL in your question, there is only one such link, so .find() is probably most convenient. It may even be possible to just use:

nextlink = soup.find('a', rel='next', href=True)
if nextlink is not None:
    print nextlink['href']

with no need to find the surrounding div. The rel="next" attribute seems specific enough for your needs.

As an extra tip: make use of the response headers to tell BeautifulSoup what encoding to use for a page; the urllib2 response object can tell you what, if any, character set the server thinks the HTML page is encoded in:

response = urllib2.urlopen(url1)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))

Quick demo of all the parts:

>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> response = urllib2.urlopen('http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-the-lower-garment-should-be-hallway-between-the-shins/')
>>> soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
>>> soup.find('a', rel='next', href=True)['href']
u'http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-women-should-lower-their-garments-to-cover-their-feet/'
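
As a side note not covered in the original answer, urllib2 only exists on Python 2; a rough Python 3 sketch of the same approach (using urllib.request and get_content_charset(), which returns None when the server sends no charset) might look like this:

# Hypothetical Python 3 port of the same approach.
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-the-lower-garment-should-be-hallway-between-the-shins/"
response = urlopen(url)
soup = BeautifulSoup(response.read(), "html.parser",
                     from_encoding=response.headers.get_content_charset())

nextlink = soup.find("a", rel="next", href=True)
if nextlink is not None:
    print(nextlink["href"])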
