Why does this extraction work fine on the example, but not on a real URL?


Problem description

I'm trying to extract the href values of the <a> tags inside <td class="DataZone">. It works on the example below:

from bs4 import BeautifulSoup

text = '''
<td class="DataZone"><div id="Content_CA_DI_0_DataZone">
<div style="font:bold 8pt 'Courier New';letter-spacing:-1px">
<a href="Browse-A">A</a> <a href="Browse-B">B</a> <a href="Browse-C">C</a> <a href="Browse-D">D</a> 
</div>
</div></td>
'''

soup = BeautifulSoup(text, 'html.parser')

[tag.attrs['href'] for tag in soup.select('td.DataZone a')]

The result is ['Browse-A', 'Browse-B', 'Browse-C', 'Browse-D']. When I apply the same code to a real URL, it unfortunately does not work:

import requests
from bs4 import BeautifulSoup

session = requests.Session()

url = 'https://www.thefreedictionary.com'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
r = session.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

[tag.attrs['href'] for tag in soup.select('td.DataZone a')]

It returns an error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-12-0a06dde2d97b> in <module>
      4 soup = BeautifulSoup(r.content, 'html.parser')
      5 
----> 6 [tag.attrs['href'] for tag in soup.select('td.DataZone a')]

<ipython-input-12-0a06dde2d97b> in <listcomp>(.0)
      4 soup = BeautifulSoup(r.content, 'html.parser')
      5 
----> 6 [tag.attrs['href'] for tag in soup.select('td.DataZone a')]

KeyError: 'href'

The page source at that URL looks similar to the example.

Could you please explain why this error occurs?

Update: It's odd to me that [x['href'] for x in soup.select('td.DataZone a[href^=Browse]')] works fine, but [x['href'] for x in soup.select('td.DataZone a')] does not. Please elaborate on this too.

Recommended answer

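You're getting the error because the real page contains many td.DataZone cells, and inside one of them there's <a>Google+</a>, which has no href attribute. The selector td.DataZone a matches that tag too, so tag.attrs['href'] raises KeyError: 'href' when the list comprehension reaches it. This also explains the update: a[href^=Browse] only matches <a> tags that actually have an href attribute (one starting with Browse), so the tag without href is never selected.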
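The failure can be reproduced offline. The snippet below is only a minimal sketch: the HTML is made up for illustration (it is not the real page source) and assumes, as described above, that one td.DataZone cell holds a link without href.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration: two DataZone cells, one link without href.
html = '''
<table><tr>
<td class="DataZone"><a href="Browse-A">A</a> <a href="Browse-B">B</a></td>
<td class="DataZone"><a>Google+</a></td>
</tr></table>
'''

soup = BeautifulSoup(html, 'html.parser')
links = soup.select('td.DataZone a')

# Tags that matched the selector but carry no href attribute.
missing = [tag for tag in links if not tag.has_attr('href')]

try:
    hrefs = [tag.attrs['href'] for tag in links]
    error = None
except KeyError as exc:
    # Raised as soon as the comprehension reaches the tag without href.
    hrefs = None
    error = exc

print(len(links), [tag.get_text() for tag in missing], repr(error))
```

Listing the tags for which has_attr('href') is false is a quick way to find the offending element on a real page.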

You can use the selector td.DataZone a[href] to select only the <a> tags that have an href attribute:

print( [tag.attrs['href'] for tag in soup.select('td.DataZone a[href]')] )
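As a rough comparison, filtering in the selector is not the only option: tag.get('href') returns None instead of raising KeyError when the attribute is missing. The sketch below uses made-up HTML (an assumption, not the real page) to show both approaches side by side.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration: one of the three links has no href.
html = '''
<td class="DataZone">
<a href="Browse-A">A</a> <a>Google+</a> <a href="Browse-B">B</a>
</td>
'''
soup = BeautifulSoup(html, 'html.parser')

# Option 1: let the CSS selector skip anchors without an href attribute.
by_selector = [tag['href'] for tag in soup.select('td.DataZone a[href]')]

# Option 2: select every anchor and use .get(), which yields None
# for a missing attribute instead of raising KeyError.
by_get = [tag.get('href') for tag in soup.select('td.DataZone a')]
by_get_filtered = [href for href in by_get if href is not None]

print(by_selector)      # only links that have href
print(by_get)           # includes None for the link without href
print(by_get_filtered)  # same links as by_selector
```

The selector version keeps the filtering in one place; the .get() version can be handy when you also want to know which links lacked the attribute.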
