BeautifulSoup and SoupStrainer for getting links don't work with hasattr, always returning True

Problem description

I am using BeautifulSoup4 and SoupStrainer with Python 3.3 to get all links from a webpage. The following is the important code snippet:

import requests
from bs4 import BeautifulSoup, SoupStrainer

r = requests.get(adress, headers=headers)
for link in BeautifulSoup(r.text, parse_only=SoupStrainer('a')):
    if hasattr(link, 'href'):

I tested some webpages and it works very well, but today, when using

adress = 'http://www.goldentigercasino.de/'

I noticed that hasattr(link, 'href') always returns True, even when there is no such 'href' attribute, as in the goldentigercasino.de example. Because of that I run into trouble later when using link['href'], because it simply isn't there.

I also tried a workaround like this:

test = requests.get('http://www.goldentigercasino.de/')
for link in BeautifulSoup(test.text, parse_only=SoupStrainer('a',{'href': not None})):

That works as wanted, except that it also returns the doctype:

HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"

which also causes trouble, for the same reasons as above.

My question: why does hasattr always return True, and how can I fix that? And if it isn't possible with hasattr, how can I fix my workaround so that it doesn't return the DOCTYPE?

Many thanks and best regards!

Recommended answer

hasattr() is the wrong test; it checks whether there is an a.href attribute, and BeautifulSoup dynamically turns attribute access into a search for child tags. HTML tag attributes are not translated into Python attributes.
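
A minimal sketch, assuming only bs4 is installed, shows why the test always succeeds: attribute access on a tag falls back to a search for a child tag, which returns None instead of raising AttributeError, so hasattr() never sees a failure:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a>no href here</a>', 'html.parser')
link = soup.a

print(link.href)              # None: searched for a child <href> tag and found nothing
print(hasattr(link, 'href'))  # True: the lookup returned None instead of raising
print('href' in link.attrs)   # False: the real HTML attributes live in the .attrs dict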

Use dictionary-style testing instead; you loop over all elements, which can include the Doctype instance, so I use getattr() to avoid breaking on objects that don't have attributes:

if 'href' in getattr(link, 'attrs', {}):

You can also instruct SoupStrainer to match only a tags with a href attribute by passing href=True as a keyword-argument filter (not None simply means True in any case):

for link in BeautifulSoup(test.text, parse_only=SoupStrainer('a', href=True)):

This still includes the HTML declaration, of course; search for just the a links:

soup = BeautifulSoup(test.text, parse_only=SoupStrainer('a', href=True))
for link in soup.find_all('a'):
    print(link)
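
Putting it together, here is a minimal end-to-end sketch under the same assumptions (requests and bs4 installed; the html.parser builder is named explicitly here) that strains only a tags carrying a href, so link['href'] is always safe to read:

import requests
from bs4 import BeautifulSoup, SoupStrainer

adress = 'http://www.goldentigercasino.de/'
r = requests.get(adress)

# Only parse <a> tags that actually carry a href attribute.
soup = BeautifulSoup(r.text, 'html.parser', parse_only=SoupStrainer('a', href=True))

for link in soup.find_all('a'):
    print(link['href'])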
