Python + BeautifulSoup: How to get the 'href' attribute of an 'a' element?

Question

I have the following:

  html = '''<div class="file-one">
    <a href="/file-one/additional" class="file-link">
      <h3 class="file-name">File One</h3>
    </a>
    <div class="location">
      Down
    </div>
  </div>'''

I would like to get just the value of href, which is /file-one/additional. So I did:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

link_text = ""

for a in soup.find_all('a', href=True, text=True):
    link_text = a['href']

print("Link: " + link_text)

But it prints nothing after the label, just "Link: ". I tested the same code on another site with different HTML, and it worked.

What could I be doing wrong? Or is it possible that the site is intentionally programmed not to return the href?

Thank you in advance; I will be sure to upvote/accept the answer!

Recommended answer

The 'a' tag in your HTML does not contain any text directly; it contains an 'h3' tag that holds the text. As a result, the tag's .string is None, and .find_all() with text=True does not select it. In general, avoid the text parameter when a tag contains other HTML elements besides text content.
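To make the explanation above concrete, here is a minimal sketch that inspects the tag from the question. It uses the `string` keyword, which the Beautiful Soup documentation describes as the newer name for `text`; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

# The <a> tag has several children (whitespace, <h3>, whitespace),
# so .string has no single child to return and is None.
direct_string = a.string

# get_text() joins the text of all descendants instead.
nested_text = a.get_text(strip=True)

# The string/text filter is compared against .string, so it rejects this tag.
matches_with_filter = soup.find_all('a', href=True, string=True)
matches_without_filter = soup.find_all('a', href=True)

print(direct_string)
print(nested_text)
print(len(matches_with_filter), len(matches_without_filter))
```

The tag is found as soon as the text filter is dropped, which is why the fixes in this answer select on the tag name and href only.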

You can resolve this by selecting elements with only the tag's name (and the href keyword argument), then adding a condition in the loop to check whether they contain text.

soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True): 
    if a.text: 
        links_with_text.append(a['href'])

Or you could use a list comprehension, if you prefer one-liners.

links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
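One caveat with the `if a.text` check: whitespace counts as text, so an anchor that contains only a newline still passes. The sketch below, using hypothetical markup that is not from the question, compares that check with a stricter one based on `get_text(strip=True)`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor holds only whitespace,
# the third is completely empty.
sample = '''<a href="/with-text"><h3>File One</h3></a>
<a href="/whitespace-only">
</a>
<a href="/empty"></a>'''

soup = BeautifulSoup(sample, 'html.parser')

# a.text is truthy for any non-empty string, including "\n".
loose = [a['href'] for a in soup.find_all('a', href=True) if a.text]

# Stripping first drops anchors whose only content is whitespace.
strict = [a['href'] for a in soup.find_all('a', href=True)
          if a.get_text(strip=True)]

print(loose)
print(strict)
```

Which check is appropriate depends on whether whitespace-only links should count as having text.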

Or you could pass a lambda to .find_all().

tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)


If you want to collect all links, whether they have text or not, just select all 'a' tags that have an 'href' attribute. Anchor tags usually carry a link, but that is not a requirement, so I think it is best to use the href argument.

Using .find_all().

links = [a['href'] for a in soup.find_all('a', href=True)]

Using .select() with CSS selectors.

links = [a['href'] for a in soup.select('a[href]')]
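As a rough comparison of the two approaches, the sketch below runs both on hypothetical markup that includes an anchor without an href. It also shows `.get('href')`, which returns None rather than raising KeyError when the attribute is missing; the sample markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one named anchor without href, then two real links.
sample = '''<a name="top"></a>
<a href="/file-one/additional">File One</a>
<a href="/file-two/additional">File Two</a>'''

soup = BeautifulSoup(sample, 'html.parser')

# Both filters skip the anchor that has no href attribute.
via_find_all = [a['href'] for a in soup.find_all('a', href=True)]
via_select = [a['href'] for a in soup.select('a[href]')]

# Without the href filter, a['href'] would raise KeyError on the first
# anchor; .get() yields None for it instead.
all_anchors = [a.get('href') for a in soup.find_all('a')]

print(via_find_all)
print(via_select)
print(all_anchors)
```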
