Python + BeautifulSoup: How to get 'href' attribute of 'a' element?

Views: 477 · Posted: 2020/9/20 6:13:37 · Tags: python, html, web-scraping, beautifulsoup


Question

I have the following:

html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
  <div class="location">
    Down
  </div>
</div>'''

And I would like to get just the value of href, which is /file-one/additional. So I did:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

link_text = ""

for a in soup.find_all('a', href=True, text=True):
    link_text = a['href']

print("Link: " + link_text)

But it prints nothing after the label, just Link:. I then tested the same code on another site with different HTML, and it worked.

What could I be doing wrong? Or is it possible that the site is intentionally programmed not to return the href?

Thank you in advance; I will be sure to upvote/accept an answer!

Recommended answer

The 'a' tag in your HTML does not contain any text directly; it contains an 'h3' tag that holds the text. As a result the tag's .string is None, so the text=True filter in .find_all() never selects it. As a rule, avoid the text parameter when a tag contains child elements in addition to text content.
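The behavior described above can be checked directly. The following is a minimal sketch, assuming the same markup as in the question and the built-in 'html.parser'; the variable names are illustrative only. It uses string=True, which is the newer name for the text argument (Beautiful Soup 4.4.0 and later).

```python
from bs4 import BeautifulSoup

# Same structure as the markup in the question.
html = '''<div class="file-one">
  <a href="/file-one/additional" class="file-link">
    <h3 class="file-name">File One</h3>
  </a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

# The anchor has several children (whitespace, <h3>, whitespace),
# so .string cannot pick a single string and is None.
a_string = a.string

# The <h3> has exactly one child, a string, so .string returns it.
h3_string = a.h3.string

# get_text() joins all descendant strings; strip=True trims whitespace.
a_text = a.get_text(strip=True)

# Filtering on the string therefore finds no anchors at all.
matched = soup.find_all('a', href=True, string=True)

print(a_string, h3_string, a_text, matched)
```

This is why the loop in the question never runs and link_text stays empty.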

You can resolve this by selecting elements using only the tag's name (and the href keyword argument), then adding a condition in the loop to check whether they contain text.

soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True): 
    if a.text: 
        links_with_text.append(a['href'])

Or you could use a list comprehension, if you prefer one-liners.

links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

Or you could pass a lambda to .find_all().

tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
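One detail worth knowing about the text check: tag.text is truthy even when it holds only whitespace. The sketch below uses hypothetical sample markup and a hypothetical helper, is_link_with_text, to compare the lambda above with a stricter variant based on get_text(strip=True). Treat it as one possible refinement, not as part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup covering the edge cases.
html = '''
<a href="/file-one/additional"><h3>File One</h3></a>
<a href="/whitespace-only">   </a>
<a href="/empty"></a>
<a name="no-href">Anchor without href</a>
'''
soup = BeautifulSoup(html, 'html.parser')

def is_link_with_text(tag):
    """Match <a> tags that have an href and non-whitespace text."""
    return (
        tag.name == 'a'
        and tag.has_attr('href')
        and bool(tag.get_text(strip=True))
    )

# Stricter variant: whitespace-only anchors are excluded.
strict = [tag['href'] for tag in soup.find_all(is_link_with_text)]

# The lambda from the answer: whitespace-only text still counts as text.
loose = [
    tag['href']
    for tag in soup.find_all(
        lambda tag: tag.name == 'a' and tag.get('href') and tag.text
    )
]

print(strict)
print(loose)
```

Which variant is appropriate depends on whether whitespace-only links should count as having text.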

If you want to collect all links, whether or not they contain text, just select all 'a' tags that have an 'href' attribute. Anchor tags usually have an href, but it is not required, so I think it's best to use the href argument.

Using .find_all().

links = [a['href'] for a in soup.find_all('a', href=True)]

Using .select() with CSS selectors.

links = [a['href'] for a in soup.select('a[href]')]
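To illustrate why the href filter matters, here is a small sketch with hypothetical sample markup that includes an anchor without an href. Both approaches skip that anchor. Without the filter, a['href'] would raise KeyError on it, whereas a.get('href') returns None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor has no href attribute.
html = '''
<a href="/file-one/additional">File One</a>
<a name="top">No href here</a>
<a href="/file-two"></a>
'''
soup = BeautifulSoup(html, 'html.parser')

# Both selections keep only anchors that carry an href attribute.
via_find_all = [a['href'] for a in soup.find_all('a', href=True)]
via_select = [a['href'] for a in soup.select('a[href]')]

# Without the filter, use .get() to avoid KeyError on missing attributes.
all_hrefs = [a.get('href') for a in soup.find_all('a')]

print(via_find_all)
print(via_select)
print(all_hrefs)
```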
