Python + BeautifulSoup: How to get 'href' attribute of 'a' element?
Question
I have the following:
html = '''<div class="file-one">
<a href="/file-one/additional" class="file-link">
<h3 class="file-name">File One</h3>
</a>
<div class="location">
Down
</div>
</div>'''
And I would like to get just the text of href, which is /file-one/additional. So I did:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
link_text = ""
for a in soup.find_all('a', href=True, text=True):
    link_text = a['href']
print("Link: " + link_text)
But it just prints a blank, nothing. Just "Link: ". So I tested it on another site with different HTML, and it worked.
What could I be doing wrong? Or is it possible that the site was intentionally programmed not to return the href?
Thank you in advance, and I will be sure to upvote/accept the answer!
Recommended answer
The 'a' tag in your HTML does not contain any text directly; it contains an 'h3' tag that holds the text. This means the tag's .string is None, so .find_all() with text=True fails to select it. In general, do not use the text parameter if a tag contains any other HTML elements besides text content.
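To see why nothing is matched, the following minimal sketch inspects the tag from the question directly. It assumes a recent Beautiful Soup 4 release, where the filter argument is spelled string (text is the older name for the same argument); the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html = '''<div class="file-one">
<a href="/file-one/additional" class="file-link">
<h3 class="file-name">File One</h3>
</a>
<div class="location">
Down
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
a = soup.find('a')

# The <a> tag has several children (whitespace, the <h3>, whitespace),
# so .string is None even though the tag contains text further down.
print(a.string)                # None
print(a.get_text(strip=True))  # File One

# Filtering on the string matches nothing for this markup ...
matched_with_string = soup.find_all('a', href=True, string=True)
# ... while filtering on the href attribute alone finds the link.
matched_without_string = soup.find_all('a', href=True)
print(len(matched_with_string), len(matched_without_string))
```

The string filter is compared against the tag's .string, which is why it only works for tags whose sole child is a piece of text.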
You can resolve this issue by using only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check whether they contain text.
soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])
Or you could use a list comprehension, if you prefer one-liners.
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
Or you could pass a lambda to .find_all().
tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
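One caveat with the a.text check: a tag that contains only whitespace still has a non-empty .text. The sketch below uses a named function instead of a lambda and strips whitespace before testing; the sample markup and the helper name has_href_and_text are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link contains only whitespace.
sample = '<a href="/one"><h3>One</h3></a><a href="/two">  </a>'
soup = BeautifulSoup(sample, 'html.parser')

def has_href_and_text(tag):
    # get_text(strip=True) returns '' when the tag holds only whitespace.
    return (tag.name == 'a'
            and tag.has_attr('href')
            and bool(tag.get_text(strip=True)))

# find_all() calls the function once per tag and keeps the tags for which it returns True.
tags = soup.find_all(has_href_and_text)
hrefs = [tag['href'] for tag in tags]
print(hrefs)  # ['/one']

# For comparison, the plain a.text check also keeps the whitespace-only link.
loose = [a['href'] for a in soup.find_all('a', href=True) if a.text]
print(loose)  # ['/one', '/two']
```

Whether whitespace-only links should count as "having text" depends on the page being scraped, so either check may be the right one.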
If you want to collect all links whether they have text or not, just select all 'a' tags that have an 'href' attribute. Anchor tags usually have links, but that is not a requirement, so I think it's best to use the href argument.
Using .find_all().
links = [a['href'] for a in soup.find_all('a', href=True)]
Using .select() with CSS selectors.
links = [a['href'] for a in soup.select('a[href]')]
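As a quick check that the approaches above agree, here is a small self-contained sketch. The sample markup is invented for illustration and is not from the question; it includes a link with text, a link without text, and an anchor without an href.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question.
sample = '''
<a href="/with-text">Has text</a>
<a href="/no-text"></a>
<a name="anchor-only">No href here</a>
'''

soup = BeautifulSoup(sample, 'html.parser')

# All links, regardless of text: both forms skip the anchor without href.
all_links = [a['href'] for a in soup.find_all('a', href=True)]
css_links = [a['href'] for a in soup.select('a[href]')]

# Only links whose tag contains some text.
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]

print(all_links)        # ['/with-text', '/no-text']
print(css_links)        # ['/with-text', '/no-text']
print(links_with_text)  # ['/with-text']
```

Filtering on href=True (or a[href]) also avoids the KeyError that a['href'] would raise for an anchor with no href attribute.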