How to extract a link from an <a> inside an <h2 class=section-heading>: BeautifulSoup
Question
I am trying to extract a link that is written like this:
<h2 class="section-heading">
<a href="http://www.nytimes.com/pages/arts/index.html">Arts »</a>
</h2>
My code is:
from bs4 import BeautifulSoup
import requests, re

def get_data():
    url='http://www.nytimes.com/'
    s_code=requests.get(url)
    plain_text = s_code.text
    soup = BeautifulSoup(plain_text)
    head_links=soup.findAll('h2', {'class':'section-heading'})
    for n in head_links :
        a = n.find('a')
        print a
        print n.get['href']
        #print a['href']
        #print n.get('href')
        #headings=n.text
        #links = n.get('href')
        #print headings, links

get_data()
The line "print a" simply prints out the whole <a> element inside the <h2 class=section-heading>, i.e.
<a href="http://www.nytimes.com/pages/world/index.html">World »</a>
But when I execute "print n.get['href']", it throws an error:
print n.get['href']
TypeError: 'instancemethod' object has no attribute '__getitem__'
Am I doing something wrong here? Please help.
I couldn't find a similar question here. My case is a bit unusual: I am trying to extract a link that sits inside headings with a specific class name, section-heading.
Recommended answer
First of all, you want to fetch the href of the a element, so on that line you should be accessing a, not n. Secondly, it should be either
a.get('href')
or
a['href']
The latter form raises a KeyError if no such attribute is found, whereas the former returns None, like the usual dictionary/mapping interface. Because .get is a method, it has to be called (.get(...)); indexing it (.get[...]) does not work, and that is the cause of the TypeError in this question.
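The difference between the three spellings can be seen without BeautifulSoup at all. The following is a minimal sketch that uses a plain dict as a stand-in for the tag's attributes (an assumption made for illustration; the variable names are hypothetical):

```python
# A plain dict stands in for a tag's attribute mapping (illustrative assumption).
attrs = {"href": "http://www.nytimes.com/pages/arts/index.html"}

# Calling the method returns the value, or None when the key is missing.
present = attrs.get("href")
missing = attrs.get("title")

# Indexing the mapping raises KeyError when the key is missing.
try:
    attrs["title"]
    index_error = None
except KeyError as exc:
    index_error = type(exc).__name__

# Subscripting the method object itself is the mistake from the question.
try:
    attrs.get["href"]
    subscript_error = None
except TypeError as exc:
    subscript_error = type(exc).__name__

print(present, missing, index_error, subscript_error)
```

The exact wording of the TypeError message differs between Python versions, so only the exception type is recorded here.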
Note that find can also come up empty there and return None, so perhaps you want to iterate over n.find_all('a', href=True) instead:
for n in head_links:
    for a in n.find_all('a', href=True):
        print(a['href'])
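To see why the loop is safer than a bare find, here is a small self-contained sketch. The markup is made up for illustration (the second heading deliberately has no link), and the html.parser backend is an assumption chosen so that no extra parser needs to be installed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second heading contains no <a> element.
html = """
<h2 class="section-heading"><a href="http://www.nytimes.com/pages/arts/index.html">Arts »</a></h2>
<h2 class="section-heading">No link here</h2>
"""

soup = BeautifulSoup(html, "html.parser")
head_links = soup.find_all("h2", {"class": "section-heading"})

# find() returns None for the heading without a link, so indexing its
# result directly would fail for that heading.
found = [n.find("a") for n in head_links]

# find_all(..., href=True) yields nothing for that heading instead.
hrefs = [a["href"] for n in head_links for a in n.find_all("a", href=True)]
print(hrefs)
```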
Even easier than find_all is the select method, which takes a CSS selector. With a single call we get only those <a> elements that have an href attribute and sit inside an <h2 class="section-heading">, much as one would with jQuery.
soup = BeautifulSoup(plain_text)
for a in soup.select('h2.section-heading a[href]'):
    print(a['href'])
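A runnable variant of the same selector, as a sketch on made-up markup rather than the live page. The extra elements and the example.com URL are hypothetical and are there only to show what the selector filters out:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first link satisfies every part of the selector.
html = """
<h2 class="section-heading"><a href="http://www.nytimes.com/pages/arts/index.html">Arts »</a></h2>
<h2 class="section-heading"><a name="anchor-only">No href</a></h2>
<h3 class="section-heading"><a href="http://example.com/ignored">Wrong tag</a></h3>
"""

soup = BeautifulSoup(html, "html.parser")

# h2.section-heading restricts the parent; a[href] requires the attribute.
links = [a["href"] for a in soup.select("h2.section-heading a[href]")]
print(links)
```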
(Also, please use the lower-case method names, such as find_all rather than findAll, in any new code that you write.)