BeautifulSoup, extracting strings within HTML tags, ResultSet objects
Question
I am confused about exactly how to use the ResultSet object in BeautifulSoup, i.e. bs4.element.ResultSet.
After using find_all(), how can one extract the text?
Example:
In the bs4 documentation, the HTML document html_doc looks like:
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well.
</p>
First, create soup and find all the links (the a tags, which carry the href attributes):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all('a')
Output:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
We can also do:
for link in soup.find_all('a'):
    print(link.get('href'))
Output:
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
I would like to get only the text from the class_="sister" tags, i.e.
Elsie
Lacie
Tillie
One might try:
for link in soup.find_all('a'):
    print(link.get_text())
but this results in an error:
AttributeError: 'ResultSet' object has no attribute 'get_text'
Recommended answer
Filter the find_all() call on class_='sister'.
Note: Notice the underscore after class. It's a special case because class is a reserved word.
It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:
Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
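As a rough illustration of the quoted passage, the sketch below compares class_ with two other spellings that should select the same tags: the attrs dictionary and a CSS selector via select(). The two-link HTML snippet and the variable names are made up for this example and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet for illustration: one matching link, one non-matching link.
html = '<p><a class="sister" href="#1">Elsie</a> <a class="other" href="#2">Bob</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# Keyword argument with the trailing underscore (class is reserved in Python).
by_keyword = soup.find_all('a', class_='sister')

# The same filter expressed through the attrs dictionary.
by_attrs = soup.find_all('a', attrs={'class': 'sister'})

# The same filter expressed as a CSS selector.
by_selector = soup.select('a.sister')

names_keyword = [tag.get_text() for tag in by_keyword]
names_attrs = [tag.get_text() for tag in by_attrs]
names_selector = [tag.get_text() for tag in by_selector]
print(names_keyword, names_attrs, names_selector)
```

All three calls return a collection of Tag objects, so the text is read from each tag in turn rather than from the collection.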
Once you have all of the tags with class sister, call .text on them to get the text. Be sure to strip the text, since each tag's text here includes the surrounding newlines.
For example:
from bs4 import BeautifulSoup
html_doc = '''<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well.
</p>'''
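soup = BeautifulSoup(html_doc, 'html.parser')
# find_all() returns a ResultSet (a list of Tag objects) with the given CSS class
sistertags = soup.find_all(class_='sister')
# .text belongs to each Tag, not to the ResultSet, so loop and strip the whitespace
for tag in sistertags:
    print(tag.text.strip())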
Output:
(bs4)macbook:bs4 joeyoung$ python bs4demo.py
Elsie
Lacie
Tillie
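The AttributeError quoted in the question is what Python raises when get_text() is called on the ResultSet itself, for instance soup.find_all('a').get_text(), rather than on each Tag inside it. The sketch below reproduces that under the assumption of a small made-up HTML string, then shows a list-comprehension variant of the loop above that uses get_text(strip=True) in place of .text.strip(). The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML for illustration, with newlines around the link text.
html = '<a class="sister">\n Elsie\n</a><a class="sister">\n Lacie\n</a>'
soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all(class_='sister')

# A ResultSet behaves like a list of Tag objects.
is_list = isinstance(results, list)

# Calling a Tag method on the whole ResultSet fails with AttributeError.
try:
    results.get_text()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Calling it on each Tag works; strip=True trims the surrounding whitespace.
names = [tag.get_text(strip=True) for tag in results]

print(is_list, names)
print(error_message)
```

The exact wording of the error message may differ between Beautiful Soup versions, so the sketch only relies on the exception type.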