BeautifulSoup, extracting strings within HTML tags, ResultSet objects


Question

I am confused about exactly how to use the ResultSet object in BeautifulSoup, i.e. bs4.element.ResultSet.

After using find_all(), how can one extract the text?

Example:

In the bs4 documentation, the HTML document html_doc looks like:

<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>

First, create the soup and find all the <a> tags:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all('a')

Output:

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

We can also do:

for link in soup.find_all('a'):
    print(link.get('href'))

Output:

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

I would like to get only the text from the tags with class "sister", i.e.

Elsie
Lacie
Tillie

One might try:

for link in soup.find_all('a'):
    print(link.get_text())

but this results in an error:

AttributeError: 'ResultSet' object has no attribute 'get_text'
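
For what it's worth, the loop above calls get_text() on each individual Tag, which normally works. An AttributeError that mentions 'ResultSet' typically appears when the method is called on the result of find_all() itself, which is a list-like ResultSet rather than a single tag. Here is a minimal sketch of the difference, assuming a one-link version of the HTML above (the variable names links and texts are illustrative only):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_doc (an assumption made for brevity).
html_doc = '<p><a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a></p>'
soup = BeautifulSoup(html_doc, 'html.parser')

links = soup.find_all('a')      # a ResultSet, which behaves like a list of Tag objects
print(isinstance(links, list))

try:
    links.get_text()            # Tag method called on the whole ResultSet
except AttributeError as exc:
    print('AttributeError:', exc)

# Calling get_text() on each Tag inside the ResultSet works.
texts = [link.get_text() for link in links]
print(texts)
```

So if the error shows up, it is worth checking whether get_text() is being called on the list returned by find_all() instead of on its elements.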

Recommended answer

Call find_all() and filter on class_='sister'.

Note: Notice the underscore after class. It's a special case because class is a reserved word in Python.

It's very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, "class", is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
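
To make the quoted point concrete, here is a small sketch, assuming a shortened two-link document. It filters by CSS class with the class_ keyword and, as an equivalent alternative, with an attrs dictionary, which avoids the reserved word altogether. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_doc (an assumption made for brevity).
html_doc = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Keyword form: note the trailing underscore in class_.
by_keyword = soup.find_all('a', class_='sister')

# Dictionary form: the attribute name is an ordinary string key.
by_attrs = soup.find_all('a', attrs={'class': 'sister'})

ids = [[tag['id'] for tag in found] for found in (by_keyword, by_attrs)]
print(ids)
```

Both calls should select the same tags; the class_ form is simply shorter to type.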

Once you have all of the tags with class sister, call .text on each of them to get the text. Be sure to strip the surrounding whitespace from the text.

For example:

from bs4 import BeautifulSoup

html_doc = '''<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>'''

soup = BeautifulSoup(html_doc, 'html.parser')
sistertags = soup.find_all(class_='sister')
for tag in sistertags:
    print(tag.text.strip())

Output:

(bs4)macbook:bs4 joeyoung$ python bs4demo.py
Elsie
Lacie
Tillie
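
If the names are needed as a list rather than printed one by one, a list comprehension is one option. The sketch below assumes a shortened version of the document and uses get_text(strip=True), which strips whitespace as an alternative to .text.strip(). The variable name names is illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for html_doc (an assumption made for brevity).
html_doc = '''<p class="story">
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
</p>'''
soup = BeautifulSoup(html_doc, 'html.parser')

# Collect the stripped text of every tag with class "sister".
names = [tag.get_text(strip=True) for tag in soup.find_all(class_='sister')]
print(names)
```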
