How to extract the text inside a tag with BeautifulSoup in Python?


Question

Suppose I have an HTML string like this:

<html>
    <div id="d1">
        Text 1
    </div>
    <div id="d2">
        Text 2
        <a href="http://my.url/">a url</a>
        Text 2 continue
    </div>
    <div id="d3">
        Text 3
    </div>
</html>

I want to extract the content of d2 that is NOT wrapped by other tags, skipping "a url". In other words, I want to get this result:

Text 2
Text 2 continue

Is there a way to do this with BeautifulSoup?

I tried this, but it is not correct:

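from bs4 import BeautifulSoup

# html_doc is the HTML string shown above
soup = BeautifulSoup(html_doc, 'html.parser')
s = soup.find(id='d2').text  # .text also includes the text of the nested <a> tag
print(s)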

Recommended answer

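Try using .find_all(text=True, recursive=False). With text=True only text nodes are matched, and recursive=False restricts the search to direct children of the tag, so the text inside the nested <a> is skipped. (Since Beautiful Soup 4.4.0 the text argument is also available under the name string.)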

from bs4 import BeautifulSoup
div_test="""
<html>
    <div id="d1">
        Text 1
    </div>
    <div id="d2">
        Text 2
        <a href="http://my.url/">a url</a>
        Text 2 continue
    </div>
    <div id="d3">
        Text 3
    </div>
</html>
"""
soup = BeautifulSoup(div_test, 'lxml')
s = soup.find(id='d2').find_all(text=True, recursive=False)
print(s)
print([e.strip() for e in s])  # strip the surrounding whitespace from each text node

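It returns a list containing only the text (the output below is from Python 2; Python 3 prints the same strings without the u prefix):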

[u'\n        Text 2\n        ', u'\n        Text 2 continue\n    ']
[u'Text 2', u'Text 2 continue']
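
As an alternative sketch of the same idea, one could walk the tag's direct children and keep only the text nodes. The helper name direct_text and the shortened html_doc sample are assumptions made for illustration, not part of the original answer. This version uses the built-in html.parser so it does not need lxml installed. Note that HTML comments are also NavigableString subclasses in bs4, so they would be kept by this check as well.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the sample document from the question (assumed for this sketch).
html_doc = """
<html>
    <div id="d1">
        Text 1
    </div>
    <div id="d2">
        Text 2
        <a href="http://my.url/">a url</a>
        Text 2 continue
    </div>
</html>
"""


def direct_text(tag):
    """Return the stripped, non-empty text nodes that are direct children of `tag`.

    `direct_text` is a hypothetical helper, not a BeautifulSoup API.
    """
    pieces = []
    for child in tag.children:
        # Text nodes are NavigableString instances; nested tags such as <a> are skipped.
        if isinstance(child, NavigableString):
            stripped = child.strip()
            if stripped:
                pieces.append(stripped)
    return pieces


soup = BeautifulSoup(html_doc, "html.parser")
result = direct_text(soup.find(id="d2"))
print(result)  # ['Text 2', 'Text 2 continue']
```

Joining the pieces with "\n".join(result) would give the two-line output asked for in the question.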
