How to extract the text inside a tag with BeautifulSoup in Python?
Question
Suppose I have an HTML string like this:
<html>
<div id="d1">
Text 1
</div>
<div id="d2">
Text 2
<a href="http://my.url/">a url</a>
Text 2 continue
</div>
<div id="d3">
Text 3
</div>
</html>
I want to extract the content of d2 that is NOT wrapped by other tags, skipping "a url". In other words, I want to get this result:
Text 2
Text 2 continue
Is there a way to do this with BeautifulSoup?
I tried this, but it is not correct:
from bs4 import BeautifulSoup

# html_doc holds the HTML string shown above
soup = BeautifulSoup(html_doc, 'html.parser')
s = soup.find(id='d2').text
print(s)
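For context, here is a minimal sketch of why this attempt falls short. It assumes a trimmed copy of the markup above (the variable names `all_text` and `lines` are illustrative, not from the original post). The `.text` property collects every string beneath the tag, so the link text is included as well:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the d2 block from the question (assumption: only this div matters here)
html_doc = """<div id="d2">
Text 2
<a href="http://my.url/">a url</a>
Text 2 continue
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

# .text concatenates all descendant strings, including the text inside <a>
all_text = soup.find(id="d2").text

# Drop blank lines so the unwanted "a url" entry is easy to spot
lines = [line for line in all_text.splitlines() if line.strip()]
print(lines)
```

The printed list contains "a url" between the two wanted strings, which is exactly the part the question wants to skip.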
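Recommended answer
Try using .find_all(text=True, recursive=False). Here text=True matches only text nodes, and recursive=False restricts the search to the direct children of the tag, so the text inside the nested <a> is skipped (Beautiful Soup 4.4.0 and later also accept string=True for the same filter):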
from bs4 import BeautifulSoup
div_test="""
<html>
<div id="d1">
Text 1
</div>
<div id="d2">
Text 2
<a href="http://my.url/">a url</a>
Text 2 continue
</div>
<div id="d3">
Text 3
</div>
</html>
"""
soup = BeautifulSoup(div_test, 'lxml')  # needs the lxml package installed
s = soup.find(id='d2').find_all(text=True, recursive=False)
print(s)
print([e.strip() for e in s])  # strip surrounding whitespace
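This returns a list containing only the direct text nodes. The first line is the raw result and the second is the stripped version (the output was printed under Python 2, hence the u prefixes):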
[u'\n Text 2\n ', u'\n Text 2 continue\n ']
[u'Text 2', u'Text 2 continue']
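As an alternative, the same idea can be sketched by walking the tag's direct children and keeping only the text nodes. This is a hedged illustration, not part of the original answer: the helper name `direct_text` is hypothetical, the markup is a trimmed copy of the example, and `html.parser` is used so that no extra parser package is assumed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Trimmed copy of the example markup (assumption: only d2 is needed)
html_doc = """
<html>
<div id="d2">
Text 2
<a href="http://my.url/">a url</a>
Text 2 continue
</div>
</html>
"""


def direct_text(tag):
    """Return the stripped, non-empty text nodes that are direct children of tag.

    Hypothetical helper for illustration only.
    """
    pieces = []
    for child in tag.children:
        # Text nodes are NavigableString instances; Comment subclasses it, so skip those
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            stripped = child.strip()
            if stripped:
                pieces.append(stripped)
    return pieces


soup = BeautifulSoup(html_doc, "html.parser")
result = direct_text(soup.find(id="d2"))
print(result)             # list of direct text pieces
print("\n".join(result))  # one piece per line, as asked in the question
```

Filtering out empty strings and comments keeps the result clean when the tag contains extra whitespace-only nodes or HTML comments between child elements.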