Get text of HTML tags without text of inner child tags
Problem description
Example:
Sometimes the HTML is:
<div id="1">
<div id="2">
this is the text i do NOT want
</div>
this is the text i want here
</div>
Other times it's just:
<div id="1">
this is the text i want here
</div>
I want to get only the text that sits directly inside the outer tag and ignore the text of all child tags. If I use the .text property, I get both.
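To make the problem concrete, here is a minimal sketch that reproduces it with the first snippet above. It assumes Beautiful Soup 4 is installed and uses the built-in 'html.parser'; the variable name all_text is only for illustration.

```python
from bs4 import BeautifulSoup

html = '''<div id="1">
<div id="2">
this is the text i do NOT want
</div>
this is the text i want here
</div>'''

# 'html.parser' is an assumption; any installed parser should behave the same here.
soup = BeautifulSoup(html, 'html.parser')
outer = soup.find('div', id='1')

# .text concatenates every string below the tag, including those of child tags.
all_text = outer.text
print(repr(all_text))
```

The printed string contains both the wanted and the unwanted sentence, which is the behaviour the question is trying to avoid.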
Solution
Updated to use a more generic method (see edit history for original answer):
You can pick out the text children of the outer div by testing whether each child is an instance of NavigableString.
from bs4 import BeautifulSoup, NavigableString
html = '''<div id="1">
<div id="2">
this is the text i do NOT want
</div>
this is the text i want here
</div>'''
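soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a parser-guessing warning
outer = soup.div  # the first div in the document, i.e. the outer one
# Iterating over a tag yields its direct children; keep only the text nodes.
inner_text = [element for element in outer if isinstance(element, NavigableString)]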
This results in a list of strings contained in the outer div element.
>>> inner_text
[u'\n', u'\nthis is the text i want here\n']
>>> ''.join(inner_text)
u'\n\nthis is the text i want here\n'
For your second example:
html = '''<div id="1">
this is the text i want here
</div>'''
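soup2 = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a parser-guessing warning
outer = soup2.div
# Same filter as before: keep only the direct text children of the div.
inner_text = [element for element in outer if isinstance(element, NavigableString)]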
>>> inner_text
[u'\nthis is the text i want here\n']
This also works in other cases: when the outer div's text appears before any child tags, between child tags, in several separate pieces, or not at all.
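As a rough illustration of those other cases, the same filter can be wrapped in a small helper. The function name own_text and the sample markup are hypothetical, not part of the original answer, and stripping the surrounding whitespace is an added convenience.

```python
from bs4 import BeautifulSoup, NavigableString


def own_text(tag):
    """Join the strings that are direct children of tag, skipping child tags."""
    pieces = [child for child in tag.children if isinstance(child, NavigableString)]
    # strip() removes the leading/trailing newlines left over from the markup.
    return ''.join(pieces).strip()


# Text before, between and after child tags.
mixed_html = '<div id="1">before<span>child one</span>between<span>child two</span>after</div>'
mixed = own_text(BeautifulSoup(mixed_html, 'html.parser').div)

# No direct text at all: the result is an empty string.
empty = own_text(BeautifulSoup('<div><p>only a child</p></div>', 'html.parser').div)

print(repr(mixed), repr(empty))
```

One caveat: HTML comments are also NavigableString instances in Beautiful Soup, so a comment placed directly inside the tag would be included by this filter unless it is excluded separately.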