Get text of HTML tags without text of inner child tags

Question

Example:

Sometimes the HTML is:

<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>

Other times it's just:

<div id="1">
    this is the text i want here
</div>

I want to get only the text in the one tag, and ignore all other child tags. If I run the .text property, I get both.

Recommended answer

Updated to use a more generic method (see edit history for original answer):

You can extract the text nodes that are direct children of the outer div by testing whether each child is an instance of NavigableString.

from bs4 import BeautifulSoup, NavigableString

html = '''<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>'''

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly so results do not depend on what is installed
outer = soup.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]

This results in a list of strings contained in the outer div element.

>>> inner_text
['\n', '\n    this is the text i want here\n']
>>> ''.join(inner_text)
'\n\n    this is the text i want here\n'
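
The strings above still carry the surrounding newlines and indentation. A minimal sketch of one way to tidy them, assuming you want a single whitespace-trimmed string; the helper name own_text is hypothetical and not part of bs4:

```python
from bs4 import BeautifulSoup, NavigableString

def own_text(tag):
    """Hypothetical helper: join the tag's direct text children, trimmed."""
    # Keep only direct children that are strings, not nested tags.
    parts = (child.strip() for child in tag.children
             if isinstance(child, NavigableString))
    # Drop the pieces that were whitespace only.
    return ' '.join(part for part in parts if part)

html = '''<div id="1">
    <div id="2">
        this is the text i do NOT want
    </div>
    this is the text i want here
</div>'''

soup = BeautifulSoup(html, 'html.parser')
result = own_text(soup.div)
print(result)
```

Note that Comment is a subclass of NavigableString in bs4, so HTML comments directly inside the tag would also pass this isinstance test.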

Second example:

html = '''<div id="1">
    this is the text i want here
</div>'''
soup2 = BeautifulSoup(html, 'html.parser')
outer = soup2.div
inner_text = [element for element in outer if isinstance(element, NavigableString)]

>>> inner_text
['\n    this is the text i want here\n']

This also works in other cases: when the outer div's text appears before any child tags, between child tags, as several separate text nodes, or not at all.
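
The cases listed above can be checked with a short sketch. The markup snippets and the function name direct_strings are made up for illustration, and whitespace-only strings are dropped here for readability, which the answer's own code does not do:

```python
from bs4 import BeautifulSoup, NavigableString

def direct_strings(markup):
    """Hypothetical helper: trimmed, non-empty direct text children of the first div."""
    outer = BeautifulSoup(markup, 'html.parser').div
    return [child.strip() for child in outer.children
            if isinstance(child, NavigableString) and child.strip()]

# Text before any child tag.
before = direct_strings('<div>first<p>skip</p></div>')
# Text between two child tags.
between = direct_strings('<div><p>skip</p>middle<p>skip</p></div>')
# Several separate text nodes.
several = direct_strings('<div>one<p>skip</p>two<p>skip</p>three</div>')
# No direct text at all.
missing = direct_strings('<div><p>skip</p></div>')

print(before, between, several, missing)
```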
