How to remove content in nested tags with BeautifulSoup?


Problem description


      How can I remove the content of nested tags with BeautifulSoup? These posts show the reverse, retrieving the content of nested tags: "How to get contents of nested tag using BeautifulSoup" and "BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?"

      I have tried .text, but it only strips the tags and keeps the text of the nested tags:

      >>> from bs4 import BeautifulSoup as bs
      >>> html = "<foo>Something something <bar> blah blah</bar> something else</foo>"
      >>> bs(html, 'html.parser').find_all('foo')[0]
      <foo>Something something <bar> blah blah</bar> something else</foo>
      >>> bs(html, 'html.parser').find_all('foo')[0].text
      'Something something  blah blah something else'
      

      Desired output:

      Something something something else

      Solution

      You can iterate over the element's direct children and keep only those that are instances of bs4.element.NavigableString:

      import bs4
      from bs4 import BeautifulSoup as bs

      html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

      def get_only_text(elem):
          # Yield only the strings that are direct children of elem,
          # skipping nested tags and everything inside them.
          for item in elem.children:
              if isinstance(item, bs4.element.NavigableString):
                  yield item

      print(''.join(get_only_text(bs(html, 'html.parser').find_all('foo')[0])))
      

      Output:

      Something something  something  else
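
The output above still has doubled spaces where the nested tags used to be, while the desired output has single spaces. As an alternative, here is a minimal sketch that removes the nested tags from the parsed tree with `decompose()` and then collapses the leftover whitespace. The helper name `text_without_nested_tags` is hypothetical, not part of BeautifulSoup, and the sketch assumes the requested tag exists in the markup and that modifying the parsed tree is acceptable.

```python
from bs4 import BeautifulSoup


def text_without_nested_tags(markup, tag_name):
    """Return the text of the first `tag_name` element, ignoring nested tags.

    Hypothetical helper for illustration; assumes the tag is present.
    """
    soup = BeautifulSoup(markup, "html.parser")
    elem = soup.find(tag_name)
    # find_all(True) matches every tag; recursive=False limits it to direct children.
    for child in elem.find_all(True, recursive=False):
        child.decompose()  # drop the tag and everything inside it from the tree
    # Collapse the runs of whitespace left behind by the removed tags.
    return " ".join(elem.get_text().split())


html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"
print(text_without_nested_tags(html, "foo"))
```

Unlike the generator approach, this changes the tree in place, so it suits cases where the nested tags are not needed afterwards.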
      
