Exclude unwanted tag on Beautifulsoup Python

Views: 397 | Published: 2018/6/15 13:38:19 | Tags: python, html, web-scraping, beautifulsoup


Question

 
<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>
How can I print "I Like your face" instead of "I Like to punch your face"?


I tried this:

lala = soup.find_all('span')
for p in lala:
    if not p.find(class_='unwanted'):
        print p.text

but it gives

    "TypeError: find() takes no keyword arguments"

Solution

You can use extract() to remove the unwanted tag before you get the text.

Note that this keeps all the '\n' characters and spaces, so you will need some extra work to remove them.

from bs4 import BeautifulSoup as BS

data = '''<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>'''

soup = BS(data, 'html.parser')

# the first <span> in the document is the outer one
external_span = soup.find('span')

print("1 HTML:", external_span)
print("1 TEXT:", external_span.text.strip())

# find the nested <span> and detach it from the tree
unwanted = external_span.find('span')
unwanted.extract()

print("2 HTML:", external_span)
print("2 TEXT:", external_span.text.strip())

Result

1 HTML: <span>
  I Like
  <span class="unwanted"> to punch </span>
   your face
 </span>
1 TEXT: I Like
   to punch 
   your face
2 HTML: <span>
  I Like

   your face
 </span>
2 TEXT: I Like

   your face
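
The leftover newlines and spaces mentioned above can be collapsed with plain string methods. The following is only a minimal sketch, assuming the unwanted elements can be identified by the class name 'unwanted' from the question; the variable names html, outer and clean_text are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span")

# Detach every descendant that carries the 'unwanted' class
# (class name taken from the question; adjust for other markup).
for tag in outer.find_all(class_="unwanted"):
    tag.extract()

# str.split() with no arguments splits on any run of whitespace,
# so re-joining with single spaces removes newlines and indentation.
clean_text = " ".join(outer.get_text().split())
print(clean_text)
```

Selecting by class instead of by position means the loop still works if the outer span contains several unwanted tags.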




Alternatively, you can skip every Tag object inside the external span and keep only the NavigableString objects (the plain text in the HTML).

from bs4 import BeautifulSoup as BS
import bs4

data = '''<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>'''

soup = BS(data, 'html.parser')

external_span = soup.find('span')

text = []
# iterating over a tag yields its direct children
for x in external_span:
    # keep plain text nodes, skip nested Tag objects
    if isinstance(x, bs4.element.NavigableString):
        text.append(x.strip())
print(" ".join(text))

Result
I Like your face
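
The same filtering can probably be written more compactly by asking find_all() for direct string children only. This is a sketch under assumptions: own_text is a hypothetical helper name (not part of bs4), and the string= argument assumes a reasonably recent Beautiful Soup 4 release (older releases used text= instead).

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Join the text of a tag's direct string children, ignoring nested tags.

    `own_text` is a hypothetical helper name, not part of bs4.
    """
    # string=True matches text nodes; recursive=False limits the search
    # to direct children, so text inside nested tags is not returned.
    pieces = tag.find_all(string=True, recursive=False)
    # Drop whitespace-only pieces so the join does not produce double spaces.
    return " ".join(s.strip() for s in pieces if s.strip())


html = """<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>"""

outer = BeautifulSoup(html, "html.parser").find("span")
result = own_text(outer)
print(result)
```

Unlike extract(), this approach leaves the parsed tree unchanged, which may matter if the nested tags are needed later.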


                        