Exclude an unwanted tag with BeautifulSoup in Python

Problem description

<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>

How can I print "I Like your face" instead of "I Like to punch your face"?

I tried this:

lala = soup.find_all('span')
for p in lala:
    if not p.find(class_='unwanted'):
        print(p.text)

but it gives "TypeError: find() takes no keyword arguments".

Solution

You can use extract() to remove the unwanted tag from the tree before you get the text.

However, the remaining text keeps all of the '\n' characters and spaces, so you will need some extra work to remove them.

from bs4 import BeautifulSoup as BS

data = '''<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>'''

soup = BS(data, 'html.parser')

# find() returns the first match, which is the outer span
external_span = soup.find('span')

print("1 HTML:", external_span)
print("1 TEXT:", external_span.text.strip())

# the first span inside the outer span is the unwanted one;
# extract() removes it from the tree
unwanted = external_span.find('span')
unwanted.extract()

print("2 HTML:", external_span)
print("2 TEXT:", external_span.text.strip())

Result

1 HTML: <span>
  I Like
  <span class="unwanted"> to punch </span>
   your face
 </span>
1 TEXT: I Like
   to punch 
   your face
2 HTML: <span>
  I Like

   your face
 </span>
2 TEXT: I Like

   your face
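
The leftover newlines and spaces can be collapsed with plain string methods. This is only a minimal sketch, assuming the same sample HTML with the outer span properly closed; the names html, outer and clean_text are chosen here for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span")

# Remove the nested tag from the tree, as in the answer above.
outer.find("span", class_="unwanted").extract()

# str.split() with no arguments splits on any run of whitespace
# (spaces and newlines), so joining with a single space normalizes the text.
clean_text = " ".join(outer.get_text().split())
print(clean_text)
```

Note that this also collapses any whitespace that was meaningful inside the kept text, which may or may not be what you want.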


Alternatively, you can skip every Tag object inside the external span and keep only the NavigableString objects (the plain text in the HTML). This leaves the parsed tree unchanged.

from bs4 import BeautifulSoup as BS
import bs4

data = '''<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>'''

soup = BS(data, 'html.parser')

external_span = soup.find('span')

text = []
# iterating over a Tag yields its direct children
for x in external_span:
    # keep plain text nodes, skip nested Tag objects
    if isinstance(x, bs4.element.NavigableString):
        text.append(x.strip())
print(" ".join(text))

Result

I Like your face
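
The same filtering can probably be written more compactly by asking find_all() for text nodes only. This is a hedged sketch rather than part of the original answer: it assumes a Beautiful Soup version that accepts the string argument of find_all() (older releases call it text), and the names direct_strings and result are illustrative.

```python
from bs4 import BeautifulSoup

html = """<span>
  I Like
  <span class='unwanted'> to punch </span>
   your face
 </span>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span")

# string=True matches only text nodes; recursive=False restricts the search
# to direct children, so text inside the nested span is not returned.
direct_strings = outer.find_all(string=True, recursive=False)

# Strip each piece and drop any that are empty after stripping.
result = " ".join(s.strip() for s in direct_strings if s.strip())
print(result)
```

As with the loop above, the tree is not modified, so the unwanted span is still available afterwards if it is needed.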
