Exclude unwanted tag on Beautifulsoup Python
Problem description
    <span>
      I Like
      <span class='unwanted'> to punch </span>
      your face
    </span>

How can I print "I Like your face" instead of "I Like to punch your face"?

I tried this:

    lala = soup.find_all('span')
    for p in lala:
        if not p.find(class_='unwanted'):
            print p.text

but it gives "TypeError: find() takes no keyword arguments".
Solution

You can use extract() to remove the unwanted tag before you get the text. It keeps all the '\n' characters and spaces, though, so you will need some extra work to remove them.

    data = '''<span>
      I Like
      <span class='unwanted'> to punch </span>
      your face
    <span>'''

    from bs4 import BeautifulSoup as BS

    soup = BS(data, 'html.parser')

    external_span = soup.find('span')

    print("1 HTML:", external_span)
    print("1 TEXT:", external_span.text.strip())

    # Detach the first nested <span> (the unwanted one) from the tree.
    unwanted = external_span.find('span')
    unwanted.extract()

    print("2 HTML:", external_span)
    print("2 TEXT:", external_span.text.strip())
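Result (the sample data ends with <span> rather than </span>, which is why an empty <span></span> appears in the output):

    1 HTML: <span>
      I Like
      <span class="unwanted"> to punch </span>
      your face
    <span></span></span>
    1 TEXT: I Like
       to punch
      your face
    2 HTML: <span>
      I Like

      your face
    <span></span></span>
    2 TEXT: I Like

      your face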
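The whitespace cleanup mentioned above is left open in the answer. One possible way to do it, shown here only as a sketch, is to split the extracted text on whitespace and rejoin it with single spaces. The sample string `html` and the names `outer` and `cleaned` are assumptions made for illustration, and the loop removes tags by class instead of taking the first nested span.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; unlike the snippet above, the outer <span> is closed.
html = """<span>
  I Like
  <span class='unwanted'> to punch </span>
  your face
</span>"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("span")

# Detach every descendant <span> that carries the unwanted class.
for tag in outer.find_all("span", class_="unwanted"):
    tag.extract()

# str.split() with no arguments splits on any run of whitespace,
# including '\n', so rejoining with single spaces normalizes the text.
cleaned = " ".join(outer.get_text().split())
print(cleaned)  # I Like your face
```

Selecting by `class_` keeps the removal independent of where the unwanted tag sits inside the outer span.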
Alternatively, you can skip every Tag object inside the external span and keep only the NavigableString objects (the plain text in the HTML).

    data = '''<span>
      I Like
      <span class='unwanted'> to punch </span>
      your face
    <span>'''

    from bs4 import BeautifulSoup as BS
    import bs4

    soup = BS(data, 'html.parser')

    external_span = soup.find('span')

    # Keep only the direct children that are plain text nodes.
    text = []
    for x in external_span:
        if isinstance(x, bs4.element.NavigableString):
            text.append(x.strip())

    print(" ".join(text))
Result
I Like your face
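The same idea can be wrapped in a small helper. The sketch below is an illustration, not part of the original answer: the function name `direct_text` and the sample markup are hypothetical. It also accounts for one detail worth knowing: in Beautiful Soup, `Comment` is a subclass of `NavigableString`, so a bare `isinstance` check would let HTML comments through as text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString


def direct_text(tag):
    """Join the text nodes that are direct children of `tag` (hypothetical helper)."""
    parts = []
    for child in tag.children:
        # Comment subclasses NavigableString, so exclude it explicitly.
        if isinstance(child, NavigableString) and not isinstance(child, Comment):
            piece = child.strip()
            if piece:  # skip whitespace-only nodes
                parts.append(piece)
    return " ".join(parts)


# Hypothetical sample with an HTML comment next to the unwanted tag.
html = "<span> I Like <!-- note --><span class='unwanted'> to punch </span> your face </span>"
soup = BeautifulSoup(html, "html.parser")
print(direct_text(soup.find("span")))  # I Like your face
```

Because only direct children are inspected, text nested inside any child tag is dropped, whatever its class. That suits this example but may be too aggressive when wanted text lives in nested tags.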