Python BeautifulSoup: trying to remove HTML 'span' tags

Question

I am trying to remove the tags from

[<span class="street-address">
            510 E Airline Way
           </span>]

and I have used this clean function to remove whatever is between < and >:

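import re

def clean(val):
    # Coerce non-string input (e.g. a list of tags) to its string form.
    if not isinstance(val, str):
        val = str(val)
    val = re.sub(r'<.*?>', '', val)  # drop anything between < and >
    val = re.sub(r'\s+', ' ', val)   # collapse runs of whitespace
    return val.strip()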

and it produces [510 E Airline Way]

I am trying to add something to the clean function that removes the characters '[' and ']'. Basically, I just want to get "510 E Airline Way".

Does anyone have a clue what I can add to the clean function?

Thank you.

Recommended answer

Using re:

>>> import re
>>> s='[<span class="street-address">\n            510 E Airline Way\n           </span>]'
>>> re.sub(r'\[|\]|\s*<[^>]*>\s*', '', s)
'510 E Airline Way'
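The same idea can be folded back into the clean function from the question. The following is a minimal sketch, assuming Python 3 and assuming the input is either a string or something whose str() form looks like the bracketed example; the extra bracket-stripping step is the only addition to the original function.

```python
import re

def clean(val):
    """Strip tags, square brackets, and extra whitespace from val."""
    if not isinstance(val, str):
        val = str(val)                 # e.g. a list printed as "[<span>...</span>]"
    val = re.sub(r'<[^>]*>', '', val)  # remove anything between < and >
    val = re.sub(r'[\[\]]', '', val)   # remove literal '[' and ']'
    val = re.sub(r'\s+', ' ', val)     # collapse whitespace runs to one space
    return val.strip()

s = '[<span class="street-address">\n            510 E Airline Way\n           </span>]'
print(clean(s))
```

Note that this removes every '[' and ']' in the text, including any that belong to the address itself, so it is only suitable when the brackets are known to be list-formatting residue.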

Using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> s='[<span class="street-address">\n            510 E Airline Way\n           </span>]'
>>> b = BeautifulSoup(s, 'html.parser')
>>> b.find('span').get_text(strip=True)
'510 E Airline Way'

Using lxml:

>>> from lxml import html
>>> s='[<span class="street-address">\n            510 E Airline Way\n           </span>]'
>>> h = html.document_fromstring(s)
>>> h.cssselect('span')[0].text.strip()
'510 E Airline Way'
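If installing a third-party parser is not an option, a similar result can probably be had with the standard library's html.parser module. This is only a sketch under that assumption: TextExtractor and strip_tags are hypothetical helper names, not part of any library, and the final strip('[] ') assumes the brackets appear only at the ends of the string.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)

def strip_tags(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Join the text nodes, then collapse whitespace runs to single spaces.
    return ' '.join(''.join(parser.parts).split())

s = '[<span class="street-address">\n            510 E Airline Way\n           </span>]'
print(strip_tags(s).strip('[] '))
```

Unlike the regular-expression approach, this leaves the tag handling to a real parser, so attribute values containing '>' should not confuse it.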
