Python BeautifulSoup: trying to remove HTML 'span' tags
Question
I am trying to remove the tags from
[<span class="street-address">
510 E Airline Way
</span>]
and I have used this clean function to remove whatever is between < and >:
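import re

def clean(val):
    # Coerce non-string input (e.g. a BeautifulSoup result) to text first
    if not isinstance(val, str):
        val = str(val)
    val = re.sub(r'<.*?>', '', val)   # strip anything between < and >
    val = re.sub(r'\s+', ' ', val)    # collapse runs of whitespace
    return val.strip()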
and it produces [510 E Airline Way].
I am trying to add something to the clean function to remove the '[' and ']' characters; basically I just want to get "510 E Airline Way".
Does anyone have any idea what I can add to the clean function?
Thank you.
Recommended answer
Using re:
>>> import re
>>> s='[<span class="street-address">\n 510 E Airline Way\n </span>]'
>>> re.sub(r'\[|\]|\s*<[^>]*>\s*', '', s)
'510 E Airline Way'
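Since the question asks what to add to clean itself, here is a minimal sketch of that function with one extra substitution for the brackets. It assumes the brackets are never part of the text you want to keep, and the sample string s is simply the one from the question.

```python
import re

def clean(val):
    """Return the visible text of an HTML fragment, without tags or brackets."""
    if not isinstance(val, str):
        val = str(val)                  # e.g. a result list printed as "[<span>...</span>]"
    val = re.sub(r'<[^>]*>', '', val)   # drop anything that looks like a tag
    val = re.sub(r'[\[\]]', '', val)    # drop the literal '[' and ']' characters
    val = re.sub(r'\s+', ' ', val)      # collapse runs of whitespace
    return val.strip()

s = '[<span class="street-address">\n 510 E Airline Way\n </span>]'
print(clean(s))  # 510 E Airline Way
```

The brackets in the question look like what you get when a list of tags is converted with str(), which is only a guess from the output shown. If that is the case, passing the single element (for example the first item of the list) to clean would avoid the brackets without the extra substitution.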
Using BeautifulSoup:
>>> from BeautifulSoup import BeautifulSoup
>>> s='[<span class="street-address">\n 510 E Airline Way\n </span>]'
>>> b = BeautifulSoup(s)
>>> b.find('span').getText()
u'510 E Airline Way'
Using lxml:
>>> from lxml import html
>>> s='[<span class="street-address">\n 510 E Airline Way\n </span>]'
>>> h = html.document_fromstring(s)
>>> h.cssselect('span')[0].text.strip()
'510 E Airline Way'
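If installing a third-party parser is not an option, roughly the same result can be had from Python 3's standard library. This is only a sketch and not part of the original answer: the SpanText class and the span_text helper are hypothetical names made up for illustration, and it collects text from every span in the input rather than a specific one.

```python
from html.parser import HTMLParser

class SpanText(HTMLParser):
    """Collect the text found inside <span> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <span> elements currently open
        self.parts = []   # text fragments seen while inside a span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:    # ignore text outside spans, such as '[' and ']'
            self.parts.append(data)

def span_text(markup):
    parser = SpanText()
    parser.feed(markup)
    parser.close()
    # Join the fragments and collapse all whitespace runs to single spaces
    return " ".join("".join(parser.parts).split())

s = '[<span class="street-address">\n 510 E Airline Way\n </span>]'
print(span_text(s))  # 510 E Airline Way
```

Because the parser tracks whether it is inside a span, the surrounding brackets are skipped without any special handling.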