BeautifulSoup: Strip specified attributes, but preserve the tag and its contents
Question
I'm trying to 'defrontpagify' the html of a MS FrontPage generated website, and I'm writing a BeautifulSoup script to do it.
However, I've gotten stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that contains them. The code snippet:
REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font',
                     'dir','face','size','color','style','class','width','height','hspace',
                     'border','valign','align','background','bgcolor','text','link','vlink',
                     'alink','cellpadding','cellspacing']

# remove all attributes in REMOVE_ATTRIBUTES from all tags,
# but preserve the tag and its content.
for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.findAll(attribute=True):
        del(tag[attribute])
It runs without error, but doesn't actually strip any of the attributes. When I run it without the outer loop, hard-coding a single attribute (soup.findAll(style=True)), it works.
Does anyone see the problem here?
PS - I don't much like the nested loops either. If anyone knows a more functional, map/filter-ish style, I'd love to see it.
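The difference between the looped and the hard-coded call is consistent with how Python handles keyword arguments: the name on the left of `=` is taken literally and is never replaced by a variable's value. The sketch below illustrates this in plain Python. `find_all_stub` is a hypothetical stand-in that only echoes the filters it receives; it is not part of BeautifulSoup.

```python
def find_all_stub(**kwargs):
    """Hypothetical stand-in for findAll: return the attribute filters received."""
    return kwargs

attribute = 'style'

# The keyword name is used literally; the variable `attribute` is not consulted.
literal = find_all_stub(attribute=True)

# Unpacking a dict lets the variable's value become the keyword name.
dynamic = find_all_stub(**{attribute: True})

print(literal)   # {'attribute': True}
print(dynamic)   # {'style': True}
```

In other words, the loop in the question probably searches for tags that carry an attribute literally named "attribute" on every iteration, which would explain why nothing is removed.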
Recommended answer
The line

    for tag in soup.findAll(attribute=True):

does not find any tags, because the keyword is read literally as an attribute named "attribute" rather than as the value of the loop variable. There might be a way to use findAll; I'm not sure. However, this works:
import BeautifulSoup  # BeautifulSoup 3 module; tag.attrs is a list of (key, value) pairs

REMOVE_ATTRIBUTES = [
    'lang','language','onmouseover','onmouseout','script','style','font',
    'dir','face','size','color','style','class','width','height','hspace',
    'border','valign','align','background','bgcolor','text','link','vlink',
    'alink','cellpadding','cellspacing']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)
for tag in soup.recursiveChildGenerator():
    try:
        tag.attrs = [(key,value) for key,value in tag.attrs
                     if key not in REMOVE_ATTRIBUTES]
    except AttributeError:
        # 'NavigableString' object has no attribute 'attrs'
        pass
print(soup.prettify())
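The code above targets BeautifulSoup 3. For readers on Beautiful Soup 4, here is a minimal sketch of the same idea, assuming the `bs4` package is installed and using the standard-library `html.parser` backend. In bs4, `tag.attrs` is a dict rather than a list of pairs, so a dict comprehension does the filtering. The helper name `strip_attributes`, the shortened attribute set and the sample document are illustrative choices, not part of the original answer.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Shortened set for illustration; the full list from the question works the same way.
REMOVE_ATTRIBUTES = {'align', 'onmouseout', 'style', 'class'}

def strip_attributes(soup, names):
    """Hypothetical helper: drop every attribute in `names`, keep tags and content."""
    for tag in soup.find_all(True):  # True matches every tag and skips text nodes
        tag.attrs = {key: value for key, value in tag.attrs.items()
                     if key not in names}
    return soup

doc = ('<html><body><p id="firstpara" align="center">This is '
       '<i>paragraph</i> <a onmouseout="">one</a>.</p></body></html>')
soup = strip_attributes(BeautifulSoup(doc, 'html.parser'), REMOVE_ATTRIBUTES)
print(soup)
```

Because `find_all(True)` yields only tags, no try/except for text nodes is needed, and the comprehension gives the filter-like style asked for in the question. If the nested-loop form is preferred, the attribute name can be passed as data instead of as a keyword, for example `soup.find_all(attrs={attribute: True})`.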