BeautifulSoup: remove tag attributes and text contents
Problem description
I want to compare some webpages based on their overall DOM structure rather than their particular contents. To this end, I need a representation that resembles the tag hierarchy but does not include attributes or textual tag contents.
Basically, I want to turn a representation like this
<!DOCTYPE html>
<html>
<body>
<h1 id="peter">My First Heading</h1>
<p><span style="color:red">My</span> first paragraph.</p>
<img src="peter.jpg" />
</body>
</html>
into a canonical bare-bones representation like this:
<html><body><h1></h1><p><span></span></p><img/></body></html>
That is, all attributes are removed, along with any tag contents that are not themselves tags.
I found a way to remove attributes from tags, but I'm having trouble differentiating between text child nodes and tag child nodes.
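One possible way to tell the two kinds of child nodes apart is to check whether each child is an instance of `bs4.element.Tag`; text, comments and the doctype are `NavigableString` subclasses and fail that check. The sketch below is only an illustration under a few assumptions: it uses the built-in `html.parser` backend, the helper name `skeleton` is made up for this example, and instead of mutating the tree it rebuilds the structure as a string, writing void elements such as `img` in self-closing form.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def skeleton(node):
    """Return the tag hierarchy below `node` as a string, without attributes or text.

    `skeleton` is a hypothetical helper name, not part of BeautifulSoup.
    """
    parts = []
    for child in node.children:
        # Text nodes, comments and the doctype are not Tag instances, so skip them.
        if not isinstance(child, Tag):
            continue
        if child.is_empty_element:
            # Void elements with no contents (e.g. <img>) are written self-closing.
            parts.append(f"<{child.name}/>")
        else:
            # Recurse so that only nested tags survive inside this element.
            parts.append(f"<{child.name}>{skeleton(child)}</{child.name}>")
    return "".join(parts)


html_doc = """<!DOCTYPE html>
<html>
<body>
<h1 id="peter">My First Heading</h1>
<p><span style="color:red">My</span> first paragraph.</p>
<img src="peter.jpg" />
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")
result = skeleton(soup)
print(result)
```

With the sample page from the question, this should print the target form shown above. Other parsers (for example `lxml` or `html5lib`) may insert elements such as `<head>`, so the output can differ depending on the backend chosen.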