BeautifulSoup: remove tag attributes and text contents


Question

I want to compare some webpages based on their overall DOM structure rather than their particular contents. To this end, I need a representation that preserves the tag hierarchy but does not include attributes or the textual contents of tags.

Basically, I want to turn a representation like this

<!DOCTYPE html>
<html>
<body>

<h1 id="peter">My First Heading</h1>
<p><span style="color:red">My</span> first paragraph.</p>

<img src="peter.jpg" />

</body>
</html>

into a canonical bare-bones representation like this:

<html><body><h1></h1><p><span></span></p><img/></body></html>

That is, with all attributes removed, along with any tag contents that are not themselves tags.

I found a way to remove attributes from tags, but I'm having problems differentiating between text child nodes and tag child nodes.

Recommended answer

As the Beautiful Soup documentation says:

You can't edit a string in place, but you can replace one string with another, using replace_with().

So I would go for something like this (assuming soup is parsed from exactly what you posted):

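# find_all(True) matches every tag in the document
for e in soup.find_all(True):
    # drop all attributes of this tag
    e.attrs = {}

    for i in e.contents:
        # i.string is set when i is a text node, or a tag whose only content is a single string
        if i.string:
            i.string.replace_with('')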

I think that without looping over each tag's contents, you would end up with leftover text whenever a tag has more than one child, one of them being text and another being a tag that itself contains text (as in your example, <p><span style="color:red">My</span> first paragraph.</p>).
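To make the mixed-content case concrete, here is a small sketch that inspects the <p> element from the question. The variable names (markup, mixed_string, kinds) and the choice of the html.parser backend are mine, not part of the original answer. It shows that .string gives nothing useful on a tag with several children, and that isinstance checks are one way to tell text children from tag children.

```python
from bs4 import BeautifulSoup, Tag

# The mixed-content paragraph from the question
markup = '<p><span style="color:red">My</span> first paragraph.</p>'
p = BeautifulSoup(markup, "html.parser").p

# .string is None here because <p> has more than one child
mixed_string = p.string

# Classify each direct child: Tag instances are tags, anything else is text
kinds = ["tag" if isinstance(c, Tag) else "text" for c in p.contents]

print(mixed_string, kinds)
```

This is why the loop has to visit the contents of every tag rather than rely on .string of the outer tag alone.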

When run against your example:

(env) $ python strip.py
<!DOCTYPE html>

<html><body><h1></h1><p><span></span></p><img/></body></html>

(It can be changed a little so that it doesn't return the newlines or the doctype.)
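One possible way to make that change, as a sketch rather than the answerer's own code: instead of replacing strings with '', remove every NavigableString node outright. In bs4 the doctype and comments are NavigableString subclasses, so they go too. The function name dom_skeleton and the use of html.parser are assumptions for illustration.

```python
from bs4 import BeautifulSoup, NavigableString

def dom_skeleton(html: str) -> str:
    """Return the markup reduced to its bare tag hierarchy (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop the attributes of every tag
    for tag in soup.find_all(True):
        tag.attrs = {}

    # Copy the descendants into a list first, since extract() mutates the tree.
    # NavigableString covers text, whitespace, comments and the doctype.
    for node in list(soup.descendants):
        if isinstance(node, NavigableString):
            node.extract()

    return str(soup)

page = """<!DOCTYPE html>
<html>
<body>

<h1 id="peter">My First Heading</h1>
<p><span style="color:red">My</span> first paragraph.</p>

<img src="peter.jpg" />

</body>
</html>
"""

print(dom_skeleton(page))
```

For the page in the question this should give the single-line form asked for, and two pages can then be compared by comparing the returned strings.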
