Clone an element with BeautifulSoup

Question

I have to copy a part of one document to another, but I don't want to modify the document I copy from.

If I use .extract(), it removes the element from the tree. If I just append the selected element, as in document2.append(document1.tag), it still removes the element from document1.
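The behavior described above can be reproduced with a minimal sketch. It assumes two small in-memory documents parsed with the standard-library `html.parser` backend; the markup and the `someid` id are made up for illustration.

```python
from bs4 import BeautifulSoup

# Two tiny stand-in documents (hypothetical markup, for illustration only).
document1 = BeautifulSoup('<body><div id="someid">hello</div></body>', 'html.parser')
document2 = BeautifulSoup('<body></body>', 'html.parser')

# Appending a tag that already belongs to another tree detaches it
# from that tree first, so this is a move rather than a copy.
document2.body.append(document1.find('div', id='someid'))

moved_out = document1.find('div', id='someid') is None      # gone from the source
moved_in = document2.find('div', id='someid') is not None   # present in the target
print(moved_out, moved_in)
```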

Since I work with real files, I could simply not save document1 after modifying it, but is there a way to do this without corrupting the document?

Recommended answer

There is no native clone function in BeautifulSoup in versions before 4.4 (released July 2015); you'd have to create a deep copy yourself, which is tricky as each element maintains links to the rest of the tree.
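One way to tell which situation applies is to check the installed version at runtime. This is only a sketch: it assumes `bs4.__version__` is a dotted string whose first two components are plain integers, and `has_native_copy` is a name invented here for illustration.

```python
import bs4

# Take the major and minor components of the version string, e.g. "4.12.3" -> (4, 12).
major, minor = (int(part) for part in bs4.__version__.split('.')[:2])

# Native copying of elements arrived in 4.4; older versions need a manual clone.
has_native_copy = (major, minor) >= (4, 4)
print(has_native_copy)
```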

To clone an element and all its descendants, you'd have to copy all attributes and reset their parent-child relationships; this has to happen recursively. This is best done by not copying the relationship attributes and instead re-seating each recursively-cloned element:

from bs4 import Tag, NavigableString

def clone(el):
    # strings have no children: rebuild one of the same class
    if isinstance(el, NavigableString):
        return type(el)(el)

    # the namespace prefix is stored on the tag as .prefix
    copy = Tag(None, el.builder, el.name, el.namespace, el.prefix)
    # work around bug where there is no builder set
    # https://bugs.launchpad.net/beautifulsoup/+bug/1307471
    copy.attrs = dict(el.attrs)
    for attr in ('can_be_empty_element', 'hidden'):
        setattr(copy, attr, getattr(el, attr))
    # append() sets the parent and sibling links on each cloned child
    for child in el.contents:
        copy.append(clone(child))
    return copy

This method is somewhat sensitive to the BeautifulSoup version in use; I tested it with 4.3, and future versions may add attributes that need to be copied too.

You could also monkeypatch this functionality into BeautifulSoup:

from bs4 import Tag, NavigableString


def tag_clone(self):
    # the namespace prefix is stored on the tag as .prefix
    copy = type(self)(None, self.builder, self.name, self.namespace,
                      self.prefix)
    # work around bug where there is no builder set
    # https://bugs.launchpad.net/beautifulsoup/+bug/1307471
    copy.attrs = dict(self.attrs)
    for attr in ('can_be_empty_element', 'hidden'):
        setattr(copy, attr, getattr(self, attr))
    for child in self.contents:
        copy.append(child.clone())
    return copy


Tag.clone = tag_clone
NavigableString.clone = lambda self: type(self)(self)

letting you call .clone() on elements directly:

document2.body.append(document1.find('div', id='someid').clone())

My feature request to the BeautifulSoup project was accepted and tweaked to use the copy.copy() function; now that BeautifulSoup 4.4 is released you can use that version (or newer) and do:

import copy

document2.body.append(copy.copy(document1.find('div', id='someid')))
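To see that the source document is left untouched, here is a small self-contained sketch. It assumes BeautifulSoup 4.4 or newer and the standard-library `html.parser` backend; the markup, the `someid` id, and the variable names are made up for illustration.

```python
import copy
from bs4 import BeautifulSoup

# Two tiny stand-in documents (hypothetical markup, for illustration only).
document1 = BeautifulSoup('<body><div id="someid"><p>hello</p></div></body>', 'html.parser')
document2 = BeautifulSoup('<body></body>', 'html.parser')

original = document1.find('div', id='someid')

# copy.copy() on a tag returns a new tag, with copied children,
# that is not attached to any tree (needs BeautifulSoup 4.4+).
duplicate = copy.copy(original)
detached = duplicate.parent is None

document2.body.append(duplicate)

# The original element is still in document1; document2 received the copy.
still_in_source = document1.find('div', id='someid') is original
target_markup = str(document2.body)
print(still_in_source, target_markup)
```

Because the copy starts out with no parent, appending it to document2 has nothing to detach from document1.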
