lxml incremental XML serialisation repeats namespaces


Question

I am currently serializing some fairly large XML files in Python with lxml, and I want to use the incremental writer for that. My XML format relies heavily on namespaces and attributes. When I run the following code:

from io import BytesIO

from lxml import etree

sink = BytesIO()

nsmap = {
    'test': 'http://test.org',
    'foo': 'http://foo.org',
    'bar': 'http://bar.org',
}

with etree.xmlfile(sink) as xf:
    with xf.element("test:testElement", nsmap=nsmap):
        name = etree.QName(nsmap["foo"], "fooElement")
        elem = etree.Element(name)

        xf.write(elem)

print(sink.getvalue().decode('utf-8'))

I get the following output:

<test:testElement xmlns:bar="http://bar.org" 
 xmlns:foo="http://foo.org" 
 xmlns:test="http://test.org">
    <ns0:fooElement xmlns:ns0="http://foo.org"/>
</test:testElement>

As you can see, the namespace for foo is declared again on the child, and with a generated prefix rather than mine:

<ns0:fooElement xmlns:ns0="http://foo.org"/>

How do I make lxml declare the namespaces only on the root, with children using the correct prefix from there? I think I need to use etree.Element, as I need to add some attributes to the node.

What does not work:

1) Using register_namespace

for prefix, uri in nsmap.items():
    etree.register_namespace(prefix, uri)

That still repeats the declaration, but makes the prefix correct. I do not like it much, as it changes state globally.

2) Specifying nsmap on the element:

elem = etree.Element(name, nsmap=nsmap)

yields

<foo:fooElement xmlns:bar="http://bar.org" 
 xmlns:foo="http://foo.org" 
 xmlns:test="http://test.org"/>

for fooElement.

I also looked at the documentation and source code of lxml, but it is written in Cython, so it is really hard to read and search. The context manager of xf.element does not return the element. For example,

with xf.element('foo:fooElement') as e:
    print(e)

prints None.

Recommended answer

It is possible to produce something close to what you are looking for:

from io import BytesIO

from lxml import etree

sink = BytesIO()

nsmap = {
    'test': 'http://test.org',
    'foo': 'http://foo.org',
    'bar': 'http://bar.org',
}

with etree.xmlfile(sink) as xf:
    with xf.element("test:testElement", nsmap=nsmap):
        with xf.element("foo:fooElement"):
            pass

print(sink.getvalue().decode('utf-8'))

This produces the XML:

<test:testElement xmlns:bar="http://bar.org" xmlns:foo="http://foo.org" xmlns:test="http://test.org"><foo:fooElement></foo:fooElement></test:testElement>

The extra namespace declaration is gone, but instead of a self-closing element, you get a pair of opening and closing tags for foo:fooElement.
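Since the question mentions needing attributes on the child node, here is a minimal sketch of the same approach with attributes passed straight to xf.element through its attrib argument, so no separate etree.Element is built. The attribute name "id", its value, and the text content are made up for illustration.

```python
from io import BytesIO

from lxml import etree

nsmap = {
    'test': 'http://test.org',
    'foo': 'http://foo.org',
    'bar': 'http://bar.org',
}


def serialize_with_attributes():
    """Write a root element and one child that carries an attribute."""
    sink = BytesIO()
    with etree.xmlfile(sink) as xf:
        # Namespaces are declared once, on the root element.
        with xf.element("test:testElement", nsmap=nsmap):
            # "id" is a hypothetical attribute used only for this example.
            with xf.element("foo:fooElement", attrib={"id": "1"}):
                # Text content is written inside the open element.
                xf.write("some text")
    return sink.getvalue().decode('utf-8')


output = serialize_with_attributes()
print(output)
```

As in the answer's code, the child is written with explicit opening and closing tags rather than as a self-closing element.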

I looked at the source code of lxml.etree.xmlfile and do not see any code there that maintains state which it could later examine to know which namespaces are already declared and avoid declaring them again needlessly. It is possible I just missed something, but I really don't think I did.

The point of an incremental XML serializer is to operate without using large amounts of memory. When memory is not an issue, you can just create a tree of objects representing the XML document and serialize that. You pay a significant memory cost because the whole tree has to be available in memory until it is serialized. By using an incremental serializer, you can dodge the memory issue. To maximize the memory savings, the serializer must minimize the amount of state it maintains. If, when it produces an element in the serialization, it were to take into account the parents of that element, it would have to "remember" what the parents were and maintain that state. In the worst case it would maintain so much state that it would provide no benefit over just creating a tree of XML objects and serializing that.
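To make the memory argument concrete, here is a rough sketch of streaming many children through the incremental writer, so that only the record currently being written is held in memory. The generate_records generator, the stream_records helper, and the record fields are hypothetical and stand in for whatever actually produces the data.

```python
from io import BytesIO

from lxml import etree

NSMAP = {
    'test': 'http://test.org',
    'foo': 'http://foo.org',
}


def generate_records(count):
    """Hypothetical data source: yields one small record at a time."""
    for index in range(count):
        yield {"id": str(index), "text": "record %d" % index}


def stream_records(sink, records):
    """Write each record as a child element without building a full tree."""
    with etree.xmlfile(sink) as xf:
        with xf.element("test:testElement", nsmap=NSMAP):
            for record in records:
                # Only the current record exists in memory at this point.
                with xf.element("foo:fooElement", attrib={"id": record["id"]}):
                    xf.write(record["text"])


sink = BytesIO()
stream_records(sink, generate_records(3))
result = sink.getvalue().decode('utf-8')
print(result)
```

For a real workload the sink would more likely be a file opened in binary mode than a BytesIO, since the BytesIO itself keeps the whole output in memory.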
