lxml/ElementTree and .tail


Problem description

I looked around for an ElementTree-specific mailing list, but found
none -- my apologies if this is too broad a forum for this question.

I've been using the lxml variant of the ElementTree API, which I
understand works in much the same way (with some significant
additions). In particular, it shares the use of a .tail attribute.
I ran headlong into this aspect of the API while doing some DOM
manipulations, and it's got me pretty confused.

Example:

>>> from lxml import etree as ET
>>> frag = ET.XML('<a>head<b>inside</b>tail</a>')
>>> b = frag.xpath('//b')[0]
>>> b
<Element b at 71cbe8>
>>> b.text
'inside'
>>> b.tail
'tail'
>>> frag.remove(b)
>>> ET.tostring(frag)
'<a>head</a>'

As you can see, the .tail text is removed as part of the <b> element
-- but it IS NOT part of the <b> element. I understand the use of
the .tail attribute given the desire to simplify the API by avoiding
pure text nodes, but it seems entirely inappropriate for the tail
text to disappear into the ether when what is technically a sibling
node is removed.

Performing the same operations with the Java DOM API (Crimson, in
this case, it turns out) yields what I would expect (here I'm using
JPype to access a v1.4.2 JVM through Python -- which makes things
somewhat less painful):

>>> from jpype import *
>>> startJVM(getDefaultJVMPath())
>>> builder = javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder()
>>> xml = java.io.ByteArrayInputStream(java.lang.String('<a>head<b>inside</b>tail</a>').getBytes())
>>> doc = builder.parse(xml)
>>> a = doc.documentElement
>>> a.toString()
u'<a>head<b>inside</b>tail</a>'
>>> b = a.getElementsByTagName('b').item(0)
>>> a.removeChild(b)
>>> a.toString()
u'<a>headtail</a>'

(Sorry for the Java comparison, but that's where I first cut my teeth
on XML, and that's where my expectations were formed.)

That's a pretty significant mismatch in functionality. I certainly
understand the motivations of Mr. Lundh to make the ET API as
pythonic as possible, but ET's behaviour in this specific context is
flatly wrong as far as I can see. I would have expected that a
removal operation would have appended <b>'s tail text to the text of
<a> (or perhaps to the tail text of <b>'s closest preceding sibling)
-- something that I think I'm going to have to do in order to
continue using lxml / ElementTree.
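
The workaround described in the previous paragraph could be sketched roughly as follows. This is only an illustration, not code from the thread: it uses the standard-library xml.etree.ElementTree so that it runs without lxml, and the helper name remove_keep_tail is hypothetical (neither library provides a function by that name).

```python
import xml.etree.ElementTree as ET


def remove_keep_tail(parent, child):
    """Remove child from parent, but leave child's tail text in the tree.

    Hypothetical helper: the tail is appended to the preceding sibling's
    tail, or to the parent's text if child is the first child.
    """
    tail = child.tail or ""
    siblings = list(parent)
    index = siblings.index(child)  # raises ValueError if not a direct child
    if tail:
        if index == 0:
            parent.text = (parent.text or "") + tail
        else:
            previous = siblings[index - 1]
            previous.tail = (previous.tail or "") + tail
    parent.remove(child)


frag = ET.XML("<a>head<b>inside</b>tail</a>")
remove_keep_tail(frag, frag.find("b"))
print(ET.tostring(frag, encoding="unicode"))  # <a>headtail</a>
```

The same function body should also work on lxml elements, since it relies only on the shared .text, .tail and remove() parts of the API, but that is an assumption worth checking against the lxml version in use.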

I ran this issue past a few people I know who've worked with and
written about ElementTree, and their response to this apparent
divergence between the ET DOM API and "standard" DOM APIs was
roughly: "that's just the way it is".

Comments, thoughts?

Chas Emerick
Founder, Snowtide Informatics Systems
Enterprise-class PDF content extraction

ce******@snowtide.com
http://snowtide.com | +1 413.519.6365

Solution

Hi,

Chas Emerick wrote:

> I looked around for an ElementTree-specific mailing list, but found none
> -- my apologies if this is too broad a forum for this question.

The lxml mailing list is always happy to receive feedback, but it's fine to
ask here if it's not lxml specific.

> I've been using the lxml variant of the ElementTree API.
> It shares the use of a .tail attribute. I
> ran headlong into this aspect of the API while doing some DOM
> manipulations, and it's got me pretty confused.
>
> Example:
>
> >>> from lxml import etree as ET
> >>> frag = ET.XML('<a>head<b>inside</b>tail</a>')
> >>> b = frag.xpath('//b')[0]
> >>> b
> <Element b at 71cbe8>
> >>> b.text
> 'inside'
> >>> b.tail
> 'tail'
> >>> frag.remove(b)
> >>> ET.tostring(frag)
> '<a>head</a>'
>
> As you can see, the .tail text is removed as part of the <b> element --
> but it IS NOT part of the <b> element.

Yes, it is. Just look at the API. It's an attribute of an Element, isn't it?
What other API do you know where removing an element from a data structure
leaves part of the element behind?

If you want to copy part of the removed element back into the tree, feel free
to do so.

> Performing the same operations with the Java DOM API
> (Sorry for the Java comparison, but that's where I first cut my teeth on
> XML, and that's where my expectations were formed.)
>
> That's a pretty significant mismatch in functionality.

IMHO, DOM has a pretty significant mismatch with Python.

> I ran this issue past a few people I know who've worked with and written
> about ElementTree, and their response to this apparent divergence
> between the ET DOM API and "standard" DOM APIs was roughly: "that's just
> the way it is".

It's just a matter of understanding (or getting used to) the API. You might
want to stop thinking in terms of '<' and '>' and rather embrace the API
itself as a way to work with the XML Infoset (rather than the XML DOM).
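
To make that model concrete, here is a small illustration of where the text of the example fragment ends up in the ElementTree data model. It is not from the thread, and it uses the standard-library xml.etree.ElementTree on the assumption that the .text/.tail model is the same one lxml follows.

```python
import xml.etree.ElementTree as ET

a = ET.XML("<a>head<b>inside</b>tail</a>")
b = a.find("b")

# Text is stored on elements; there are no separate text nodes.
print(repr(a.text))  # 'head'   (text before the first child of <a>)
print(repr(b.text))  # 'inside' (text inside <b>)
print(repr(b.tail))  # 'tail'   (text after </b>, stored on b itself)
print(repr(a.tail))  # None     (nothing follows the root element)

# itertext() yields the text pieces in document order.
print("".join(a.itertext()))  # headinsidetail
```

Seen this way, removing b takes b.tail with it simply because the tail is stored on b, not on a.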

Stefan


Stefan Behnel wrote:

> If you want to copy part of the removed element back into the tree, feel free
> to do so.

and that can of course be done with a short helper function.

when removing elements from trees, I often set the tag for those
elements to some "garbage" value during processing, and then call
something like

http://effbot.org/zone/element-bits-...es.htm#cleanup

to clean things up before serializing the tree.
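
The mark-then-clean-up approach described above might look roughly like the sketch below. It is not the code behind the link: the sentinel tag GARBAGE and the function sweep are hypothetical names, and this version assumes that a marked element is dropped together with its content while its tail text is kept.

```python
import xml.etree.ElementTree as ET

GARBAGE = "__garbage__"  # hypothetical sentinel tag used to mark elements


def sweep(elem, garbage_tag=GARBAGE):
    """Drop every descendant tagged garbage_tag, preserving its tail text."""
    kept = []
    for child in list(elem):
        if child.tag == garbage_tag:
            if child.tail:
                if kept:
                    # Attach the tail to the last surviving sibling.
                    kept[-1].tail = (kept[-1].tail or "") + child.tail
                else:
                    # No surviving sibling yet: the tail joins the parent's text.
                    elem.text = (elem.text or "") + child.tail
        else:
            sweep(child, garbage_tag)
            kept.append(child)
    elem[:] = kept  # replace the children with the surviving ones


root = ET.XML("<a>head<b>inside</b>tail<c>keep</c>end</a>")
root.find("b").tag = GARBAGE  # mark during processing
sweep(root)                   # clean up once, before serializing
print(ET.tostring(root, encoding="unicode"))  # <a>headtail<c>keep</c>end</a>
```

Marking first and sweeping once avoids mutating the tree while it is being traversed, which is presumably part of the appeal of this approach.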

</F>


Stefan Behnel wrote:

[Remove an element, remove following nodes]

> Yes, it is. Just look at the API. It's an attribute of an Element, isn't it?
> What other API do you know where removing an element from a data structure
> leaves part of the element behind?

I guess it depends on what you regard an element to be...

[...]

> IMHO, DOM has a pretty significant mismatch with Python.

...in the DOM or otherwise:

http://www.w3.org/TR/2006/REC-xml-20...logical-struct

Paul

