Using Python and lxml to strip only the tags that have certain attributes/values
Question
I'm familiar with etree's strip_tags and strip_elements methods, but I'm looking for a straightforward way of stripping only those tags that have particular attributes/values, while leaving their contents in place.
For instance: I'd like to strip all span or div tags (or other elements) that have a class='myclass' attribute/value from an XHTML tree, preserving each element's contents as strip_tags would. Meanwhile, those same elements that don't have class='myclass' should remain untouched.
Conversely: I'd like a way to strip all "naked" spans or divs from a tree, meaning only those spans/divs (or any other elements, for that matter) that have absolutely no attributes, while leaving those same elements untouched when they have any attributes at all.
I feel I'm missing something obvious, but I've been searching without any luck for quite some time.
Recommended answer
HTML
lxml's HTML elements have a drop_tag() method, which you can call on any element in a tree parsed by lxml.html.
It acts similarly to strip_tags in that it removes the element's tag but keeps its text and children. Because it is called on an individual element, you can select the elements you want to strip with an XPath expression, then loop over them and drop their tags:
doc.html
<html>
<body>
<div>This is some <span attr="foo">Text</span>.</div>
<div>Some <span>more</span> text.</div>
<div>Yet another line <span attr="bar">of</span> text.</div>
<div>This span will get <span attr="foo">removed</span> as well.</div>
<div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
<div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
</body>
</html>
strip.py
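from lxml import etree
from lxml import html

doc = html.parse('doc.html')
# Select every <span> whose attr attribute equals "foo"
spans_with_attrs = doc.xpath("//span[@attr='foo']")
for span in spans_with_attrs:
    # Remove the tag itself but keep its text and children
    span.drop_tag()
print(etree.tostring(doc, encoding='unicode'))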
Output:
<html>
<body>
<div>This is some Text.</div>
<div>Some <span>more</span> text.</div>
<div>Yet another line <span attr="bar">of</span> text.</div>
<div>This span will get removed as well.</div>
<div>Nested elements will <b>be</b> left alone.</div>
<div>Unless they also match.</div>
</body>
</html>
In this case, the XPath expression //span[@attr='foo'] selects all the span elements that have an attribute attr with the value foo. See an XPath tutorial for more details on how to construct XPath expressions.
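The two cases from the question (elements with class='myclass', and "naked" elements with no attributes at all) can be expressed as XPath predicates and combined with drop_tag(). The following is a minimal sketch, not a definitive implementation: the sample markup, the constant names and the drop_matching helper are made up for illustration, and the class test is an exact string comparison, so an element with class="myclass other" would not match.

```python
from lxml import html

# Hypothetical sample markup, for illustration only
SAMPLE = (
    '<div id="root">'
    '<div class="myclass">Keep <span>this</span> text.</div>'
    '<span class="other">Untouched</span>'
    '<span>Naked span</span>'
    '</div>'
)

# span or div elements whose class attribute is exactly "myclass"
CLASS_MATCH = "//*[self::span or self::div][@class='myclass']"
# span or div elements that carry no attributes at all
NAKED = "//*[self::span or self::div][not(@*)]"

def drop_matching(root, xpath_expr):
    """Drop the tag of every element matched by xpath_expr, keeping contents."""
    for el in root.xpath(xpath_expr):
        el.drop_tag()
    return root

root = html.fromstring(SAMPLE)
drop_matching(root, CLASS_MATCH)
by_class = html.tostring(root, encoding="unicode")

root = html.fromstring(SAMPLE)
drop_matching(root, NAKED)
naked = html.tostring(root, encoding="unicode")

print(by_class)
print(naked)
```

drop_tag() requires the element to have a parent, so an expression that could match the root element of the tree would need an extra guard.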
Edit: I just noticed that you specifically mention XHTML in your question, which according to the docs is better parsed as XML. Unfortunately, the drop_tag() method is only available for elements in an HTML document, that is, one parsed by lxml.html.
So for XML it's a bit more complicated:
doc.xml
<document>
<node>This is <span>some</span> text.</node>
<node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>
strip.py
from lxml import etree

def strip_nodes(nodes):
    for node in nodes:
        # Recursive text content of the node (its child elements are discarded)
        text_content = node.xpath('string()')
        # Include tail in full_text because it will be removed with the node
        full_text = text_content + (node.tail or '')
        parent = node.getparent()
        prev = node.getprevious()
        # Compare with None: an element without children evaluates as false
        if prev is not None:
            # There is a previous node, append text to its tail
            prev.tail = (prev.tail or '') + full_text
        else:
            # It's the first node in <parent/>, append to parent's text
            parent.text = (parent.text or '') + full_text
        parent.remove(node)

doc = etree.parse('doc.xml')
nodes = doc.xpath("//span[@attr='foo']")
strip_nodes(nodes)
print(etree.tostring(doc, encoding='unicode'))
Output:
<document>
<node>This is <span>some</span> text.</node>
<node>Only this first span should <span>be</span> removed.</node>
</document>
As you can see, this replaces the node and all of its children with the node's recursive text content. I really hope that's what you want, otherwise things get even more complicated ;-)
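If the child elements need to survive (for example the b inside the stripped span), one possible approach is to move the node's text, children and tail into its parent before removing it. The sketch below is an assumption about how that could be done with plain etree elements; the unwrap function is a hypothetical helper, not part of lxml, and it is intended to behave like drop_tag() for ordinary element nodes.

```python
from lxml import etree

def unwrap(node):
    """Remove node's tag but keep its text, children and tail in place.

    Hypothetical helper for plain etree elements (not an lxml API).
    """
    parent = node.getparent()
    if parent is None:
        raise ValueError("cannot unwrap the root element")
    prev = node.getprevious()

    # Leading text joins the previous sibling's tail, or the parent's text
    if node.text:
        if prev is not None:
            prev.tail = (prev.tail or '') + node.text
        else:
            parent.text = (parent.text or '') + node.text

    # The tail follows the last child if there is one; otherwise it is
    # appended at the same place as the leading text
    if node.tail:
        if len(node):
            last = node[-1]
            last.tail = (last.tail or '') + node.tail
        elif prev is not None:
            prev.tail = (prev.tail or '') + node.tail
        else:
            parent.text = (parent.text or '') + node.tail

    # Replace the node with its own children at the same position
    index = parent.index(node)
    parent[index:index + 1] = list(node)

xml = ('<document><node>Only this <span attr="foo">first <b>span</b></span>'
       ' should <span>be</span> removed.</node></document>')
root = etree.fromstring(xml)
for span in root.xpath("//span[@attr='foo']"):
    unwrap(span)
result = etree.tostring(root, encoding="unicode")
print(result)
```

With this variant the b element stays in the tree and only the matching span tag disappears.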
Note: the last edit changed the code in question.