Using Python and lxml to strip only the tags that have certain attributes/values

Question

I'm familiar with etree's strip_tags and strip_elements methods, but I'm looking for a straightforward way of stripping tags (and leaving their contents) that only contain particular attributes/values.

For instance: I'd like to strip all span or div tags (or other elements) from a tree (xhtml) that have a class='myclass' attribute/value (preserving the element's contents like strip_tags would do). Meanwhile, those same elements that don't have class='myclass' should remain untouched.

Conversely: I'd like a way to strip all "naked" spans or divs from a tree. Meaning only those spans/divs (or any other elements for that matter) that have absolutely no attributes. Leaving those same elements that have attributes (any) untouched.

I feel I'm missing something obvious, but I've been searching without any luck for quite some time.

Recommended answer

HTML

lxml's HTML elements have a method drop_tag(), which you can call on any element in a tree parsed by lxml.html.

It acts similarly to strip_tags in that it removes the element but retains its text and children. Unlike strip_tags, it is called on an individual element, which means you can easily select the elements you're not interested in with an XPath expression, and then loop over them and remove them:

doc.html

<html>
    <body>
        <div>This is some <span attr="foo">Text</span>.</div>
        <div>Some <span>more</span> text.</div>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get <span attr="foo">removed</span> as well.</div>
        <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
        <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
    </body>
</html>

strip.py

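from lxml import etree
from lxml import html

# Parse the file as HTML so that its elements provide drop_tag()
doc = html.parse('doc.html')
spans_with_attrs = doc.xpath("//span[@attr='foo']")

for span in spans_with_attrs:
    # Remove the tag itself, keeping its text and children in place
    span.drop_tag()

# encoding='unicode' makes tostring() return str instead of bytes on Python 3
print(etree.tostring(doc, encoding='unicode'))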

Output:

<html>
    <body>
        <div>This is some Text.</div>
        <div>Some <span>more</span> text.</div>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get removed as well.</div>
        <div>Nested elements will <b>be</b> left alone.</div>
        <div>Unless they also match.</div>
    </body>
</html>

In this case, the XPath expression //span[@attr='foo'] selects all the span elements with an attribute attr of value foo. See this XPath tutorial for more details on how to construct XPath expressions.
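The same approach can be adapted to the two cases described in the question by changing only the XPath expression. The following is a minimal sketch, assuming the document is parsed with lxml.html; the sample markup, the SAMPLE constant, and the drop_matching helper are made up for illustration and are not part of lxml.

```python
from lxml import html

# Hypothetical sample markup, not taken from the original question
SAMPLE = (
    '<div id="root">'
    '<div class="myclass">Keep <b>this</b> content</div>'
    '<span class="other">untouched</span>'
    '<span>naked</span>'
    '</div>'
)


def drop_matching(root, expression):
    """Call drop_tag() on every element selected by the XPath expression."""
    for element in root.xpath(expression):
        element.drop_tag()
    return root


# Case 1: span or div elements whose class attribute is exactly 'myclass'
by_class = drop_matching(
    html.fromstring(SAMPLE),
    "//*[self::span or self::div][@class='myclass']",
)
result_by_class = html.tostring(by_class, encoding='unicode')

# Case 2: "naked" span or div elements, i.e. those with no attributes at all
naked = drop_matching(
    html.fromstring(SAMPLE),
    "//*[self::span or self::div][not(@*)]",
)
result_naked = html.tostring(naked, encoding='unicode')

print(result_by_class)
print(result_naked)
```

Note that @class='myclass' only matches when the attribute value is exactly myclass; an element carrying several classes would need a different predicate.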

Edit: I just noticed you specifically mention XHTML in your question, which according to the docs is better parsed as XML. Unfortunately, the drop_tag() method is only available for elements in an HTML document, that is, one parsed by lxml.html.
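A quick way to see the difference is to parse the same markup with both parsers and check for the method. This is only an illustrative check; the variable names are made up.

```python
from lxml import etree, html

markup = '<div>Some <span>text</span></div>'

# Parsed by lxml.html: elements are HTML elements and provide drop_tag()
html_element = html.fromstring(markup)

# Parsed by lxml.etree: plain XML elements, which do not provide drop_tag()
xml_element = etree.fromstring(markup)

has_drop_tag_html = hasattr(html_element, 'drop_tag')
has_drop_tag_xml = hasattr(xml_element, 'drop_tag')

print(has_drop_tag_html, has_drop_tag_xml)
```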

So for XML it's a bit more complicated:

doc.xml

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>

strip.py

from lxml import etree


def strip_nodes(nodes):
    for node in nodes:
        # Recursive text content of the node and all its descendants
        text_content = node.xpath('string()')

        # Include tail in full_text because it will be removed with the node
        full_text = text_content + (node.tail or '')

        parent = node.getparent()
        prev = node.getprevious()
        # Compare against None: an element without children is falsy
        if prev is not None:
            # There is a previous node, append text to its tail
            prev.tail = (prev.tail or '') + full_text
        else:
            # It's the first node in <parent/>, append to parent's text
            parent.text = (parent.text or '') + full_text
        parent.remove(node)


doc = etree.parse('doc.xml')
nodes = doc.xpath("//span[@attr='foo']")
strip_nodes(nodes)

# encoding='unicode' makes tostring() return str instead of bytes on Python 3
print(etree.tostring(doc, encoding='unicode'))

Output:

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this first span should <span>be</span> removed.</node>
</document>

As you can see, this will replace node and all its children with the recursive text content. I really hope that's what you want, otherwise things get even more complicated ;-)
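If the child elements should be kept rather than flattened to text, one possible workaround is to rename the matched elements to a placeholder tag and then let etree.strip_tags remove that tag. This is a sketch of an alternative technique, not the method used in the answer above; the SENTINEL name and the unwrap_nodes helper are hypothetical, and the placeholder tag is assumed not to occur anywhere in the real document.

```python
from lxml import etree

# Hypothetical placeholder tag name; it must not occur in the real document
SENTINEL = 'strip-me'

XML = (
    '<document>'
    '<node>Only this <span attr="foo">first <b>span</b></span>'
    ' should <span>be</span> removed.</node>'
    '</document>'
)


def unwrap_nodes(tree, nodes):
    """Remove the given elements but keep their text and child elements."""
    for node in nodes:
        # Mark the element so that strip_tags can target it by name
        node.tag = SENTINEL
    # strip_tags merges the text and children of each marked element
    # into its parent and drops the element together with its attributes
    etree.strip_tags(tree, SENTINEL)


root = etree.fromstring(XML)
unwrap_nodes(root, root.xpath("//span[@attr='foo']"))
result = etree.tostring(root, encoding='unicode')
print(result)
```

Here the b element inside the stripped span survives, while the span without attr="foo" is left alone.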

Note: the last edit changed the code in question.
