Getting non-contiguous text with lxml / ElementTree


Question

Suppose I have this sort of HTML from which I need to select "text2" using lxml / ElementTree:

<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>

If I already have the div element as mydiv, then mydiv.text returns just "text1".

Using itertext() seems problematic or cumbersome at best since it walks the entire tree under the div.

Is there any simple/elegant way to extract a non-first text chunk from an element?

Recommended answer

Well, lxml.etree provides full XPath 1.0 support, which allows you to address the text nodes that are direct children of the element:

>>> import lxml.etree
>>> fragment = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
>>> div = lxml.etree.fromstring(fragment)
>>> div.xpath('./text()')
['text1', 'text2', 'text3']
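
If XPath is not an option, for example with the standard library's xml.etree.ElementTree, which has only limited XPath support, the same chunks can be reached through the .text / .tail model: text before the first child is stored in the parent's .text, and text that follows a child is stored in that child's .tail. Here is a minimal sketch under that assumption; direct_text_chunks is a hypothetical helper name, not part of either library:

```python
import xml.etree.ElementTree as ET

fragment = '<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>'
div = ET.fromstring(fragment)


def direct_text_chunks(element):
    """Collect the text chunks directly under element, in document order.

    Hypothetical helper for illustration: it reads element.text and the
    .tail of each direct child, without descending into the children.
    """
    chunks = []
    if element.text:
        chunks.append(element.text)
    for child in element:
        # The tail is the text that follows the child's closing tag.
        if child.tail:
            chunks.append(child.tail)
    return chunks


chunks = direct_text_chunks(div)
text2 = div[0].tail  # text right after the first <span>

print(chunks)  # ['text1', 'text2', 'text3']
print(text2)   # text2
```

With lxml, indexing the XPath result gives the same value: div.xpath('./text()')[1] is 'text2'.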
