Extract text between tags with XPath including markup


Problem description

I have the following piece of XML:

...<span class="st">In Tim <em>Power</em>: Politieman...</span>...

I want to extract the part between the <span> tags. For this I use XPath:

/span[@class="st"]

This, however, extracts everything, including the <span> tags themselves, and

/span[@class="st"]/text()

returns a list of two text nodes: one containing "In Tim " and the other ": Politieman...". The <em>...</em> element is not included and acts like a separator.

Is there a pure XPath solution which returns:

In Tim <em>Power</em>: Politieman...

EDIT: Thanks to @helderdarocha and @TextGeek. It seems non-trivial to extract the plain text, including the <em> markup, with XPath alone.

The /span[@class="st"]/node() solution produces a list of the individual nodes, from which it is trivial to build a string in Python.

Solution

To get any child node you can use:

/span[@class="st"]/node()

This will return:

  1. Two child text nodes
  2. The full <em> node (element and contents).
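
As a rough sketch of what joining the child nodes into one string can look like in Python: the standard library's xml.etree.ElementTree supports only a subset of XPath and has no node() step, so the example below walks the child nodes by hand rather than evaluating the expression above. The helper name inner_markup, the wrapping <div>, and the sample string are assumptions made for illustration. A library with full XPath 1.0 support, such as lxml, could evaluate the expression directly.

```python
import xml.etree.ElementTree as ET

# Sample fragment from the question, wrapped in a <div> (an assumption)
# so that the <span> is not the document root.
XML = '<div><span class="st">In Tim <em>Power</em>: Politieman...</span></div>'


def inner_markup(element):
    """Return the content of `element` as a string, child markup included.

    Hypothetical helper: it mimics joining the nodes that element/node()
    would select, using only ElementTree features.
    """
    # Text that precedes the first child element.
    parts = [element.text or ""]
    for child in element:
        # ET.tostring() serializes the child together with its tail,
        # i.e. the text that follows the child inside the parent.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


# ElementTree's XPath subset does support attribute predicates.
span = ET.fromstring(XML).find(".//span[@class='st']")
result = inner_markup(span)
print(result)
```

The <span> tags themselves are left out because only the element's own text and its children are serialized.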

If you actually want all the text() nodes, including the ones inside em, then get all the text() descendants:

/span[@class="st"]//text()

or

/span[@class="st"]/descendant::text()

This returns three text nodes, including the text inside <em>, but not the <em> element itself.
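
A minimal sketch of the same idea in Python, again assuming the standard library's xml.etree.ElementTree: it cannot evaluate a text() step, but Element.itertext() walks the descendant text in document order, which is used here as a stand-in for descendant::text(). The wrapping <div> and the variable names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Sample fragment from the question, wrapped in a <div> (an assumption).
XML = '<div><span class="st">In Tim <em>Power</em>: Politieman...</span></div>'

span = ET.fromstring(XML).find(".//span[@class='st']")

# itertext() yields the element's own text and the text of all of its
# descendants in document order, without any element markup.
text_nodes = list(span.itertext())
print(text_nodes)

# Joining the pieces gives the plain text with the <em> tags dropped.
plain_text = "".join(text_nodes)
print(plain_text)
```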
