HTML XPath: Selectively avoiding tags when extracting text


Question

Follow-up to: HTML XPath: Extracting text mixed in with multiple tags?

I've made my test case more difficult:

<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>

</div>  

I have the same goal, namely, extracting:

  • Central Intelligence Agency
  • Culinary Institute of America

Can I selectively choose which tags are excluded?

I've tried things like this (to remove 'Military'):

id('mw-content-text')/ol/li[not(self::small)]

but that condition is applied to the 'li' node as a whole, so the output is unaffected.

If I do something like

id('mw-content-text')/ol/li/*[not(self::small)]

then I'm only filtering on the child elements, and even though I successfully throw away 'Military', I've also thrown away 'Central' and 'Culinary', i.e. text that belongs directly to the parent.
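The loss of the parent's text can be reproduced outside XPath. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`, which works here only because the sample happens to be well-formed XML. Iterating over an element yields its element children only, which roughly mirrors `li/*`; the names `HTML`, `kept_tags` and `kept_text` are illustrative and not part of the original question.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question (well-formed, so an XML parser accepts it).
HTML = """<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>
</div>"""

root = ET.fromstring(HTML)
items = root.findall("./ol/li")

# Rough equivalent of li/*[not(self::small)]: only element children survive.
kept_tags = [[child.tag for child in li if child.tag != "small"] for li in items]

# The text owned by those surviving elements. 'Central', 'Culinary' and 'of'
# are text nodes of li itself, so they never appear here.
kept_text = [
    [child.text for child in li if child.tag != "small" and child.text]
    for li in items
]

print(kept_tags)
print(kept_text)
```

Selecting child elements can never return text that sits directly under `li`, which is why the filter has to be applied to text nodes rather than to elements.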

I had understood the tree to be something like:

div -- li  
          -- small -- Military  
          -- Central  
          -- a     -- Intelligence Agency  
    -- li  
          -- Culinary  
          -- a     -- Institute  
          -- of  
          -- a    -- America  
          -- br  
          -- Renowned cooking school.  

Is that correct? Is there a way to say 'text nodes of li and of li's descendants EXCEPT descendants of small'? How about '... EXCEPT a br element and all following text nodes'?
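One way to check the tree is to list each li's child nodes in document order. This is a sketch with the standard library, assuming the sample parses as XML; ElementTree stores text in `.text` and `.tail` rather than as separate nodes, so the hypothetical helper `child_nodes` rebuilds the XPath-style sequence from those attributes.

```python
import xml.etree.ElementTree as ET

HTML = """<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>
</div>"""


def child_nodes(elem):
    """Return elem's direct children (text and element) in document order."""
    nodes = []
    if elem.text:
        nodes.append(("text", elem.text))
    for child in elem:
        nodes.append(("element", child.tag))
        if child.tail:  # text that follows the child, still owned by elem
            nodes.append(("text", child.tail))
    return nodes


root = ET.fromstring(HTML)
first, second = [child_nodes(li) for li in root.findall("./ol/li")]

for node in first + second:
    print(node)
```

Under these assumptions the output agrees with the tree sketched in the question, with two additions: the li elements sit under `ol` rather than directly under `div`, and each trailing '.' is a text node of its own.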

Again, (partially) Pythonic solutions are also acceptable, though XPath is preferred.

After sitting down to read Chapter 6, 'XPath and XPointer', of 'Learning XML, Second Edition' by Erik Ray, I think I've got a grasp on it. I came up with the following formulation:

id('mw-content-text')/ol/li//text()[not(parent::small) and not(preceding-sibling::br)]

In this case, it doesn't seem possible to concatenate the resulting node-set of text nodes. When we simply feed the 'li' element to the string function, the resulting string-value is the concatenation of all text nodes that descend from li. But here we need further filtering, so we end up with a node-set (of qualifying text nodes) instead of a single element node. Regarding concatenating node-sets, a helpful SO question can be found here: XPath to return string concatenation of qualifying child node values
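Since Pythonic solutions are acceptable, the concatenation can be done in the host language instead. The sketch below uses only the standard library and emulates the two predicates of the formulation above (`not(parent::small)` and `not(preceding-sibling::br)`); the generator `qualifying_text` is a hypothetical helper written for this example, not a library function.

```python
import xml.etree.ElementTree as ET

HTML = """<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>
</div>"""


def qualifying_text(elem):
    """Yield descendant text nodes of elem in document order, skipping text
    whose parent is <small> and text that has a preceding <br> sibling."""
    parent_ok = elem.tag != "small"
    seen_br = False
    if parent_ok and elem.text:
        yield elem.text  # first text node: it has no preceding siblings
    for child in elem:
        yield from qualifying_text(child)
        if child.tag == "br":
            seen_br = True  # every later text sibling now follows a <br>
        if parent_ok and not seen_br and child.tail:
            yield child.tail


root = ET.fromstring(HTML)
results = ["".join(qualifying_text(li)).strip() for li in root.findall("./ol/li")]
print(results)
```

Joining per li keeps the two entries separate, which a single node-set covering both items would not do. The trailing period remains because it is an ordinary text node of li.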

Any advice on how to improve this solution?

Recommended answer

Use:

 /*/ol/li/descendant-or-self::*
          [text() and not(self::small)]
              /text()[not(preceding-sibling::br)]
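A possible way to run this expression from Python is sketched below. It assumes the third-party `lxml` package is installed (the standard library's ElementTree does not support `text()` steps or these axes), and it evaluates the tail of the expression relative to each li so that the text can be joined per list item. `EXPR`, `extract_lines` and `lines` are illustrative names.

```python
try:
    from lxml import etree  # third-party; assumed to be installed
except ImportError:
    etree = None

HTML = """<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>
</div>"""

# The part of the answer's expression that follows /*/ol/li, made relative.
EXPR = (
    "descendant-or-self::*[text() and not(self::small)]"
    "/text()[not(preceding-sibling::br)]"
)


def extract_lines(html):
    """Return one joined, stripped string per li element."""
    root = etree.fromstring(html)
    return ["".join(li.xpath(EXPR)).strip() for li in root.xpath("/*/ol/li")]


# Skip evaluation when lxml is unavailable instead of failing on import.
lines = extract_lines(HTML) if etree is not None else None
print(lines)
```

The join still happens in Python, because XPath 1.0 offers no function that concatenates an arbitrary node-set.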
