HTML XPath: Selectively avoiding tags when extracting text
Question
Follow-up to: HTML XPath: Extracting text mixed in with multiple tags?
I've made my test case more difficult:
<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>
</div>
I have the same goal, namely, extracting:
- Central Intelligence Agency
- Culinary Institute of America
Can I selectively choose which tags are excluded?
I've tried things like (for removing 'Military'):
id('mw-content-text')/ol/li[not(self::small)]
but that condition is applied to the 'li' node as a whole, so it's not affected.
If I do something like
id('mw-content-text')/ol/li/*[not(self::small)]
then I'm only filtering on the children, and even though I successfully throw away 'Military', I've also thrown away 'Central', 'Culinary', i.e. text from the parent.
I had understood the tree to be something like:
div -- li
         -- small -- Military
         -- Central
         -- a -- Intelligence Agency
    -- li
         -- Culinary
         -- a -- Institute
         -- of
         -- a -- America
         -- br
         -- Renowned cooking school.
Is that correct? Is there a way to say 'text nodes of li and of li's descendants, EXCEPT descendants of small'? How about '... EXCEPT a br element and all text nodes that follow it'?
Again, a (partially) Pythonic solution is also acceptable, though XPath is preferred.
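As one possible shape for such a partly Pythonic approach, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree and walks each li by hand. The helper name visible_text and its skip_tags / stop_tag parameters are hypothetical, and the sketch assumes the snippet parses as well-formed XML. Note that it stops at the first br in document order, which is slightly stricter than a preceding-sibling test in XPath.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>
</div>"""


def visible_text(elem, skip_tags=("small",), stop_tag="br"):
    """Collect text under elem, skipping subtrees whose tag is in skip_tags
    and ignoring everything from the first stop_tag element onwards."""
    parts = []

    def walk(node):
        # ElementTree stores text before the first child in .text
        if node.text:
            parts.append(node.text)
        for child in node:
            if child.tag == stop_tag:
                return False  # signal: stop collecting entirely
            if child.tag not in skip_tags and not walk(child):
                return False
            # Text following a child element belongs to the parent,
            # but ElementTree exposes it as the child's .tail
            if child.tail:
                parts.append(child.tail)
        return True

    walk(elem)
    # Join the pieces and normalise runs of whitespace
    return " ".join("".join(parts).split())


root = ET.fromstring(SNIPPET)
results = [visible_text(li) for li in root.findall("./ol/li")]
print(results)
```

The trailing period of each item is kept because it is part of the li text; it can be stripped afterwards if it is not wanted.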
After sitting down to read Chapter 6 'XPath and XPointer' of 'Learning XML, Second Edition' by Erik Ray, I think I've got a grasp on it. I came up with the following formulation:
id('mw-content-text')/ol/li//text()[not(parent::small) and not(preceding-sibling::br)]
In this case, it doesn't seem possible to concatenate the resulting node set of text nodes within XPath itself. When we simply feed the 'li' element to the string function, the resulting string-value is the concatenation of all text-node descendants of li. Here, however, we need further filtering, so we end up with a node set (of qualifying text nodes) instead of a single element node. Regarding concatenating node sets, a helpful SO question is "XPath to return string concatenation of qualifying child node values".
Any advice on how to improve this solution?
Recommended answer
Use:
/*/ol/li/descendant-or-self::*
[text() and not(self::small)]
/text()[not(preceding-sibling::br)]
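To see what this expression selects, here is a hedged sketch that evaluates it per li with the third-party lxml library (an assumption; the answer itself does not name a library) and joins the resulting text nodes in Python, since XPath 1.0 cannot concatenate a node set on its own. The names SNIPPET, ANSWER_XPATH and items are illustrative only, and the expression is made relative to each li so that the items stay separate.

```python
from lxml import etree

SNIPPET = """<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>
</div>"""

root = etree.fromstring(SNIPPET)

# The answer's expression, evaluated relative to one li at a time:
# take the li and every descendant element that has text children
# and is not a small, then keep the text children not preceded by a br.
ANSWER_XPATH = (
    "descendant-or-self::*[text() and not(self::small)]"
    "/text()[not(preceding-sibling::br)]"
)

items = []
for li in root.xpath("/*/ol/li"):
    text_nodes = li.xpath(ANSWER_XPATH)  # list of string results
    # Concatenate in Python and normalise whitespace
    items.append(" ".join("".join(text_nodes).split()))

print(items)
```

Each joined string still ends with the period from the li; removing it is left to the caller.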