HTML XPath: Selectively avoiding tags when extracting text

Views: 47 | Published: 2021/5/14 21:03:50 | Tags: html, xpath


Problem description

I've made my test case more difficult:

<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>

</div>

I have the same goal, namely, extracting:

Central Intelligence Agency
Culinary Institute of America

Can I selectively choose which tags are excluded?

I've tried things like the following (to remove 'Military'):

id('mw-content-text')/ol/li[not(self::small)]

but that condition is applied to the 'li' node as a whole: no 'li' is itself a 'small' element, so every 'li' passes and nothing is filtered out.

If I do something like

id('mw-content-text')/ol/li/*[not(self::small)]

then I'm only filtering the element children, and even though I successfully throw away 'Military', I've also thrown away 'Central' and 'Culinary', i.e. the text that belongs directly to the parent.
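The effect described above can be reproduced with a rough sketch. It uses Python's standard `xml.etree.ElementTree` as a stand-in for an XPath engine, which only works here because the list from the snippet happens to be well-formed XML; the names `OL` and `children_only` are made up for illustration. Keeping only the element children that are not `small` loses every piece of text that sits directly inside `li`:

```python
import xml.etree.ElementTree as ET

# The <ol> from the snippet above (assumed well-formed, so an XML parser accepts it).
OL = """<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>"""

root = ET.fromstring(OL)
children_only = []
for li in root.findall("li"):
    # Rough equivalent of li/*[not(self::small)]: element children only.
    kept = [child for child in li if child.tag != "small"]
    # itertext() returns the text inside each kept element, not the text around it.
    children_only.append(["".join(child.itertext()) for child in kept])

print(children_only)
# [['Intelligence Agency'], ['Institute', 'America', '']]
```

'Central', 'Culinary' and 'of' are missing because they are text nodes of `li` itself, not of any child element; the empty string comes from the `br` element.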

I had understood the tree to be something like:

div -- li  
          -- small -- Military  
          -- Central  
          -- a     -- Intelligence Agency  
    -- li  
          -- Culinary  
          -- a     -- Institute  
          -- of  
          -- a    -- America  
          -- br  
          -- Renowned cooking school.

Is that correct? Is there a way to say 'text nodes of li and li's descendants EXCEPT descendants of small'? How about '... EXCEPT a br element and all following text nodes'?
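One way to check the shape of the tree is to dump it. The following is a minimal sketch, again assuming the snippet parses as well-formed XML and using the standard `xml.etree.ElementTree` module; `outline` is a hypothetical helper, not a library function. ElementTree stores the text that follows an element in that element's `.tail`, whereas in the XPath data model the same text is a text-node child of the enclosing element, so the helper prints tail text at the child level of the parent:

```python
import xml.etree.ElementTree as ET

SNIPPET = """<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>

</div>"""


def outline(elem, depth=0):
    """Return indented lines for elem, its element children and its text nodes."""
    lines = ["  " * depth + elem.tag]
    if elem.text and elem.text.strip():
        lines.append("  " * (depth + 1) + repr(elem.text))
    for child in elem:
        lines.extend(outline(child, depth + 1))
        # Tail text is a sibling of child, i.e. a child of elem in XPath terms.
        if child.tail and child.tail.strip():
            lines.append("  " * (depth + 1) + repr(child.tail))
    return lines


root = ET.fromstring(SNIPPET)
print("\n".join(outline(root)))
```

With this input, the dump shows `div` containing `h2` and `ol`, with both `li` elements under `ol`. Inside each `li`, ' Central ', 'Culinary ', ' of ', '.' and 'Renowned cooking school.' appear as direct text children of `li`, alongside `small`, `a` and `br`. Whitespace-only text nodes are skipped by the helper.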

Again, (partially) Pythonic solutions are also acceptable, though XPath is preferred.
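As one possible partly Pythonic route, a small recursive walk can collect text while skipping chosen tags and stopping at a `br`. This is a sketch under assumptions: it uses the standard `xml.etree.ElementTree` module (so it needs well-formed markup; real-world HTML would need an HTML parser), and `visible_text`, `skip` and `stop` are illustrative names, not part of any library.

```python
import xml.etree.ElementTree as ET

# The <ol> from the snippet above (assumed well-formed).
OL = """<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>"""


def visible_text(elem, skip=frozenset({"small"}), stop="br"):
    """Concatenate text under elem, ignoring `skip` subtrees and everything from `stop` on."""
    parts = []

    def walk(node):
        # Returns False once the stop tag has been reached.
        if node.text:
            parts.append(node.text)
        for child in node:
            if child.tag == stop:
                return False
            if child.tag not in skip and not walk(child):
                return False
            # Text after a child belongs to the parent, so keep it even for skipped tags.
            if child.tail:
                parts.append(child.tail)
        return True

    walk(elem)
    # Collapse runs of whitespace and trim the ends.
    return " ".join("".join(parts).split())


results = [visible_text(li) for li in ET.fromstring(OL).findall("li")]
print(results)
# ['Central Intelligence Agency.', 'Culinary Institute of America.']
```

The trailing period is part of the `li` text; something like `.rstrip(".")` would drop it if the bare names are wanted.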

After sitting down to read Chapter 6, 'XPath and XPointer', of 'Learning XML, Second Edition' by Erik Ray, I think I've got a grasp on it. I came up with the following formulation:

id('mw-content-text')/ol/li//text()[not(parent::small) and not(preceding-sibling::br)]

In this case, it doesn't seem possible to concatenate the resulting node-set of text nodes within XPath itself. When we simply feed the 'li' element to the string() function, the resulting string-value is the concatenation of all of li's descendant text nodes. But here we need further filtering, so we end up with a node-set (of qualifying text nodes) instead of a single element node. Regarding concatenating node-sets, there is a helpful SO question: 'XPath to return string concatenation of qualifying child node values'.
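In XPath 1.0, `string()` applied to a node-set returns the string-value of only the first node in document order, so the joining usually happens in the host language. The sketch below imitates the predicate `not(parent::small) and not(preceding-sibling::br)` in plain Python and then joins the surviving text nodes. It is an emulation on top of the standard `xml.etree.ElementTree` module, not a real XPath evaluation, and `text_nodes` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# The <ol> from the snippet above (assumed well-formed).
OL = """<ol>
<li><small>Military</small> Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.<br/>Renowned cooking school.</li>
</ol>"""


def text_nodes(elem):
    """Yield (parent_tag, preceding_sibling_tags, string) for each text node under elem."""
    if elem.text:
        yield elem.tag, [], elem.text
    seen = []
    for child in elem:
        yield from text_nodes(child)
        seen.append(child.tag)
        # Tail text is a child of elem whose preceding element siblings are `seen`.
        if child.tail:
            yield elem.tag, list(seen), child.tail


joined = []
for li in ET.fromstring(OL).findall("li"):
    selected = [
        s for parent, before, s in text_nodes(li)
        if parent != "small" and "br" not in before
    ]
    # The "concatenate the node-set" step, done outside XPath.
    joined.append("".join(selected).strip())

print(joined)
# ['Central Intelligence Agency.', 'Culinary Institute of America.']
```

Like the XPath expression it mirrors, this only tests the direct parent and the direct siblings: text nested deeper inside a `small`, or inside an element that follows the `br`, would still get through. Testing `ancestor::small` instead of `parent::small` would cover the first case.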

Any advice on how to improve this solution?
