HTML XPath: Extracting text mixed in with multiple tags?

Views: 346 | Published: 2018/6/14 18:07:29 | Tags: html, xpath, scrapy

Problem description

Goal: Extract text from a particular element (e.g. li), while ignoring the various mixed in tags, i.e. flatten the first-level child and simply return the concatenated text of each flattened child separately.

Example:
<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
<li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
</ol>
</div>

Desired text:

Central Intelligence Agency

Culinary Institute of America

However, the anchor tags embedded in each li prevent a simple retrieval.

To return each li tag separately, we use the straightforward:
//div[contains(@id,"mw-content-text")]/ol/li
but that also includes surrounding anchor tags, etc. And
//div[contains(@id,"mw-content-text")]/ol/li/text()
returns only the text nodes that are direct children of li, i.e. 'Central ', '.', ...
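
To make the "direct children only" behaviour concrete, here is a minimal sketch using Python's standard-library xml.etree.ElementTree instead of Scrapy (an assumption made so the example runs without extra packages). ElementTree has no text() node test; it stores text in the .text and .tail attributes, so the hypothetical helper direct_text_nodes below is only an analogue of li/text(), not the same query.

```python
import xml.etree.ElementTree as ET

# Second list item from the question, parsed on its own.
ITEM = (
    '<li>Culinary <a href="/Institute.html">Institute</a> of '
    '<a href="/America.html">America</a>.</li>'
)


def direct_text_nodes(element):
    """Collect only the text that sits directly inside element.

    This mimics li/text(): text inside child elements (the anchors)
    is deliberately skipped. direct_text_nodes is a hypothetical helper.
    """
    parts = []
    if element.text:           # text before the first child element
        parts.append(element.text)
    for child in element:
        if child.tail:         # text that follows each child element
            parts.append(child.tail)
    return parts


if __name__ == "__main__":
    li = ET.fromstring(ITEM)
    print(direct_text_nodes(li))       # the anchor text is missing
    print("".join(li.itertext()))      # all descendant text, in document order
```

The first call leaves out "Institute" and "America" because they live inside the a elements, which is the same gap the li/text() query shows.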

It seemed logical then to look for text nodes of the element itself and its descendants:
//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]
but that returns nothing at all!

Any suggestions? I'm using Python, so I'm open to using other modules for post-processing.

(I am using the Scrapy HtmlXPathSelector which seems XPath 1.0 compliant)
Solution
You were almost there. There is a small problem in:
//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]
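Written without parentheses, text is a name test that matches elements named text, not text nodes. The corrected expression uses the node test text() (note that it still selects the li elements themselves, filtered to those that contain a text node):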
//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text()]
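However, there is a simpler expression that produces exactly the wanted concatenation of all text nodes under the specified li (in XPath 1.0, string() applied to a node-set uses only its first node in document order, so this yields the text of the first li; for the others, evaluate string(.) with each li as the context node):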
string(//div[contains(@id,"mw-content-text")]/ol/li)
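
Since the question allows post-processing in Python, the per-item flattening can also be sketched without XPath string(). The example below is a minimal sketch, not the answer's own method: it assumes the standard-library xml.etree.ElementTree rather than Scrapy, assumes the sample markup is well-formed (closing tag written as </div>), and uses a hypothetical helper name, flatten_items. Joining itertext() for one li gives the same text as XPath string(.) evaluated on that li.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, with the closing tag written as </div>
# so that a strict XML parser accepts it (an assumption of this sketch).
SAMPLE = (
    '<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>'
    '<ol>'
    '<li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>'
    '<li>Culinary <a href="/Institute.html">Institute</a> of '
    '<a href="/America.html">America</a>.</li>'
    '</ol>'
    '</div>'
)


def flatten_items(markup):
    """Return the concatenated text of each <li>, one string per item."""
    root = ET.fromstring(markup)
    # "ol/li" is evaluated relative to the root <div>.
    # itertext() yields every text node under the element in document
    # order, so the join is the element's full string value.
    return ["".join(li.itertext()) for li in root.findall("ol/li")]


if __name__ == "__main__":
    for text in flatten_items(SAMPLE):
        print(text)
```

Each returned string keeps the trailing period from the markup, so stripping punctuation is left to the caller. Real-world HTML is often not well-formed XML, in which case a lenient HTML parser would be needed in place of ET.fromstring.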
