HTML XPath: Extracting text mixed in with multiple tags?

Views: 31 Published: 2022/1/4 20:42:10 Tags: html xpath scrapy


Question

Goal: extract text from a particular element (e.g. li) while ignoring the various mixed-in tags, i.e. flatten each first-level child and return its concatenated text separately.

Example:

<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
    <ol>
    <li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
    <li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
    </ol>
</div>

Desired text:

Central Intelligence Agency
Culinary Institute of America

However, the anchor tags mixed into each li prevent a simple retrieval.

To return each li tag separately, we use the straightforward:

//div[contains(@id,"mw-content-text")]/ol/li

but that also includes the anchor tags inside each li, etc. And

//div[contains(@id,"mw-content-text")]/ol/li/text()

returns only the text nodes that are direct children of li, i.e. 'Central ', '.', ...

It then seemed logical to look for the text nodes of self and descendants:

//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]

but that returns nothing at all!

Any suggestions? I'm using Python, so I'm open to using other modules for post-processing.

(I am using the Scrapy HtmlXPathSelector, which seems to be XPath 1.0 compliant.)

Solution

You were almost there. There is a small problem in:

//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]

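Without parentheses, text is a name test that matches only elements named text, and the document contains none; the node test for text nodes is text(). The corrected expression is: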

//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text()]
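The name-test behaviour can be observed with a small sketch. It uses Python's standard-library `xml.etree.ElementTree`, which implements only a limited XPath subset, so it is an analogy rather than the exact expression from the question: its `[tag]` predicate checks for a child element with that name. The `<body>` wrapper and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a hypothetical <body> element.
SAMPLE = """
<body>
<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
  <ol>
    <li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
    <li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
  </ol>
</div>
</body>
"""

root = ET.fromstring(SAMPLE)

# A bare name in a predicate is read as an element name:
# "li elements that have a child element called <text>". There are none.
items_with_text_child = root.findall(".//li[text]")

# The same kind of predicate with a name that does exist, <a>, matches both items.
items_with_a_child = root.findall(".//li[a]")

print(len(items_with_text_child))  # 0
print(len(items_with_a_child))     # 2
```

In a full XPath 1.0 engine the fix is the one given above: write `text()` so the step tests for text nodes instead of elements named `text`.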

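However, there is a simpler expression that produces exactly the wanted concatenation of all text nodes under an li. In XPath 1.0, string() applied to a node-set converts only the first node in document order, so the expression below yields the text of the first li; evaluating string(.) with each li as the context node yields one string per item: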

string(//div[contains(@id,"mw-content-text")]/ol/li)
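Since the question is open to post-processing in Python, here is a minimal sketch of the same per-item flattening using only the standard library. It is not the Scrapy selector API: it assumes the markup is well-formed enough for `xml.etree.ElementTree` to parse, and the `<body>` wrapper and the helper name `flattened_li_texts` are hypothetical. Joining `itertext()` for one element gives the concatenation of all its descendant text, which is what `string(.)` returns for that element.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, wrapped in a hypothetical <body> element.
SAMPLE = """
<body>
<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
  <ol>
    <li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
    <li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
  </ol>
</div>
</body>
"""

def flattened_li_texts(markup):
    """Return one flattened string per li under the target div."""
    root = ET.fromstring(markup)
    # ElementTree supports exact attribute matches, not contains().
    items = root.findall(".//div[@id='mw-content-text']/ol/li")
    # itertext() walks the element's own text and all descendant text in order.
    return ["".join(li.itertext()) for li in items]

li_texts = flattened_li_texts(SAMPLE)
for text in li_texts:
    print(text)
```

Each string keeps the trailing period from the source markup; strip it afterwards if the bare phrase is wanted.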
