HTML XPath: Extracting text mixed in with multiple tags?

Question

Goal: Extract the text of a particular element (e.g. li) while ignoring the various tags mixed into it, i.e. flatten each first-level child and return its concatenated text separately.

Example:

<div id="mw-content-text"><h2><span class="mw-headline" >CIA</span></h2>
    <ol>
    <li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
    <li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
    </ol>

    </Div>  

desired text:

  • Central Intelligence Agency
  • Culinary Institute of America

The problem is that the anchor tags embedded in each li prevent a simple retrieval.

To return each li tag separately, we use the straightforward:

//div[contains(@id,"mw-content-text")]/ol/li

but that also includes surrounding anchor tags, etc. And

//div[contains(@id,"mw-content-text")]/ol/li/text()

returns only the text elements that are direct children of li, i.e. 'Central','.'...
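
The difference between direct-child text nodes and all descendant text nodes can be seen in a small sketch. It uses Python's standard-library `xml.etree.ElementTree` rather than Scrapy (an assumption made only to keep the example self-contained), and it parses just the first li from the example.

```python
import xml.etree.ElementTree as ET

# First <li> from the example above, parsed on its own.
li = ET.fromstring(
    '<li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>'
)

# Text nodes that are direct children of <li>: the text before the first
# child element plus the "tail" text that follows each child element.
direct = [t for t in [li.text] + [child.tail for child in li] if t]

# Text nodes anywhere below <li>, in document order.
everything = list(li.itertext())

print(direct)      # the pieces that li/text() would select
print(everything)  # the pieces that li//text() would select
```

The text inside the anchor shows up only in the second list, which is why `li/text()` alone misses it.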

It seemed logical then to look for text elements of self and descendants

//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]

but that returns nothing at all!

Any suggestions? I'm using Python, so I'm open to using other modules for post-processing.

(I am using the Scrapy HtmlXPathSelector, which seems to be XPath 1.0 compliant.)

Solution

You were almost there. There is a small problem in:

//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]

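In XPath, text is a name test that matches elements named text, whereas text() is the node test that matches text nodes. The corrected expression is: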

//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text()]

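However, there is a simpler expression that produces exactly the wanted concatenation of all text nodes under the specified li. Note that in XPath 1.0, string() applied to a node-set uses only the first node in document order, so it has to be evaluated once per li to get every item: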

string(//div[contains(@id,"mw-content-text")]/ol/li)
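
Since the question allows post-processing in Python, here is a minimal sketch of the per-li idea. It assumes the standard-library `xml.etree.ElementTree` instead of Scrapy, a well-formed copy of the example markup (closing `</div>` lower-cased and wrapped in a hypothetical `<body>` root), and a helper name `li_texts` that is purely illustrative. Joining `itertext()` gives the same concatenation as the XPath string-value of each li.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the example markup; the <body> wrapper is an
# assumption added so the div can be found as a descendant of the root.
MARKUP = """
<body>
<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
    <ol>
    <li>Central <a href="/Intelligence_Agency.html">Intelligence Agency</a>.</li>
    <li>Culinary <a href="/Institute.html">Institute</a> of <a href="/America.html">America</a>.</li>
    </ol>
</div>
</body>
""".strip()


def li_texts(markup):
    """Return the concatenated text of every li, one string per li."""
    root = ET.fromstring(markup)
    items = root.findall(".//div[@id='mw-content-text']/ol/li")
    # itertext() walks all descendant text nodes in document order.
    return ["".join(li.itertext()) for li in items]


texts = li_texts(MARKUP)
for text in texts:
    print(text)
```

With an lxml-backed selector, the analogous step would be to select the li nodes first and then evaluate `string(.)` relative to each one.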
