XPath expression not working


Question

    <DOC NUMBER=1>
<DOCFULL> -->
<br><div class="c0">
<p class="c1"><span class="c2">Dokument 1 von 3</span></p>
</div>
<br><div class="c0">
<br><p class="c1"><span class="c2">Associated Press Financial Wire</span></p>
</div>
<br><div class="c3">
<p class="c1"><span class="c2">April 25, 2012 Wednesday 9:18 PM GMT </span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c6">Apple CEO Tim Cook emerges from Steve Jobs' shadow</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">BYLINE: </span><span class="c2">By PETER SVENSSON, AP Technology Writer</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">SECTION: </span><span class="c2">BUSINESS NEWS</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">LENGTH: </span><span class="c2">794 words</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">DATELINE: </span><span class="c2">NEW YORK </span></p>
</div>
<br><div class="c4">
<p class="c8"><span class="c2"> MAIN TEXT 1</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">LOAD-DATE: </span><span class="c2">April 26, 2012</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">LANGUAGE: </span><span class="c2">ENGLISH</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">PUBLICATION-TYPE: </span><span class="c2">Newswire</span></p>
</div>
<br><div class="c0">
<br><p class="c1"><span class="c2">Copyright 2012 Associated Press<br>All Rights Reserved</span></p>
</div>
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->

I am new to XPath and I want to use it in combination with R (Duncan Temple Lang's XML package) to query an HTML document that I received from LexisNexis. The document contains multiple news articles, and each article is bounded by the <DOC NUMBER=1> <DOCFULL> tags. I want to extract a few pieces of information for each article. For example, to extract the SECTION information, I got this far:

library(XML)
doc <- htmlParse("hmtldoc.HTML")
xpathSApply(doc,"//span[text()='SECTION: ']/..", xmlValue)

This gives me

[1] "SECTION: BUSINESS NEWS" "SECTION: BUSINESS NEWS" "SECTION: BUSINESS NEWS"

That is output I can work with. The main problem is that not every article has SECTION information. What I need to know is which articles provide this information and which don't, preferably by getting NA or an empty list element back for the articles without it, so I can work this out myself.

Related to this question: I tried to come up with a solution where I select either the DOC or the DOCFULL node first and continue from there, e.g.:

xpathSApply(doc,"//DOCFULL/*/span[text()='SECTION: ']/..", xmlValue)

I thought this should return the same text as above, but it doesn't. Anyway, I am still very new to this language and appreciate any help.

Recommended answer

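Because there is more than one 'level' of descendant element between DOCFULL and the spans (the span sits inside a p, which sits inside a div), the single /*/ step in your expression is one level too shallow. You will need to either

be vague about the depth, using the descendant axis (//):

//DOCFULL//*/span[text()='SECTION: ']/..

or be specific about the levels (div and p):

//DOCFULL/*/*/span[text()='SECTION: ']/..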
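
For the other part of the question (finding out which articles have no SECTION line), one possible approach is to run a relative XPath once per article instead of once over the whole document. The following is only a minimal sketch under stated assumptions: each article is wrapped in its own properly closed DOCFULL element, and the markup is parsed as case-sensitive XML with xmlParse. The inline document txt, the root wrapper element, and the helper section_of are made up for illustration. A real LexisNexis file may need cleaning first, since in the sample above some of the DOC and DOCFULL tags sit inside HTML comments.

```r
library(XML)

# Hypothetical, well-formed stand-in for the real file: two articles,
# only the first of which contains a SECTION line.
txt <- '<root>
  <DOCFULL>
    <div class="c4"><p class="c5"><span class="c7">SECTION: </span><span class="c2">BUSINESS NEWS</span></p></div>
    <div class="c4"><p class="c5"><span class="c7">LENGTH: </span><span class="c2">794 words</span></p></div>
  </DOCFULL>
  <DOCFULL>
    <div class="c4"><p class="c5"><span class="c7">LENGTH: </span><span class="c2">120 words</span></p></div>
  </DOCFULL>
</root>'

doc <- xmlParse(txt, asText = TRUE)

# One wildcard step only reaches the div, so nothing matches;
# two wildcard steps (div, then p) reach the span.
n_one_level  <- length(xpathSApply(doc, "//DOCFULL/*/span[text()='SECTION: ']/..", xmlValue))
n_two_levels <- length(xpathSApply(doc, "//DOCFULL/*/*/span[text()='SECTION: ']/..", xmlValue))

# Return the SECTION paragraph text for one article node, or NA if it has none.
# The leading "." keeps the search inside the given article.
section_of <- function(article) {
  hit <- xpathSApply(article, ".//span[text()='SECTION: ']/..", xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

articles <- getNodeSet(doc, "//DOCFULL")
sections <- vapply(articles, section_of, character(1))
print(sections)
```

The result has one element per article, in document order, so the position of each NA shows which article lacks the field. The same helper could be reused for BYLINE, LENGTH and the other labels by changing the text in the predicate.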
