xpath expression not working

Views: 50 | Published: 2021/5/14 21:02:52 | Tags: html, r, xpath


Problem description

    <DOC NUMBER=1>
<DOCFULL> -->
<br><div class="c0">
<p class="c1"><span class="c2">Dokument 1 von 3</span></p>
</div>
<br><div class="c0">
<br><p class="c1"><span class="c2">Associated Press Financial Wire</span></p>
</div>
<br><div class="c3">
<p class="c1"><span class="c2">April 25, 2012 Wednesday 9:18 PM GMT </span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c6">Apple CEO Tim Cook emerges from Steve Jobs' shadow</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">BYLINE: </span><span class="c2">By PETER SVENSSON, AP Technology Writer</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">SECTION: </span><span class="c2">BUSINESS NEWS</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">LENGTH: </span><span class="c2">794 words</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">DATELINE: </span><span class="c2">NEW YORK </span></p>
</div>
<br><div class="c4">
<p class="c8"><span class="c2"> MAIN TEXT 1</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">LOAD-DATE: </span><span class="c2">April 26, 2012</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">LANGUAGE: </span><span class="c2">ENGLISH</span></p>
</div>
<br><div class="c4">
<p class="c5"><span class="c7">PUBLICATION-TYPE: </span><span class="c2">Newswire</span></p>
</div>
<br><div class="c0">
<br><p class="c1"><span class="c2">Copyright 2012 Associated Press<br>All Rights Reserved</span></p>
</div>
<!-- Hide XML section from browser
</DOCFULL>
</DOC> -->

I am new to XPath and want to use it with R (Duncan Temple Lang's XML package) to query an HTML document that I received from LexisNexis. The document contains multiple news articles, each bounded by the <DOC NUMBER=1> <DOCFULL> tags. I want to extract several pieces of information for each article. For the SECTION information, for example, I got this far:

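library(XML)  # provides htmlParse() and xpathSApply()
doc <- htmlParse("hmtldoc.HTML")
# Select the parent <p> of every span whose text is exactly "SECTION: "
xpathSApply(doc, "//span[text()='SECTION: ']/..", xmlValue)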

This gives me:

[1] "SECTION: BUSINESS NEWS" "SECTION: BUSINESS NEWS" "SECTION: BUSINESS NEWS"

That is output I can work with. The main problem is that not every article has SECTION information. I need to know which articles provide this information and which do not, preferably by getting NA or an empty list element for the articles that lack it, so that I can work this out myself.
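One possible way to get one value per article, sketched under assumptions rather than offered as a confirmed answer: if the <DOC NUMBER=...> markers sit inside HTML comments in the real file (the "-->" after <DOCFULL> in the sample suggests this, but it is an assumption), then the opening comments can serve as article boundaries. The toy input, `is_start`, and `section_of` below are hypothetical names and data made up for illustration; only `htmlParse`, `getNodeSet`, `xpathSApply`, and `xmlValue` come from the XML package.

```r
library(XML)

# Toy stand-in for the LexisNexis file: two articles, the second one
# has no SECTION line. The comment layout is an assumption.
html <- paste0(
  "<html><body>",
  "<!-- Hide XML section from browser <DOC NUMBER=1> <DOCFULL> -->",
  "<div class='c4'><p class='c5'><span class='c7'>SECTION: </span>",
  "<span class='c2'>BUSINESS NEWS</span></p></div>",
  "<!-- Hide XML section from browser </DOCFULL> </DOC> -->",
  "<!-- Hide XML section from browser <DOC NUMBER=2> <DOCFULL> -->",
  "<div class='c4'><p class='c5'><span class='c7'>LENGTH: </span>",
  "<span class='c2'>794 words</span></p></div>",
  "<!-- Hide XML section from browser </DOCFULL> </DOC> -->",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# XPath fragment matching the comment that opens an article
is_start <- "comment()[contains(., 'DOC NUMBER=')]"

# One opening comment per article
n_docs <- length(getNodeSet(doc, paste0("//", is_start)))

# SECTION paragraph of article i: the SECTION span that has exactly
# i opening comments before it in document order. NA if there is none.
section_of <- function(i) {
  path <- sprintf(
    "//span[text()='SECTION: '][count(preceding::%s) = %d]/..",
    is_start, i
  )
  hit <- xpathSApply(doc, path, xmlValue)
  if (length(hit) == 0) NA_character_ else hit[[1]]
}

sections <- vapply(seq_len(n_docs), section_of, character(1))
print(sections)
```

The result has one element per article, so a missing SECTION shows up as NA at that article's position. If the markers turn out not to be comments in the real file, the `is_start` fragment would need to be replaced by whatever reliably marks the start of an article.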

Related to this: I tried a solution in which I first select either the DOC or the DOCFULL node and continue from there, e.g.:

xpathSApply(doc,"//DOCFULL/*/span[text()='SECTION: ']/..", xmlValue)

I thought this would return the same text as above, but it does not. In any case, I am still very new to this language and appreciate any help.
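One thing that can be checked independently of the real file is the depth of the path. In the sample, each span sits inside a p inside a div, so `/*/span` (exactly one element between DOCFULL and span) would not reach it even if DOCFULL were a real element. The sketch below uses a small hand-written XML string, not the LexisNexis file, and the variable names are hypothetical. Whether DOCFULL exists as an element at all after `htmlParse` is a separate question that this sketch does not settle.

```r
library(XML)

# Hand-written toy document with the same nesting as the sample:
# DOCFULL > div > p > span
xml <- paste0(
  "<DOCFULL><div class='c4'><p class='c5'>",
  "<span class='c7'>SECTION: </span><span class='c2'>BUSINESS NEWS</span>",
  "</p></div></DOCFULL>"
)
toy <- xmlParse(xml, asText = TRUE)

# "/*/span" only matches spans that are grandchildren of DOCFULL
child_step <- xpathSApply(
  toy, "//DOCFULL/*/span[text()='SECTION: ']/..", xmlValue
)

# "//span" matches spans at any depth below DOCFULL
descendant_step <- xpathSApply(
  toy, "//DOCFULL//span[text()='SECTION: ']/..", xmlValue
)

print(length(child_step))
print(descendant_step)
```

Here the first query finds nothing and the second finds the SECTION paragraph, which is consistent with the behaviour described above, though other causes in the real file are possible.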
