R: XPath expression returns links outside of selected element

Question

I am using R to scrape the links from the main table on that page, using XPath syntax. The main table is the third table on the page, and I want only the links that point to magazine articles.

My code is as follows:

require(XML)
(x = htmlParse("http://www.numerama.com/magazine/recherche/125/hadopi/date"))
(y = xpathApply(x, "//table")[[3]])
(z = xpathApply(y, "//table//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href"))
(links = unique(z))

If you look at the output, the final links come not from the main table but from the sidebar, even though my third line selects the main table by restricting object y to the third table.

What am I doing wrong? What is the correct (or more efficient) way to code this with XPath?

Note: I am new to XPath.

Answered (really quickly), many thanks! My solution is below.

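extract <- function(x) {
    message(x)  # log the results page number being scraped
    html = htmlParse(paste0("http://www.numerama.com/magazine/recherche/", x, "/hadopi/date"))
    # Keep only the third table on the page (the main results table).
    html = xpathApply(html, "//table")[[3]]
    # The leading "." restricts the search to descendants of that table.
    html = xpathApply(html, ".//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
    # Drop the comment anchor so article and comment links deduplicate.
    html = gsub("#ac_newscomment", "", html)
    unique(html)
}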

d = lapply(1:125, extract)
d = unlist(d)
write.table(d, "numerama.hadopi.news.txt", row.names = FALSE)

This saves all links to news items with the keyword 'Hadopi' on this website.

Recommended answer

You need to start the pattern with "." if you want to restrict the search to the current node. A leading "/" goes back to the start of the document, so "//table//a" searches the whole page (even though the root node is not in y).

xpathSApply(y, ".//a/@href" )
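The difference between the two forms can be checked on a small inline document. The following is a minimal sketch, assuming the XML package is installed; the two-table HTML snippet and the variable names (doc, main, absolute, relative) are made up for illustration and are not taken from the real page.

```r
library(XML)

# Hypothetical miniature page: a sidebar table followed by a main table.
doc <- htmlParse(
  "<html><body>
     <table><tr><td><a href='/sidebar/1'>side</a></td></tr></table>
     <table><tr><td><a href='/magazine/1'>main</a></td></tr></table>
   </body></html>",
  asText = TRUE
)

# Node for the second table only.
main <- xpathApply(doc, "//table")[[2]]

# "//a" is anchored at the document root, so the context node is ignored.
absolute <- as.character(xpathSApply(main, "//a/@href"))

# ".//a" is anchored at the context node, so only its descendants match.
relative <- as.character(xpathSApply(main, ".//a/@href"))

print(absolute)
print(relative)
```

Here absolute should contain both links while relative should contain only the link inside the selected table, which is the behaviour described in the question.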

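Alternatively, you can extract the third table directly with XPath. The parentheses make [3] count tables across the whole document, like xpathApply(x, "//table")[[3]]; without them, //table[3] matches any table that is the third table child of its own parent:

xpathApply(x, "(//table)[3]//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")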
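How a positional predicate behaves depends on where the tables sit in the page, so it is worth checking against the real markup. The sketch below is only an illustration, assuming the XML package; the inline HTML (two sidebar tables in one div, one main table in another) and the names by_document, by_parent and via_r are hypothetical.

```r
library(XML)

# Hypothetical layout: two sibling sidebar tables, then a main table
# that lives inside a different parent element.
doc <- htmlParse(
  "<html><body>
     <div><table><tr><td><a href='/side/1'>s1</a></td></tr></table>
          <table><tr><td><a href='/side/2'>s2</a></td></tr></table></div>
     <div><table><tr><td><a href='/magazine/1'>m1</a></td></tr></table></div>
   </body></html>",
  asText = TRUE
)

# Third table in document order.
by_document <- as.character(xpathSApply(doc, "(//table)[3]//a/@href"))

# Tables that are the third table child of their own parent (none here).
by_parent <- as.character(xpathSApply(doc, "//table[3]//a/@href"))

# Two-step version: index the node list in R, then search relative to it.
via_r <- as.character(
  xpathSApply(xpathApply(doc, "//table")[[3]], ".//a/@href")
)

print(by_document)
print(by_parent)
print(via_r)
```

On this made-up layout the parenthesised form agrees with the two-step R version, while the unparenthesised form finds nothing because no parent has three table children. When all tables share one parent, the two XPath forms select the same table.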
