Using R to scrape the link address of a downloadable file from a web page?

Problem Description

I'm trying to automate a process that involves downloading .zip files from a couple of web pages and extracting the .csvs they contain. The challenge is that the .zip file names, and thus the link addresses, change weekly or annually, depending on the page. Is there a way to scrape the current link addresses from those pages so I can then feed those addresses to a function that downloads the files?
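
For context, the download-and-extract step is manageable once a link address is known; here is a minimal sketch of that part, assuming a known URL (the address below is a placeholder):

# Sketch of the download-and-extract step, given a known link address.
# The URL here is a placeholder; the real one changes weekly.
zip_url <- "http://www.example.com/some-weekly-file_csv.zip"
tmp <- tempfile(fileext = ".zip")
download.file(zip_url, tmp, mode = "wb")     # "wb" avoids corrupting the binary on Windows
csv_name <- unzip(tmp, list = TRUE)$Name[1]  # name of the first file inside the archive
dat <- read.csv(unz(tmp, csv_name), stringsAsFactors = FALSE)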

One of the target pages is http://www.acleddata.com/data/realtime-data-2015/. The file I want to download is the second bullet under the header "2015 Realtime Complete All Africa File", i.e., the zipped .csv. As I write, that file is labeled "Realtime 2015 All Africa File (updated 11th July 2015)(csv)" on the web page, and the link address that I want is http://www.acleddata.com/wp-content/uploads/2015/07/ACLED-All-Africa-File_20150101-to-20150711_csv.zip, but that should change later today, because the data are updated each Monday; hence my challenge.

I tried but failed to automate extraction of that .zip file name with 'rvest' and the SelectorGadget extension in Chrome. Here's how that went:

> library(rvest)
> realtime.page <- "http://www.acleddata.com/data/realtime-data-2015/"
> realtime.html <- html(realtime.page)
> realtime.link <- html_node(realtime.html, xpath = "//ul[(((count(preceding-sibling::*) + 1) = 7) and parent::*)]//li+//li//a")
> realtime.link
[1] NA

The xpath in that call to html_node() came from SelectorGadget: I highlighted just the "(csv)" portion of the "Realtime 2015 All Africa File (updated 11th July 2015)(csv)" field in green, then clicked enough other highlighted bits of the page to eliminate all the yellow and leave only red and green.
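
A quick way to see what is actually on the page, using the realtime.html object from the transcript above, is to list every link rather than asking for a single node (a minimal sketch):

# Inspect every link on the page to see what is actually there
links <- html_nodes(realtime.html, "a")
head(html_text(links))          # the visible link labels
head(html_attr(links, "href"))  # the underlying link addresses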

Did I make a small mistake in that process, or am I just entirely on the wrong track here? As you can tell, I have zero experience with HTML and web-scraping, so I'd really appreciate some assistance.

Recommended Answer

I think you're trying to do too much in a single xpath expression; I'd attack the problem in a sequence of smaller steps:

library(rvest)
library(stringr)
page <- read_html("http://www.acleddata.com/data/realtime-data-2015/")

page %>%
  html_nodes("a") %>%        # find all links on the page
  html_attr("href") %>%      # pull out each link's url
  str_subset("\\.zip$") %>%  # keep only urls that end in .zip
  .[[1]]                     # take the first match
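
From there, the recovered address can be fed straight into the download step; a minimal sketch of that hand-off (the tempfile is just a scratch location):

zip_url <- page %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  str_subset("\\.zip$") %>%
  .[[1]]

tmp <- tempfile(fileext = ".zip")
download.file(zip_url, tmp, mode = "wb")  # "wb" keeps the zip binary intact
unzip(tmp, exdir = tempdir())             # extract the contained .csv

Since str_subset() returns every matching url, you can also print the whole vector first and pick the right element if more than one .zip turns up.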
