Download files with specific extension from a website

Problem Description

How can I download the content of a webpage, find all files with a specific extension listed on it, and then download all of them? For example, I would like to download all netcdf files (with extension *.nc4) from the following webpage: https://data.giss.nasa.gov/impacts/agmipcf/agmerra/.

I was recommended to look into the RCurl package but could not find how to do this.

Recommended Answer

library(stringr)

# Get the content of the page
thepage = readLines('https://data.giss.nasa.gov/impacts/agmipcf/agmerra/')

# Find the lines that contain the names of netcdf files
nc4.lines <- grep('\\.nc4', thepage)

# Subset the original dataset leaving only those lines
thepage <- thepage[nc4.lines]

# Extract the file names: match from the leading "A" of each file name
# through the closing quote of the href attribute
str.loc <- str_locate(thepage, 'A.*nc4"')

# Substring, dropping the trailing quote so only the file name remains
file.list <- substring(thepage, str.loc[,1], str.loc[,2] - 1)

# Download all files (mode = "wb" keeps the binary .nc4 files intact on Windows)
for (ifile in file.list) {
  download.file(paste0("https://data.giss.nasa.gov/impacts/agmipcf/agmerra/",
                       ifile),
                destfile = ifile, method = "libcurl", mode = "wb")
}
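
Since the question mentions the RCurl package, here is a minimal alternative sketch using it. This is not part of the original answer: it assumes the directory listing is plain HTML whose links end in .nc4, fetches the page with RCurl::getURL(), and pulls the file names out with base-R regular expressions (the names base.url, links and files are illustrative).

library(RCurl)

base.url <- "https://data.giss.nasa.gov/impacts/agmipcf/agmerra/"

# Fetch the raw HTML of the directory listing as a single string
page <- getURL(base.url)

# Pull out every href attribute ending in .nc4, then strip the surrounding text
links <- regmatches(page, gregexpr('href="[^"]*\\.nc4"', page))[[1]]
files <- gsub('^href="|"$', '', links)

# Download each file into the working directory; mode = "wb" preserves binary content
for (f in files) {
  download.file(paste0(base.url, f), destfile = f, mode = "wb")
}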
