How to download multiple files using a loop in R?


Problem Description


I have to download multiple xlsx files containing a country's census data from the internet using R. The files are located at this link. The problems are:

  1. I am unable to write a loop that will let me go back and forth to download the files.
  2. The downloaded files have weird names rather than the district names. How can I change them to the district names dynamically?

I have used the code below:

url <- "http://www.censusindia.gov.in/2011census/HLO/HL_PCA/HH_PCA1/HLPCA-28532-2011_H14_census.xlsx"
download.file(url, "HLPCA-28532-2011_H14_census.xlsx", mode = "wb")

But this downloads only one file at a time and doesn't change the file name.
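For the simpler case where the URLs and the matching district names are already known, a plain loop over two paired vectors is enough. Below is a minimal sketch, not the accepted answer's approach; the vectors urls and district_names are hypothetical placeholders you would fill in yourself:

# hypothetical inputs: one URL per file and a matching vector of district names
urls <- c(
    "http://www.censusindia.gov.in/2011census/HLO/HL_PCA/HH_PCA1/HLPCA-28532-2011_H14_census.xlsx"
)
district_names <- c("ExampleDistrict")  # placeholder name

for (k in seq_along(urls)) {
    # save each file under its district name instead of the server's file name
    download.file(urls[k], paste0(district_names[k], ".xlsx"), mode = "wb")
}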

Thanks in advance.

Solution

Assuming you want all the data without knowing all of the URLs, your question involves web parsing. The httr package provides useful functions for retrieving the HTML code of a given website, which you can then parse for links.

Maybe this bit of code is what you're looking for:

library(httr)

base_url = "http://www.censusindia.gov.in/2011census/HLO/"  # main website
r <- GET(paste0(base_url, "HL_PCA/Houselisting-housing-HLPCA.html"))
rc = content(r, "text")
rcl = unlist(strsplit(rc, "<a href =\\\""))                        # split the page at link tags
rcl = rcl[grepl("Houselisting-housing-.+?\\.html", rcl)]           # keep links to houselisting pages

names = gsub("^.+?>(.+?)</.+$", "\\1", rcl)                        # get region names (the link text)
names = gsub("^\\s+|\\s+$", "", names)                             # trim whitespace
links = gsub("^(Houselisting-housing-.+?\\.html).+$", "\\1", rcl)  # get links

# iterate over regions
for (i in seq_along(links)) {
    url_hh = paste0(base_url, "HL_PCA/", links[i])
    if (!url_success(url_hh)) next

    r <- GET(url_hh)
    rc = content(r, "text")
    rcl = unlist(strsplit(rc, "<a href =\\\""))    # split the page at link tags
    rcl = rcl[grepl("\\.xlsx", rcl)]               # keep links to xlsx files (note the escaped dot)

    hh_names = gsub("^.+?>(.+?)</.+$", "\\1", rcl)   # get district names (the link text)
    hh_names = gsub("^\\s+|\\s+$", "", hh_names)     # trim whitespace
    hh_links = gsub("^(.+?\\.xlsx).+$", "\\1", rcl)  # get links

    # iterate over districts within the region
    for (j in seq_along(hh_links)) {
        url_xlsx = paste0(base_url, "HL_PCA/", hh_links[j])
        if (!url_success(url_xlsx)) next

        filename = paste0(names[i], "_", hh_names[j], ".xlsx")  # region_district.xlsx
        download.file(url_xlsx, filename, mode = "wb")
    }
}
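One caveat on the design: url_success() comes from httr but has been deprecated in more recent releases, so the checks above may warn or stop working on a current installation. A minimal drop-in replacement, assuming only HEAD() and status_code() from httr (the helper below is our own sketch, not part of the package):

url_success <- function(x) {
    # treat any 2xx response as success; request errors (e.g. DNS failure) count as failure
    res <- tryCatch(httr::HEAD(x), error = function(e) NULL)
    !is.null(res) && httr::status_code(res) < 300
}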
