Read list of file names from web into R


Question

I am trying to read a lot of csv files into R from a website. There are multiple years of daily (business days only) files. All of the files have the same data structure. I can successfully read one file using the following logic:

# enter user credentials
user     <- "JohnDoe"
password <- "SecretPassword"
credentials <- paste(user,":",password,"@",sep="")
web.site <- "downloads.theice.com/Settlement_Reports_CSV/Power/"

# construct path to data
path <- paste("https://", credentials, web.site, sep="")

# read data for 4/10/2013
file  <- "icecleared_power_2013_04_10"
fname <- paste(path,file,".dat",sep="")
df <- read.csv(fname,header=TRUE, sep="|",as.is=TRUE)

However, I'm looking for tips on how to read all the files in the directory at once. I suppose I could generate a sequence of dates, construct the file name above in a loop, and use rbind to append each file, but that seems cumbersome. Plus there will be issues when attempting to read weekends and holidays where there are no files.
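Roughly, the loop I have in mind looks like the untested sketch below (it reuses the path built above and the icecleared_power_YYYY_MM_DD naming, and skips days with no file via tryCatch), but it still feels clunky:

# Untested sketch of the date-loop idea: build one file name per business day,
# try to read it, and silently skip holidays where no file exists.
dates <- seq(as.Date("2013-01-01"), as.Date("2013-04-10"), by = "day")
dates <- dates[!as.POSIXlt(dates)$wday %in% c(0, 6)]  # drop Sundays (0) and Saturdays (6)

files <- paste0(path, "icecleared_power_", format(dates, "%Y_%m_%d"), ".dat")

read_one <- function(fname) {
  tryCatch(read.csv(fname, header = TRUE, sep = "|", as.is = TRUE),
           error = function(e) NULL)  # no file on that date (holiday): return NULL
}

df.list <- Filter(Negate(is.null), lapply(files, read_one))
df.all  <- do.call(rbind, df.list)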

The images below show what the list of files looks like in the web browser:

[screenshots of the directory listing in the web browser]

Is there a way to scan the path (from above) to get a list of all the file names in the directory that meet certain criteria (i.e. start with "icecleared_power_", as there are also some files in that location with a different starting name that I do not want to read in), and then loop read.csv through that list and use rbind to append?

Any guidance would be greatly appreciated.

Answer

I would first try to just scrape the links to the relevant data files and use the resulting information to construct the full download path that includes user logins and so on. As others have suggested, lapply would be convenient for batch downloading.

Here's an easy way to extract the URLs. Obviously, modify the example to suit your actual scenario.

Here, we're going to use the XML package to identify all the links available at the CRAN archives for the Amelia package (http://cran.r-project.org/src/contrib/Archive/Amelia/).

> library(XML)
> url <- "http://cran.r-project.org/src/contrib/Archive/Amelia/"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)
> links
                   href                    href                    href 
             "?C=N;O=D"              "?C=M;O=A"              "?C=S;O=A" 
                   href                    href                    href 
             "?C=D;O=A" "/src/contrib/Archive/"  "Amelia_1.1-23.tar.gz" 
                   href                    href                    href 
 "Amelia_1.1-29.tar.gz"  "Amelia_1.1-30.tar.gz"  "Amelia_1.1-32.tar.gz" 
                   href                    href                    href 
 "Amelia_1.1-33.tar.gz"   "Amelia_1.2-0.tar.gz"   "Amelia_1.2-1.tar.gz" 
                   href                    href                    href 
  "Amelia_1.2-2.tar.gz"   "Amelia_1.2-9.tar.gz"  "Amelia_1.2-12.tar.gz" 
                   href                    href                    href 
 "Amelia_1.2-13.tar.gz"  "Amelia_1.2-14.tar.gz"  "Amelia_1.2-15.tar.gz" 
                   href                    href                    href 
 "Amelia_1.2-16.tar.gz"  "Amelia_1.2-17.tar.gz"  "Amelia_1.2-18.tar.gz" 
                   href                    href                    href 
  "Amelia_1.5-4.tar.gz"   "Amelia_1.5-5.tar.gz"   "Amelia_1.6.1.tar.gz" 
                   href                    href                    href 
  "Amelia_1.6.3.tar.gz"   "Amelia_1.6.4.tar.gz"     "Amelia_1.7.tar.gz" 

For the sake of demonstration, imagine that, ultimately, we only want the links for the 1.2 versions of the package.

> wanted <- links[grepl("Amelia_1\\.2.*", links)]
> wanted
                  href                   href                   href 
 "Amelia_1.2-0.tar.gz"  "Amelia_1.2-1.tar.gz"  "Amelia_1.2-2.tar.gz" 
                  href                   href                   href 
 "Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz" "Amelia_1.2-13.tar.gz" 
                  href                   href                   href 
"Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz" "Amelia_1.2-16.tar.gz" 
                  href                   href 
"Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz" 

You can now use that vector as follows:

wanted <- links[grepl("Amelia_1\\.2.*", links)]
GetMe <- paste(url, wanted, sep = "")
lapply(seq_along(GetMe), 
       function(x) download.file(GetMe[x], wanted[x], mode = "wb"))
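
Applied back to your ICE directory, the same pattern would look roughly like the untested sketch below. The regular expression and the .dat extension are assumptions based on your description, and htmlParse may not handle an https URL with embedded credentials directly; if it doesn't, fetch the index page with RCurl::getURL first and pass the resulting text to htmlParse.

# Hypothetical adaptation to the directory from the question.
# `path` is the credential-embedded base URL built in the question's snippet.
doc   <- htmlParse(path)               # parse the directory listing page
links <- xpathSApply(doc, "//a/@href")
free(doc)

wanted <- links[grepl("^icecleared_power_.*\\.dat$", links)]
GetMe  <- paste(path, wanted, sep = "")

lapply(seq_along(GetMe),
       function(x) download.file(GetMe[x], wanted[x], mode = "wb"))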


Update (to clarify your question in comments)

The last step in the example above downloads the specified files to your current working directory (use getwd() to verify where that is). If, instead, you know for sure that read.csv works on the data, you can also try to modify your anonymous function to read the files directly:

lapply(seq_along(GetMe), 
       function(x) read.csv(GetMe[x], header = TRUE, sep = "|", as.is = TRUE))

However, I think a safer approach might be to download all the files into a single directory first, and then use read.delim or read.csv or whatever works to read in the data, similar to what was suggested by @Andreas. I say safer because it gives you more flexibility in case files aren't fully downloaded and so on. In that case, instead of having to re-download everything, you would only need to download the files that were not fully downloaded.
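
As a rough illustration of that two-step flow (assuming GetMe holds the full URLs of your .dat files and wanted their bare file names, as in the adaptation above; the local folder name "ice_data" is just a placeholder):

# Download everything into one local folder first, then read and combine.
dir.create("ice_data", showWarnings = FALSE)
dest <- file.path("ice_data", wanted)

for (i in seq_along(GetMe)) {
  if (!file.exists(dest[i]))           # only fetch files not already present
    download.file(GetMe[i], dest[i], mode = "wb")
}

# Read the local copies back in and stack them
df.all <- do.call(rbind,
                  lapply(dest, read.csv,
                         header = TRUE, sep = "|", as.is = TRUE))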
