Import newest csv file in directory
Question
Goal:
- Import the newest file (.csv) from a local directory into R
Goal Details:
- A csv file is uploaded to a folder daily on my Mac. I would like to be able to incorporate a function in my R script that automatically imports the newest file into my workspace for further analysis. The file is uploaded daily around 4:30AM
- I would like this function to be run in the morning (no earlier than 6AM so there's plenty of time for leeway here)
Input Details:
- file type: .csv
- naming convention: example file name: "28 Jul 2014 04:37:47 -0400.csv"
- frequency: daily import @ ~ 04:30
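My Attempt:
- I know it seems like a weak attempt, but I'm really at a loss for how to modify the function below.
- My idea on paper is to grab the "id" of the newest file, then paste() it onto the directory name, and voila! (Sadly, my programming skills are not up to coding this.)
- The code below is what I'm trying to run, but it just "hangs" and won't complete. I found it on an R forum.
Code: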
lastChange = file.info(directory)$mtime
while(TRUE){
  currentM = file.info(directory)$mtime
  if(currentM != lastChange){
    lastChange = currentM
    read.csv(directory)
  }
  # try again in 10 minutes
  Sys.sleep(600)
}
My Environment:
- R 3.1
- Mac OS X 10.9.4 (Mavericks)
Thank you so much in advance for any help! :-)
Recommended answer
The following function uses a timestamp file to keep track of which files have already been processed. It can be run either continually in an R instance (as you first suggested) or as single-run instances, which lends itself to @andrew's suggestion of a cron job. (The cat() command is included primarily for testing; feel free to remove it.)
## 'pattern' is a regular expression (not a shell glob), as list.files() expects.
processDir <- function(directory = '.', pattern = '\\.csv$', loop = FALSE, delay = 600,
                       stampFile = file.path(directory, '.csvProcessor')) {
  if (! file.exists(stampFile))
    file.create(stampFile)
  firstRun <- TRUE
  while (firstRun || loop) {
    firstRun <- FALSE
    stampTime <- file.info(stampFile)$mtime
    allFilesDF <- file.info(list.files(path = directory, pattern = pattern,
                                       full.names = TRUE, no.. = TRUE))
    ## Keep only regular files modified since the stamp file was last touched.
    unprocessedFiles <- allFilesDF[(! allFilesDF$isdir) &
                                   (allFilesDF$mtime > stampTime), ]
    if (nrow(unprocessedFiles)) {
      ## We need to update the timestamp on stampFile quickly so
      ## that files added while this is running will be found in the
      ## next loop.
      ## WARNING: this blindly truncates the stampFile.
      file.create(stampFile, showWarnings = FALSE)
      for (fn in rownames(unprocessedFiles)) {
        cat('Processing ', fn, '\n')
        ## read.csv(fn)
        ## ...
      }
    }
    if (loop) Sys.sleep(delay)
  }
}
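If all that is needed is the single newest file, as in the original goal, a much smaller helper may be enough. The following is only a minimal sketch, not part of the function above: newestCsv() and readNewestCsv() are hypothetical names introduced here, and "newest" is assumed to mean the most recent modification time reported by file.info().

```r
## Hypothetical helper: path of the most recently modified .csv file in
## 'directory', or NA if there is none. Assumes "newest" means latest mtime.
newestCsv <- function(directory = '.') {
  files <- list.files(path = directory, pattern = '\\.csv$', full.names = TRUE)
  if (length(files) == 0) return(NA_character_)
  info <- file.info(files)
  info <- info[! info$isdir, , drop = FALSE]
  if (nrow(info) == 0) return(NA_character_)
  rownames(info)[which.max(as.numeric(info$mtime))]
}

## Hypothetical wrapper: read the newest .csv file into a data frame.
## Extra arguments are passed through to read.csv().
readNewestCsv <- function(directory = '.', ...) {
  fn <- newestCsv(directory)
  if (is.na(fn)) stop('no .csv files found in ', directory)
  read.csv(fn, ...)
}
```

Calling something like dat <- readNewestCsv('path/to/folder') from the daily script would then load whichever file was modified last, without needing a stamp file. The trade-off is that it does not remember what has already been processed, so running it twice reads the same file twice.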
As you initially suggested, running it in a continually-running R instance would simply be:
processDir(loop = TRUE)
To use @andrew's suggestion of a cron job, append the following line after the function definition:
processDir()
... and use a crontab file similar to the following:
# crontab
0 8 * * * path/to/Rscript path/to/processDir.R
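Since the file names in the question already carry a timestamp ("28 Jul 2014 04:37:47 -0400.csv"), another option is to choose the newest file from its name rather than from its modification time. This is a rough sketch under assumptions: newestByName() is a hypothetical helper, every file of interest is assumed to follow that exact naming convention, and the month abbreviation is parsed by strptime() according to the current locale (English month names are assumed).

```r
## Hypothetical helper: pick the file whose *name* encodes the latest timestamp.
## Assumes names like "28 Jul 2014 04:37:47 -0400.csv"; names that do not
## parse are ignored. Returns NA if nothing parses.
newestByName <- function(files, format = '%d %b %Y %H:%M:%S %z') {
  stamps <- sub('\\.csv$', '', basename(files))
  times <- as.POSIXct(strptime(stamps, format = format, tz = 'UTC'))
  if (all(is.na(times))) return(NA_character_)
  files[which.max(as.numeric(times))]
}
```

For example, newestByName(list.files('path/to/folder', pattern = '\\.csv$', full.names = TRUE)) would return the path to pass to read.csv(). This can be more robust than mtime if files are ever copied or touched after upload, at the cost of depending on the naming convention.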
Hope this helps.