Import newest csv file in directory

Problem description

Goal:
- Import the newest file (.csv) from a local directory into R

Goal Details:
- A csv file is uploaded to a folder daily on my Mac. I would like to be able to incorporate a function in my R script that automatically imports the newest file into my workspace for further analysis. The file is uploaded daily around 4:30AM
- I would like this function to be run in the morning (no earlier than 6AM so there's plenty of time for leeway here)

Input Details:
- file type: .csv
- naming convention: example file name: "28 Jul 2014 04:37:47 -0400.csv"
- frequency: daily import @ ~ 04:30

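My attempt:
- I know this seems like a feeble attempt, but I really don't know how to modify the function below.
- My idea on paper is to grab the "id" of the newest file, then paste() it in front of the directory name, and voilà! (Sadly, my programming skills aren't up to coding this.)
- The code below is what I tried to run, but it just "hangs" and never completes. I found it on an R forum.

Code: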

lastChange = file.info(directory)$mtime 
while(TRUE){ 
  currentM = file.info(directory)$mtime 
  if(currentM != lastChange){ 
    lastChange = currentM 
    read.csv(directory) 
  } 
  # try again in 10 minutes 
  Sys.sleep(600) 
} 

My Environment:
- R 3.1
- Mac OS X 10.9.4 (Mavericks)

Thank you so much in advance for any help! :-)

Recommended answer

The following function uses a timestamp file to keep track of which files have already been processed. It can be run either continually in an R instance (as you first suggested) or as single-run instances, which lends itself to @andrew's suggestion of a cron job. (The cat() command is included primarily for testing; feel free to remove it.)

processDir <- function(directory = '.', pattern = '\\.csv$', loop = FALSE, delay = 600,
                       stampFile = file.path(directory, '.csvProcessor')) {
    if (! file.exists(stampFile))
        file.create(stampFile)
    firstRun <- TRUE
    while (firstRun || loop) {
        firstRun <- FALSE
        stampTime <- file.info(stampFile)$mtime
        allFilesDF <- file.info(list.files(path = directory, pattern = pattern,
                                           full.names = TRUE, no.. = TRUE))
        unprocessedFiles <- allFilesDF[(! allFilesDF$isdir) &
                                       (allFilesDF$mtime > stampTime), ]
        if (nrow(unprocessedFiles)) {
            ## We need to update the timestamp on stampFile quickly so
            ## that files added while this is running will be found in the
            ## next loop.
            ## WARNING: this blindly truncates the stampFile.
            file.create(stampFile, showWarnings = FALSE)
            for (fn in rownames(unprocessedFiles)) {
                cat('Processing ', fn, '\n')
                ## read.csv(fn)
                ## ...
            }
        }
        if (loop) Sys.sleep(delay)
    }
}
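
The commented-out read.csv(fn) line above is only a placeholder. As a rough sketch of one way to fill it in, the helper below reads a vector of paths into a named list of data frames. The name readCsvFiles is hypothetical and not part of the function above, and skipping unreadable files with a warning is an assumption about the behaviour you want.

```r
## Hypothetical helper (not part of the original answer): read a vector of
## csv paths into a list of data frames named after the files.
readCsvFiles <- function(paths, ...) {
    results <- lapply(paths, function(fn) {
        ## Assumption: a file that cannot be read is skipped with a warning
        ## instead of stopping the whole run.
        tryCatch(read.csv(fn, ...),
                 error = function(e) {
                     warning('Could not read ', fn, ': ', conditionMessage(e))
                     NULL
                 })
    })
    names(results) <- basename(paths)
    ## Drop the entries for files that failed to load.
    results[!vapply(results, is.null, logical(1))]
}
```

Inside processDir, something like readCsvFiles(rownames(unprocessedFiles)) could then replace the for loop, with the result returned or stored for further analysis.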

As you initially suggested, running the function in a continually running R instance is simply:

processDir(loop = TRUE)

To use @andrew's suggestion of a cron job, append the following line after the function definition:

processDir()

... and use a crontab file similar to the following:

# crontab
0 8 * * * path/to/Rscript path/to/processDir.R
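
If all you need is the single newest file rather than every unprocessed one, a simpler approach than the stamp file is to compare modification times directly. The following is a minimal sketch of that alternative, not the method used in processDir; the function name newestCsv is hypothetical, and it assumes the file's modification time reflects when it was uploaded.

```r
## Hypothetical helper (not from the original answer): return the path of the
## most recently modified csv file in `directory`, or NULL if there is none.
newestCsv <- function(directory = '.') {
    files <- list.files(path = directory, pattern = '\\.csv$',
                        full.names = TRUE, ignore.case = TRUE)
    if (length(files) == 0) return(NULL)
    info <- file.info(files)
    ## Ignore directories whose names happen to end in .csv.
    info <- info[!info$isdir, , drop = FALSE]
    if (nrow(info) == 0) return(NULL)
    ## Row names of file.info() are the paths that were passed in.
    rownames(info)[which.max(as.numeric(info$mtime))]
}

## Usage (the folder path is a placeholder):
## newest <- newestCsv('path/to/folder')
## if (!is.null(newest)) dat <- read.csv(newest)
```

Run from the same kind of cron job, this imports whatever was uploaded last, but unlike processDir it does not remember whether that file was already handled on a previous run.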

Hope this helps.
