Package that downloads data from the internet during installation


Question

Is anyone aware of a package that downloads a dataset from the internet during installation, then prepares and saves it so that it is available when the package is loaded with library(packageName)? Are there any drawbacks to this approach (besides the obvious one that package installation will fail if the data source is unavailable or the data format has changed)?

EDIT: Some background. The data consists of three tab-separated files in a ZIP archive, owned by federal statistics and generally freely accessible. I have R code that downloads, extracts, and prepares the data; in the end, three data frames are created, which could be saved in .RData format.
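
The download, extract, and prepare workflow described in the edit could look roughly like the sketch below. It is not the asker's actual code: the function names read_tab_files and prepare_dataset, the file-extension pattern, and the zip_url argument are all hypothetical placeholders.

```r
# Hypothetical helper: read every tab-separated file under `dir` into a named
# list of data frames (names are the file names without their extension).
# The extension pattern is an assumption; adjust it to the real archive.
read_tab_files <- function(dir, pattern = "\\.(txt|tsv)$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE,
                      recursive = TRUE)
  frames <- lapply(files, utils::read.delim, stringsAsFactors = FALSE)
  names(frames) <- tools::file_path_sans_ext(basename(files))
  frames
}

# Hypothetical wrapper: download the ZIP archive, extract it, and save the
# resulting data frames to `out_file` in .RData format.
# `zip_url` is a placeholder for the real data source.
prepare_dataset <- function(zip_url, out_file = "DATAFILE.RData") {
  zip_path <- tempfile(fileext = ".zip")
  extract_dir <- tempfile("extracted")
  on.exit(unlink(c(zip_path, extract_dir), recursive = TRUE))

  utils::download.file(zip_url, zip_path, mode = "wb")
  utils::unzip(zip_path, exdir = extract_dir)

  frames <- read_tab_files(extract_dir)
  # Save each data frame under its own name, as load() would restore it.
  save(list = names(frames), envir = list2env(frames), file = out_file)
  invisible(names(frames))
}
```

Keeping the reading step separate from the download step means the preparation logic can be checked against local files without touching the network.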

I am thinking about creating two packages: a "data" package that provides the data, and a "code" package that operates on it.

Recommended answer

I put this mockup together while you were posting your edit. I presume it would work, but it is untested. I've commented it so you can see what you would need to change. The idea is to check whether an expected object is available in the current workspace. If it is not, check whether the file containing the data is in the current working directory. If that is not found either, prompt the user to download the file, then proceed from there.

myFunction <- function(this, that, dataset = NULL) {

  # We're giving the user a chance to specify the dataset.
  #   Maybe they have already downloaded it and saved it.
  #   (dataset defaults to NULL so the is.null() check works
  #   when the argument is omitted.)
  if (is.null(dataset)) {

    # Check to see if the object is already in the workspace.
    # If it is not, check to see whether the .RData file that
    #   contains the object is in the current working directory.
    if (!exists("OBJECTNAME", where = 1)) {
      if (file.exists("DATAFILE.RData")) {
        load("DATAFILE.RData")

        # If neither of those is successful, prompt the user
        #   to download the dataset.
      } else {
        ans <- readline(paste0(
          "DATAFILE.RData dataset not found in working directory.\n",
          "OBJECTNAME object not found in workspace.\n",
          "Download and load the dataset now? (y/n) "))
        if (ans != "y")
          return(invisible())

        # I usually use RCurl in case the URL is https
        require(RCurl)
        baseURL <- "http://some/base/url/"

        # Here, we actually download the data (as a raw vector)
        temp <- getBinaryURL(paste0(baseURL, "DATAFILE.RData"))

        # Here we load the data into the user's workspace
        load(rawConnection(temp), envir = .GlobalEnv)
        message("OBJECTNAME data downloaded from\n",
                paste0(baseURL, "DATAFILE.RData\n"),
                "and added to your workspace\n")
        rm(temp, baseURL)
      }
    }
    dataset <- OBJECTNAME
  }
  TEMP <- dataset
  ## Other fun stuff with TEMP, this, and that.
}


Two packages, hosted at Github

Here's another approach, building on the comments between @juba and me. The basic concept is to have, as you describe, one package for the code and one for the data. This function would be part of the package that contains your code. It will:

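  1. Check whether the data package is installed.
  2. Check whether the installed version of the data package matches the version at Github, which we will assume to be the most current version.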

When either check fails, it asks the user whether they want to update their installation of the package. For demonstration, I've linked to one of my in-progress packages at Github. This should give you an idea of what you need to substitute to get it to work with your own package once you've hosted it there.

CheckVersionFirst <- function() {
  # Check to see if installed
  if (!"StataDCTutils" %in% rownames(installed.packages())) {
    Checks <- "Failed"
  } else {
    # Compare version numbers
    require(RCurl)
    temp <- getURL("https://raw.github.com/mrdwab/StataDCTutils/master/DESCRIPTION")
    CurrentVersion <- trimws(
      gsub(".*Version:(.*)\\nDate.*", "\\1", temp))
    # An installed version newer than the one at Github also passes,
    #   so Checks is always defined before the switch() below.
    if (packageVersion("StataDCTutils") >= CurrentVersion) {
      Checks <- "Passed"
    } else {
      Checks <- "Failed"
    }
  }

  switch(
    Checks,
    Passed = { message("Everything looks OK! Proceeding!") },
    Failed = {
      ans <- readline(
        "StataDCTutils is either outdated or not installed. Update now? (y/n) ")
      if (ans != "y")
        return(invisible())
      require(devtools)
      # Current devtools expects a single "user/repo" argument
      install_github("mrdwab/StataDCTutils")
    })
  # Some cool things you want to do after you are sure the data is there
}

Try it out with CheckVersionFirst().

Note: This will work only if you religiously remember to update the version number in your DESCRIPTION file every time you push a new version of the data to Github!
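
Because the whole check hinges on the Version field, it may be worth parsing DESCRIPTION with base R's read.dcf() rather than a regular expression that assumes a Date field follows Version. The following is only a sketch; description_version is a hypothetical helper name, and the DESCRIPTION text in the usage lines is made up for illustration.

```r
# Hypothetical helper: extract the Version field from the text of a
# DESCRIPTION file (for example, the string returned by getURL()).
description_version <- function(description_text) {
  # Split into lines so the text connection sees one field per line.
  lines <- strsplit(description_text, "\r?\n")[[1]]
  con <- textConnection(lines)
  on.exit(close(con))
  fields <- read.dcf(con, fields = "Version")
  package_version(fields[1, "Version"])
}

# Usage with made-up DESCRIPTION text:
desc <- "Package: demo\nVersion: 0.2.1\nDate: 2013-01-01\n"
remote <- description_version(desc)
remote > "0.2.0"   # version-aware comparison, not string comparison
```

The result is a package_version object, so it can be compared directly with packageVersion("YOURDATAPACKAGE") and does not depend on the order of fields in the file.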

So, to clarify/recap/expand, the basic idea would be to:

  • Periodically push the updated version of your data package to Github, being sure to change the version number of the data package in its DESCRIPTION file when you do so.
  • Integrate this CheckVersionFirst() function as an .onLoad event in your code package. (Obviously, modify the function to match your account and package name.)
  • Change the commented line that reads # Some cool things you want to do after you are sure the data is there to reflect the cool things you actually want to do, which would probably start with library(YOURDATAPACKAGE) to load the data.
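
As a rough sketch of the second point, the hook could live in a file such as R/zzz.R in the code package. The helper ensure_data_package is hypothetical, YOURDATAPACKAGE is the placeholder used above, and CheckVersionFirst() is assumed to be defined elsewhere in the same package. The sketch only prompts in interactive sessions, since readline() cannot collect an answer otherwise.

```r
# Hypothetical helper: return TRUE (invisibly) if `pkg` can be loaded,
# otherwise call `on_missing` and return its result.
ensure_data_package <- function(pkg,
                                on_missing = function() invisible(FALSE)) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(TRUE))
  }
  on_missing()
}

# Load hook for the code package. CheckVersionFirst() is the function from
# the answer and is assumed to exist in this package.
.onLoad <- function(libname, pkgname) {
  ensure_data_package("YOURDATAPACKAGE", on_missing = function() {
    if (interactive()) {
      CheckVersionFirst()
    } else {
      warning("YOURDATAPACKAGE is not installed; data will be unavailable.",
              call. = FALSE)
    }
  })
}
```

This variant only reacts to a missing data package; calling CheckVersionFirst() unconditionally in interactive sessions would also cover the outdated-version case, at the cost of a network request on every load.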
