Using setuptools, how can I download external data upon installation?


Question

I'd like to create some ridiculously easy-to-use pip packages for loading common machine-learning datasets in Python. (Yes, some packages already exist, but I want it to be even simpler.)

Here's what I want to happen:

  • The user runs pip install dataset.
  • pip downloads the dataset, say via wget http://mydata.com/data.tar.gz. Note that the data does not reside in the Python package itself, but is downloaded from somewhere else.
  • pip extracts the data from this file and puts it in the directory that the package is installed in. (This isn't ideal, but the datasets are pretty small, so let's assume storing the data there isn't a big deal.)
  • Later, when the user imports my module, the module automatically loads the data from that location.

This question is about bullets 2 and 3. Is there a way to do this with setuptools?

Recommended answer

As Kevin alluded to, Python package installs should be completely reproducible, so any potential external-download issues should be pushed to runtime. This therefore shouldn't be handled with setuptools.

Instead, to avoid burdening the user, consider downloading the data lazily, the first time it is loaded. Example:

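import os
import tarfile
import urllib.request

DATA_DIR = os.path.join(os.path.expanduser('~'), '.dataset')  # assumed cache location

def download_data(url='http://...'):
    # Download the archive and extract it into DATA_DIR.
    # urlretrieve raises an exception if the link is bad, or we can't connect, etc.
    archive, _ = urllib.request.urlretrieve(url)
    with tarfile.open(archive) as tar:
        tar.extractall(DATA_DIR)

def load_data():
    if not os.path.exists(DATA_DIR):
        download_data()
    data = read_data_from_disk(DATA_DIR)  # package-specific reader, not shown
    return data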

We could then describe download_data in the docs, but the majority of users would never need to bother with it. This is somewhat similar to how the imageio module downloads the decoders it needs at runtime, rather than making users manage the external downloads themselves.
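For readers who want something they can run end to end, here is a slightly fuller sketch of the same lazy-download idea. It is one possible arrangement, not the answer's own implementation: DATA_URL reuses the example address from the question, DATA_DIR is an assumed cache location, and the dictionary returned by load_data is only a stand-in for a real dataset reader. The archive is extracted into a temporary directory and moved into place at the end, so an interrupted download should not leave a half-populated DATA_DIR behind.

```python
import os
import shutil
import tarfile
import tempfile
import urllib.request
from pathlib import Path

# Assumptions for illustration only: the URL is the example from the question,
# and the cache location is a hypothetical choice, not part of the answer.
DATA_URL = "http://mydata.com/data.tar.gz"
DATA_DIR = Path.home() / ".cache" / "dataset"


def download_data(url=DATA_URL, data_dir=DATA_DIR):
    """Download a tar archive from `url` and extract it into `data_dir`.

    urlretrieve raises an exception (for example urllib.error.URLError)
    if the link is bad or the connection fails.
    """
    data_dir = Path(data_dir)
    data_dir.parent.mkdir(parents=True, exist_ok=True)
    # Work in a temporary sibling directory and move the result into place
    # only once extraction has succeeded.
    tmp_dir = Path(tempfile.mkdtemp(dir=data_dir.parent))
    try:
        archive = tmp_dir / "data.tar.gz"
        urllib.request.urlretrieve(url, str(archive))
        extracted = tmp_dir / "extracted"
        with tarfile.open(archive) as tar:
            tar.extractall(extracted)
        os.replace(extracted, data_dir)
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)


def load_data(url=DATA_URL, data_dir=DATA_DIR):
    """Return the dataset, downloading it first if it is not cached yet."""
    data_dir = Path(data_dir)
    if not data_dir.exists():
        download_data(url, data_dir)
    # Stand-in reader: map each extracted file's relative path to its text.
    return {
        path.relative_to(data_dir).as_posix(): path.read_text()
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }
```

With this layout, a user would simply call load_data(); the first call pays the download cost and later calls read from the cached directory. Keeping the data in a per-user cache directory rather than inside the installed package is a design choice made here so that the package directory stays read-only after installation.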
