dask s3 access on ec2 workers


Question

I am trying to read a lot of csv files from s3 with workers running on ec2 instances that have the right IAM roles (I can read from the same buckets from other scripts). When I try to read my own data from a private bucket with this command:

from dask.distributed import Client
from dask.dataframe import read_csv

client = Client('scheduler-on-ec2')
df = read_csv('s3://xyz/*csv.gz',
              compression='gzip',
              blocksize=None,
              #storage_options={'key': '', 'secret': ''}
             )
df.size.compute()

It looks like the data is read locally (by the local python interpreter, not by the workers), then sent to the workers (or the scheduler?) by the local interpreter, and only once the workers receive the chunks do they run the computation and return the results. The behaviour is the same with or without passing the key and secret via storage_options.

When I read from a public s3 bucket (the nyc taxi data) with storage_options={'anon': True}, everything looks okay.
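For reference, the anonymous read that works looks roughly like this (a sketch; the bucket path below is only a placeholder, not the exact path I use):

from dask.dataframe import read_csv

# Anonymous access: s3fs makes unsigned requests, so no credentials are needed.
# 's3://some-public-bucket/*.csv' stands in for the public taxi data path.
public_df = read_csv('s3://some-public-bucket/*.csv',
                     storage_options={'anon': True})
public_df.size.compute()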

What do you think the problem is, and what should I reconfigure so that the workers read directly from s3?

s3fs is installed correctly, and according to dask these are the supported filesystems:

>>> dask.bytes.core._filesystems
{'file': dask.bytes.local.LocalFileSystem,
 's3': dask.bytes.s3.DaskS3FileSystem}

Update

After monitoring the network interfaces, it looks like something is uploaded from the interpreter to the scheduler. The more partitions there are in the dataframe (or bag), the more data is sent to the scheduler. I thought it could be the computation graph, but it is really big: for 12 files it is 2-3 MB, for 30 files it is 20 MB, and for larger data (150 files) it simply takes too long to send to the scheduler, so I did not wait for it. What else is being sent to the scheduler that could take up this much data?
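This is roughly how I estimated the size of what gets submitted (a sketch; depending on the dask version the graph is exposed as df.dask or df.__dask_graph__()):

import cloudpickle

# Rough estimate of the serialized task graph that the client sends to the
# scheduler. Older dask versions expose the graph as `df.dask`, newer ones
# as `df.__dask_graph__()`.
graph = dict(df.dask)
print(len(graph), 'tasks')
print(len(cloudpickle.dumps(graph)), 'bytes when pickled')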

Answer

When you call dd.read_csv('s3://...'), the local machine reads a small amount of the data in order to guess column names, dtypes, etc. However, the workers read the majority of the data directly.
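You can see what was inferred from that small local sample without touching the cluster at all; a quick sketch:

# These are available immediately after read_csv returns, because they come
# from the small sample read on the client, not from the workers.
print(df.dtypes)
print(df.npartitions)   # with blocksize=None this is one partition per file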

When using the distributed scheduler, Dask does not load the data onto the local machine and then pump it out to the workers. As you suggest, that would be inefficient.

You might want to look at the web diagnostic pages to get more information about what is taking time.
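By default the scheduler serves the diagnostic dashboard on port 8787; a sketch, assuming a reasonably recent version of distributed:

# The Bokeh dashboard is served by the scheduler, on port 8787 by default,
# e.g. http://scheduler-on-ec2:8787/status
# Recent versions of distributed expose the URL directly on the client:
print(client.dashboard_link)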

