dask s3 access on ec2 workers


Question

I am trying to read a lot of csv files from s3 with workers running on ec2 instances that have the right IAM roles (I can read from the same buckets from other scripts). When I try to read my own data from a private bucket with this command:

from dask.distributed import Client
from dask.dataframe import read_csv

client = Client('scheduler-on-ec2')
df = read_csv('s3://xyz/*csv.gz',
              compression='gzip',
              blocksize=None,
              #storage_options={'key': '', 'secret': ''}
             )
df.size.compute()

It looks like the data is read locally (by the local python interpreter, not by the workers), then sent to the workers (or the scheduler?) by the local interpreter, and only once the workers receive the chunks do they run the computation and return the results. The behaviour is the same with or without passing the key and secret via storage_options.

When I read from a public s3 bucket (the nyc taxi data) with storage_options={'anon': True}, everything looks okay.
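For reference, the anonymous read that works looks roughly like this (a sketch; the bucket path below is only a placeholder, not the exact path I use):

from dask.dataframe import read_csv

# Anonymous access: s3fs makes unsigned requests, so no credentials are needed.
# 's3://some-public-bucket/*.csv' stands in for the public taxi data path.
public_df = read_csv('s3://some-public-bucket/*.csv',
                     storage_options={'anon': True})
public_df.size.compute()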

What do you think the problem is, and what should I reconfigure so that the workers read directly from s3?

s3fs is installed correctly, and according to dask these are the supported filesystems:

>>> dask.bytes.core._filesystems
{'file': dask.bytes.local.LocalFileSystem,
 's3': dask.bytes.s3.DaskS3FileSystem}

Update

After monitoring the network interfaces, it looks like something is uploaded from the interpreter to the scheduler. The more partitions there are in the dataframe (or bag), the more data is sent to the scheduler. I thought it could be the computation graph, but it is really big: for 12 files it is 2-3 MB, for 30 files it is 20 MB, and for larger data (150 files) it simply takes too long to send to the scheduler, so I did not wait for it. What else is being sent to the scheduler that could take up this much data?
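This is roughly how I estimated the size of what gets submitted (a sketch; depending on the dask version the graph is exposed as df.dask or df.__dask_graph__()):

import cloudpickle

# Rough estimate of the serialized task graph that the client sends to the
# scheduler. Older dask versions expose the graph as `df.dask`, newer ones
# as `df.__dask_graph__()`.
graph = dict(df.dask)
print(len(graph), 'tasks')
print(len(cloudpickle.dumps(graph)), 'bytes when pickled')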

Answer

When you call dd.read_csv('s3://...'), the local machine reads a small amount of the data in order to guess column names, dtypes, etc. However, the workers read the majority of the data directly.
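You can see what was inferred from that small local sample without touching the cluster at all; a quick sketch:

# These are available immediately after read_csv returns, because they come
# from the small sample read on the client, not from the workers.
print(df.dtypes)
print(df.npartitions)   # with blocksize=None this is one partition per file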

When using the distributed scheduler, Dask does not load the data onto the local machine and then pump it out to the workers. As you suggest, that would be inefficient.

You might want to look at the web diagnostic pages to get more information about what is taking time.
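By default the scheduler serves the diagnostic dashboard on port 8787; a sketch, assuming a reasonably recent version of distributed:

# The Bokeh dashboard is served by the scheduler, on port 8787 by default,
# e.g. http://scheduler-on-ec2:8787/status
# Recent versions of distributed expose the URL directly on the client:
print(client.dashboard_link)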

