Loading local file from client onto dask distributed cluster


Problem description

A bit of a beginner question, but I was not able to find a relevant answer on this.

Essentially my data (about 7 GB) is located on my local machine. I have a distributed cluster running on the local network. How can I get this file onto the cluster?

The usual dd.read_csv() or read_parquet() fails, as the workers aren't able to locate the file in their own environments.

Would I need to manually transfer the file to each node in the cluster?

Note: Due to admin restrictions I am limited to SFTP...

Recommended answer

Two options

As suggested in the comments, there are various ways to make your local file accessible to other machines in your cluster using normal file system solutions. This is a great choice if such a solution is available to you.
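If SFTP is the only channel available (as in the note above), one possibility is simply to copy the file to the same path on every worker node before reading it there. This is only a hedged sketch using the paramiko library, which is not part of the original answer; the hostnames, username and paths are placeholders:

import paramiko

worker_hosts = ['worker-1', 'worker-2', 'worker-3']  # your worker machines

for host in worker_hosts:
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username='user')           # or pass key_filename=... for key auth
    sftp = ssh.open_sftp()
    sftp.put('myfile.csv', '/data/myfile.csv')   # same remote path on every node
    sftp.close()
    ssh.close()

Once the file exists at the same path on every node, the workers can read it themselves, e.g. with dd.read_csv('/data/myfile.csv').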

If that doesn't work then you can always load data locally and scatter it out to the various workers of your cluster. If your file is larger than the memory of your single computer then you might have to do this piece by piece.

If everything fits in memory then I would load the data normally and then scatter it out to a worker. You could split it out afterwards and spread it to other workers if desired:

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client('scheduler-address:8786')

df = pd.read_csv('myfile.csv')
future = client.scatter(df)  # send dataframe to one worker
ddf = dd.from_delayed([future], meta=df)  # build dask.dataframe on remote data
ddf = ddf.repartition(npartitions=20).persist()  # split
client.rebalance(ddf)  # spread around all of your workers



Multiple bits

If you have multiple small files then you can iteratively load and scatter, perhaps in a for loop, and then make a dask.dataframe from the many futures:

futures = []
for fn in filenames:
    df = pd.read_csv(fn)           # load one small file locally
    future = client.scatter(df)    # send it to a worker on the cluster
    futures.append(future)

ddf = dd.from_delayed(futures, meta=df)  # combine the pieces into a single dask.dataframe

In this case you could probably skip the repartition and rebalance steps.

If you have a single large file then you would probably have to do some splitting of it yourself, for example with pd.read_csv(..., chunksize=...).
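A hedged sketch of that chunked approach, reusing the client and imports from the first snippet above (the filename and chunk size are placeholders, not part of the original answer):

futures = []
meta = None
for chunk in pd.read_csv('myfile.csv', chunksize=500_000):  # rows per chunk; tune to fit memory
    futures.append(client.scatter(chunk))  # ship each chunk to the cluster as it is read
    if meta is None:
        meta = chunk.iloc[:0]              # empty frame carrying the column names and dtypes

ddf = dd.from_delayed(futures, meta=meta)  # one dask.dataframe partition per chunk
ddf = ddf.persist()                        # keep the partitions in cluster memory
client.rebalance(ddf)                      # spread them across the workers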

