How to run Dask on multiple machines?
Question
I discovered Dask recently. I have some very basic questions about Dask DataFrame and other data structures.
- Is a Dask DataFrame an immutable data type?
- Are Dask arrays and DataFrames lazy data structures?
I don't know whether to use Dask, Spark, or pandas for my situation. I have 200 GB of data to process. A plain Python program took 9 hours to run the computation, but the work could be done in parallel in less time on a 16-core processor. If I split the DataFrame into chunks in pandas, I need to worry about whether my calculations are commutative and associative. On the other hand, I could use a standalone Spark cluster to just split up the data and run in parallel.
- Do I need to set up a cluster for Dask, as I would for Spark?
- How do I run Dask DataFrames on my own compute nodes?
- Does Dask need a master-slave setup?
I am a fan of pandas, so I am looking for a solution similar to pandas.
Recommended answer
There seem to be a few questions here.
Is a Dask DataFrame immutable? Not strictly. Dask DataFrames support column assignment. In general, though, you're correct that most of the mutation operations of pandas are not supported.
Are Dask arrays and DataFrames lazy? Yes.
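A short sketch of what that laziness looks like with a Dask array (the shape and chunk sizes are arbitrary): building the expression only records a task graph, and the work happens when `compute()` is called.

```python
import dask.array as da

# A 1000 x 1000 array split into 16 chunks; no data is materialized yet.
x = da.ones((1000, 1000), chunks=(250, 250))

# Still lazy: `y` is a 0-d Dask array describing the work to do.
y = (x + 1).sum()

# compute() runs the task graph and returns a concrete value.
value = y.compute()
```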
Do you need to set up a cluster? No, you can choose to run Dask on a cluster or on a single machine.
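For the single-machine case, a sketch like the following needs no cluster at all; it relies on Dask's built-in local schedulers, selected here through the `scheduler=` keyword of `compute()`. The array size and chunking are arbitrary.

```python
import dask.array as da

# One million integers in ten chunks, processed on the local machine only.
x = da.arange(1_000_000, chunks=100_000, dtype="int64")
total = (x * 2).sum()

# Local thread pool: uses the cores of this machine, no setup required.
threaded = total.compute(scheduler="threads")

# Single-threaded execution, handy for debugging.
synchronous = total.compute(scheduler="synchronous")
```

Both calls evaluate the same task graph; only the local execution strategy differs.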
How do you run it on your own compute nodes? See the documentation for Dask.distributed, the setup docs in particular.
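As a rough sketch of the distributed workflow: a scheduler process runs on one machine, worker processes on the others connect to it (the `dask-scheduler` and `dask-worker` commands are one way to start them), and the client code connects to the scheduler's address. To keep the example runnable on one machine, an in-process `LocalCluster` stands in for the real deployment. The host name in the comment is hypothetical.

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# Stand-in for a real multi-machine deployment: an in-process cluster
# with two workers and the dashboard disabled.
cluster = LocalCluster(n_workers=2, threads_per_worker=1,
                       processes=False, dashboard_address=None)

# On real hardware this would be the address of a running scheduler,
# e.g. Client("tcp://scheduler-host:8786")  (hypothetical host name).
client = Client(cluster.scheduler_address)

# Once a Client exists, compute() sends the work to the cluster's workers.
x = da.ones((2000, 2000), chunks=(500, 500))
result = x.sum().compute()

client.close()
cluster.close()
```

The code that builds and computes the array is the same as in the single-machine case; only the `Client` connection changes.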
Dask, Spark, or pandas? This question is overly broad and depends on the situation.