Dask Dataframe Efficient Row Pair Generator?


Problem description

What I want to achieve, in terms of input and output, is a cross join.

Example input:

df = pd.DataFrame(columns = ['A', 'val'], data = [['a1', 23],['a2', 29], ['a3', 39]])
print(df)

    A  val
0  a1   23
1  a2   29
2  a3   39

Example output:

df['key'] = 1
df.merge(df, how = "outer", on ="key")

  A_x  val_x  key A_y  val_y
0  a1     23    1  a1     23
1  a1     23    1  a2     29
2  a1     23    1  a3     39
3  a2     29    1  a1     23
4  a2     29    1  a2     29
5  a2     29    1  a3     39
6  a3     39    1  a1     23
7  a3     39    1  a2     29
8  a3     39    1  a3     39

How do I achieve this for a large dataset with Dask?

I am interested in getting all row-pair combinations of a Dask DataFrame (similar to a Cartesian product) in order to calculate inter-row metrics such as distances, but I always get a memory error when running Dask Distributed locally. I have provided a toy example of what I am trying to achieve.

I am new to Dask, so I just want to know: is this even possible locally? What should my ideal partition size be? Is there a better way to get row pairs using Dask?

import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # start a local distributed cluster
client

# 10,000 rows x 50 float64 columns
df = pd.DataFrame(columns=list(range(50)), data=np.random.rand(10000, 50))
ddf = dd.from_pandas(df, npartitions=10)

# constant key, so merging on it produces the full cross join
ddf = ddf.assign(key=0)
ddf = dd.merge(ddf, ddf, suffixes=('', '_ch'), on='key',
               npartitions=10000, how='outer')
ddf[0].mean().compute()

I get the following error:

MemoryError: Unable to allocate 37.3 GiB for an 
array with shape (100000000, 50) and data type float64

Local cluster details:

Scheduler: tcp://127.0.0.1:52435
Dashboard: http://127.0.0.1:8787/status
Cluster
Workers: 4
Cores: 12
Memory: 34.10 GB


Answer

A full outer product is likely to generate a very large dataset. Suppose each of your intermediate pandas dataframes has only a million rows; the cross product of that dataframe with itself would hold one trillion rows, and so would almost certainly blow out the memory on that machine.
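As a rough back-of-the-envelope check (a sketch added here, not part of the original answer), the toy example above already shows the problem: crossing a 10,000-row frame with itself yields 10^8 rows, and 50 float64 columns at that size need about 37.3 GiB, which is exactly what the MemoryError reports.

rows = 10_000 * 10_000          # cross join of 10,000 rows with itself -> 1e8 rows
cols = 50                       # float64 columns on one side of the merge
bytes_needed = rows * cols * 8  # 8 bytes per float64 value
print(bytes_needed / 2**30)     # ~37.25 GiB, matching the error above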

I suspect that you actually want to do something else, and that this is a step in that direction. I recommend trying to find another path to what you want. For example, if you want correlations, try the corr method.
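A minimal sketch of that suggestion, assuming the end goal is a pairwise column correlation (so the cross join never has to be materialised):

import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame(columns=list(range(50)), data=np.random.rand(10000, 50))
ddf = dd.from_pandas(df, npartitions=10)

# Dask builds the correlation from per-partition aggregates,
# so no huge intermediate row-pair table is ever created.
corr = ddf.corr().compute()  # 50 x 50 pandas DataFrame
print(corr.shape)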
