Python Dask map_partitions

Problem description

Probably a continuation of this question, working from the dask docs examples for map_partitions.

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})
ddf = dd.from_pandas(df, npartitions=2)

from random import randint

def myadd(df):
    new_value = df.x + randint(1,4)
    return new_value

res = ddf.map_partitions(lambda df: df.assign(z=myadd)).compute()
res

In the above code, randint is only being called once, not once per row as I would expect. How come?

Output:

x  y  z
1  1  4
2  2  5
3  3  6
4  4  7
5  5  8

Recommended answer

If you performed the same operation (df.x + randint(1,4)) on the original pandas dataframe, you would get only one random number, added to every value in the column. The dask version does exactly the same as the pandas case, except that the function is called once for each partition; this is what map_partitions does.
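To make the pandas behaviour concrete, here is a minimal sketch using the same dataframe as the question. The names offset, shifted and differences are only illustrative. It shows that randint returns one integer before the addition happens, so every row receives the same offset.

```python
from random import randint

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})

# randint(1, 4) is evaluated first and returns a single int;
# pandas then broadcasts that scalar across the whole column.
offset = randint(1, 4)
shifted = df.x + offset

# Every row was shifted by the same amount, so only one distinct difference exists.
differences = (shifted - df.x).unique()
print(differences)
```

With map_partitions the same thing happens once per partition, so with npartitions=2 you can see at most two distinct offsets, and they may coincide by chance.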

If you wanted a new random number for every row, you should first think of how you would achieve this with pandas. I can immediately think of two:

df.x.map(lambda x: x + random.randint(1, 4))

df.x + np.random.randint(1, 5, size=len(df.x))  # NumPy's upper bound is exclusive, so 5 gives values 1..4

If you replace your new_value = line with one of these, it will work as expected.
