How to do row processing and item assignment in Dask


Question

Similar unanswered question: Row by row processing of a Dask DataFrame

I'm working with dataframes that are millions of rows long, so I'm trying to have all dataframe operations performed in parallel. One such operation I need to convert to Dask is:

for row in df.itertuples():
    ratio = row.ratio
    tmpratio = row.tmpratio
    tmplabel = row.tmplabel
    if tmpratio > ratio:
        df.loc[row.Index, 'ratio'] = tmpratio
        df.loc[row.Index, 'label'] = tmplabel

What is the appropriate way to set a value by index in Dask, or to conditionally set values in rows? .loc doesn't support item assignment in Dask, and there does not appear to be a set_value, at[], or iat[] in Dask either.

I have attempted to use map_partitions with assign, but I don't see any way to perform conditional assignment at the row level.

Recommended answer

Dask dataframe does not support efficient iteration or row assignment. In general, these workflows rarely scale well. They are also quite slow in pandas itself.

Instead, you might consider using the Series.where method, which operates on whole columns at once. Here is a minimal example:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 2, 1]})

In [3]: import dask.dataframe as dd

In [4]: ddf = dd.from_pandas(df, npartitions=2)

In [5]: ddf['z'] = ddf.x.where(ddf.x > ddf.y, ddf.y)

In [6]: ddf.compute()
Out[6]:
   x  y  z
0  1  3  3
1  2  2  2
2  3  1  3
