Row by row processing of a Dask DataFrame

Problem description

I need to process a large file and to change some values.

I would like to do something like:

for index, row in dataFrame.iterrows():
    foo = doSomeStuffWith(row)
    lol = doOtherStuffWith(row)

    dataFrame['colx'][index] = foo
    dataFrame['coly'][index] = lol

Bad for me, I cannot do dataFrame['colx'][index] = foo!

My number of rows is quite large and I need to process a large number of columns, so I'm afraid that Dask may read the file several times if I do one dataFrame.apply(...) per column.

Other solutions are to manually break my data into chunks and use pandas, or to just throw everything into a database. But it would be nice if I could keep using my .csv and let Dask do the chunk processing for me!

Thanks for your help.

Recommended answer

In general, iterating over a dataframe, whether Pandas or Dask, is likely to be quite slow. Additionally, Dask won't support row-wise element insertion. This kind of workload is difficult to scale.

Instead, I recommend using dd.Series.where (see this answer), or else doing your iteration in a function (after making a copy so as not to operate in place) and then using map_partitions to call that function across all of the Pandas dataframes in your Dask dataframe.
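A rough sketch of the map_partitions approach follows. The file name and the actual transformations (the * 2 and the .where threshold) are placeholders standing in for the asker's doSomeStuffWith / doOtherStuffWith; only the structure matters here:

import pandas as pd
import dask.dataframe as dd

def process_partition(df):
    # Each partition arrives as a plain pandas DataFrame.
    # Copy first so we don't modify the partition in place.
    df = df.copy()
    # Placeholder transformations; vectorised pandas operations
    # are much faster than iterrows.
    df['colx'] = df['colx'] * 2
    df['coly'] = df['coly'].where(df['coly'] > 0, 0)
    return df

ddf = dd.read_csv('data.csv')                # Dask reads the csv in chunks
ddf = ddf.map_partitions(process_partition)  # runs once per partition
ddf.to_csv('processed-*.csv', index=False)   # one output file per partition

The dd.Series.where route works directly on the Dask columns without a helper function, e.g. ddf['colx'] = ddf['colx'].where(ddf['colx'] > 0, 0). Either way, Dask keeps streaming the .csv chunk by chunk instead of loading it all at once.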
