On Dask DataFrame.apply(), receiving n rows of value 1 before actual rows processed

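Problem description

In the code snippet below, I would expect the logs to print the numbers 0 - 4. I understand that the numbers may not appear in that order, as the task would be broken up into a number of parallel operations.

Code snippet: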

from dask import dataframe as dd
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(5),
                   'B': np.arange(5),
                   'C': np.arange(5)})

ddf = dd.from_pandas(df, npartitions=1)

def aggregate(x):
    print('B val received: ' + str(x.B))
    return x

ddf.apply(aggregate, axis=1).compute()

But when I run the code above, I see this instead:

B val received: 1
B val received: 1
B val received: 1
B val received: 0
B val received: 0
B val received: 1
B val received: 2
B val received: 3
B val received: 4

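Instead of 0 - 4, I see a series of 1s printed first, and an extra 0. I have noticed these "extra" rows of value 1 every time I set up a Dask DataFrame and run an apply operation on it.

Printing the dataframe shows no additional rows with value 1: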

   A  B  C
0  0  0  0
1  1  1  1
2  2  2  2
3  3  3  3
4  4  4  4

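My question is: where do these rows with value 1 come from? Why do they consistently appear before the "actual" rows in the dataframe? The 1 values seem unrelated to the values in the actual rows (that is, it does not look like the second row is being picked up a few extra times for some reason).

Solution

Dask checks what you have told it to do before it runs the operation on the entire collection of partitions. That is where the first few print statements come from. It is part of the built-in error checking that keeps Dask from going through a long series of operations only to fail at the end. Specifically, when no meta argument is passed to apply, Dask calls the function on a small placeholder DataFrame to infer the output's columns and dtypes, so the 1 values are placeholder data rather than rows of the real dataframe.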
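To make the mechanism concrete, here is a toy sketch in plain Python of a lazy apply that first calls the function on a placeholder row to learn the output's shape. It is not Dask's implementation: the `lazy_apply` helper, its `meta` parameter, and the placeholder value 1 are assumptions chosen to mirror the behavior described above.

```python
# Toy model of metadata inference. This is NOT Dask's real code;
# lazy_apply and the placeholder value 1 are illustrative assumptions.

def aggregate(row):
    print('B val received: ' + str(row['B']))
    return row


def lazy_apply(rows, func, meta=None):
    """Apply func to every row, inferring the output columns if meta is None."""
    if meta is None:
        # Build a placeholder row with the same columns and call func on it
        # to learn what the output looks like. Side effects such as print()
        # fire here, before any real row is touched.
        dummy = {column: 1 for column in rows[0]}
        meta = list(func(dummy))
    # Only now is func applied to the real data.
    return meta, [func(row) for row in rows]


rows = [{'A': i, 'B': i, 'C': i} for i in range(5)]

# Without meta: an extra "B val received: 1" line is printed first.
columns, result = lazy_apply(rows, aggregate)

# With meta supplied up front, the placeholder call is skipped.
columns, result = lazy_apply(rows, aggregate, meta=['A', 'B', 'C'])
```

In Dask itself, passing an explicit meta argument to apply should likewise skip the inference step, and with it the extra calls on placeholder data.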
