Incompatibility of apply in dask and pandas dataframes


Problem Description

A sample of the triggers column in my Dask dataframe looks like the following:

0    [Total Traffic, DNS, UDP]
1                    [TCP RST]
2              [Total Traffic]
3                 [IP Private]
4                       [ICMP]
Name: triggers, dtype: object

I wish to create a one-hot encoded version of the above arrays (putting a 1 in the DNS column for the first row, for example) by doing the following. pop_triggers contains all possible values of triggers.

for trig in pop_triggers:
    df[trig] = df.triggers.apply(lambda x: 1 if trig in x else 0)

However, the Total Traffic, DNS, etc. columns all contain the value 0, not 1, for the relevant values. When I copy the data into a pandas dataframe and do the same operation, they get the expected values.

a = df[[ 'Total Traffic', 'UDP', 'NTP Amplification', 'triggers', 'ICMP']].head()
for trig in pop_triggers:
    a[trig] = a.triggers.apply(lambda x: 1 if trig in x else 0)

What am I missing here? Is it because dask is lazy that somehow it's not filling in the values as expected?

Edit 1: I investigated some of the places where the flag was set in the first place (which turned out to be far fewer than I expected) and got some really weird results. See below:

df2 = df[df['Total Traffic']==1]
df2[['triggers']+pop_triggers].head()

Output:

        triggers    Total Traffic   UDP DNS
9380    [ICMP, IP null, IP Private, TCP null, TCP SYN,...   1   1   1
9388    [ICMP, IP null, IP Private, TCP null, TCP SYN,...   1   1   1
19714   [ICMP, IP null, IP Private, UDP, NTP Amplifica...   1   1   1
21556   [IP null]   1   1   1
21557   [IP null]   1   1   1

Could this be a bug?

Edit 2: A minimal working example:

triggers = [['Total Traffic', 'DNS', 'UDP'],['TCP RST'],['Total Traffic'],['IP Private'],['ICMP']]*10
df2 = dd.from_pandas(pd.DataFrame({'triggers':triggers}), npartitions=16)
pop_triggers= ['Total Traffic', 'UDP', 'DNS', 'TCP SYN', 'TCP null', 'ICMP']
for trig in pop_triggers:
    df2[trig] = df2.triggers.apply(lambda x: 1 if trig in x else 0)
df2.head()

Output:

triggers    Total Traffic   UDP DNS TCP SYN TCP null    ICMP
0   [Total Traffic, DNS, UDP]   0   0   0   0   0   0
1   [TCP RST]   0   0   0   0   0   0
2   [Total Traffic] 0   0   0   0   0   0
3   [IP Private]    0   0   0   0   0   0

Note: I am far more concerned about the Dask side of things than the Pandas side.

Recommended Answer

In my experience, apply in dask works much better with explicit metadata. There is some functionality that lets dask try to guess the metadata, but I found it slow and not always reliable, and the guidance is to specify meta explicitly anyway.
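For illustration, here is a minimal sketch of what an explicit meta looks like for a Series apply (the example data, column name, and dtype below are placeholders, not taken from the question):

import pandas as pd
import dask.dataframe as dd

pdf_demo = pd.DataFrame({'triggers': [['DNS', 'UDP'], ['ICMP']]})
ddf_demo = dd.from_pandas(pdf_demo, npartitions=2)

# meta declares the output column name and dtype up front,
# so dask does not have to guess them from a trial run
dns_flag = ddf_demo.triggers.apply(lambda x: 1 if 'DNS' in x else 0,
                                   meta=('DNS', 'int64'))
print(dns_flag.compute())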

Another point, in my experience, is that assign works better than df[col] = .... I'm not sure whether that is a bug, a limitation, or a misuse on my side (I looked into it a while ago and I don't think it's a bug).

Note that the first pattern below does not work: the trig value captured for the earlier columns in the loop gets updated to the later values, so at compute time every column ends up with the result for the last value only!

It's not a bug, but a combination of the computation not happening immediately and the lambda closing over the loop variable, which is not evaluated until later. See this discussion for why it doesn't work.
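The same late-binding effect can be reproduced in plain Python, without dask; the following is only an illustrative sketch of the closure behavior, not code from the question:

funcs = []
for trig in ['Total Traffic', 'DNS', 'UDP']:
    # each lambda captures the variable trig, not its value at this iteration
    funcs.append(lambda x: 1 if trig in x else 0)

# by the time the lambdas are called, the loop is over and trig is 'UDP'
print([f(['Total Traffic']) for f in funcs])  # [0, 0, 0]
print([f(['UDP']) for f in funcs])            # [1, 1, 1]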

My pattern for you would then be:

cols = {}
for trig in pop_triggers:
    meta = (trig, int)
    cols[trig] = df.triggers.apply(lambda x: 1 if trig in x else 0, meta=meta)
df = df.assign(**cols)

Correct pattern:

(Sorry, I hadn't tested this before; I use the same pattern myself, except that I don't use the loop variable inside the applied function, so I never ran into this behavior.)

cols = {}

for trig in pop_triggers:
    meta = (trig, int)

    def fn(x, t):
        return 1 if t in x else 0

    # passing trig through args binds its current value immediately,
    # instead of looking it up lazily at compute time via the closure
    cols[trig] = ddf.triggers.apply(fn, args=(trig,), meta=meta)
ddf = ddf.assign(**cols)
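For example, reusing df2 and pop_triggers from the minimal example in Edit 2, the corrected pattern should now put 1s in the expected columns (a quick sketch of the check):

cols = {}
for trig in pop_triggers:
    def fn(x, t):
        return 1 if t in x else 0

    cols[trig] = df2.triggers.apply(fn, args=(trig,), meta=(trig, int))
df2 = df2.assign(**cols)

# row 0 ([Total Traffic, DNS, UDP]) should now show 1 under
# Total Traffic, UDP and DNS, and 0 in the remaining columns
print(df2.head())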

