Assign (add) a new column to a dask dataframe based on values of 2 existing columns - involves a conditional statement


Question

I would like to add a new column to an existing Dask dataframe based on the values of two existing columns, using a conditional statement that checks for nulls:

DataFrame definition

import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, "", 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)

Attempt 1

def funcUpdate(row):
    if row['y'].isnull():
        return row['y']
    else:
        return  round((1 + row['x'])/(1+ 1/row['y']),4)

ddf = ddf.assign(z= ddf.apply(funcUpdate, axis=1 , meta = ddf))

This raises the error:

TypeError: Column assignment doesn't support type DataFrame

Attempt 2

ddf = ddf.assign(z = ddf.apply(lambda col: col.y if col.y.isnull() else  round((1 + col.x)/(1+ 1/col.y),4),axis = 1, meta = ddf))

Any idea how this should be done?

Recommended answer

You can either use fillna (fast) or apply (slow but flexible).

import pandas as pd

import dask.dataframe as dd
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
ddf = dd.from_pandas(df, npartitions=2)

ddf['z'] = ddf.y.fillna((100 + ddf.x))

>>> df

   x      y
0  1  0.200
1  2    NaN
2  3  0.345
3  4  0.400
4  5  0.150

>>> ddf.compute()

   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150

Of course, in this case your function returns y when y is null, so the result would be null as well. I'm assuming you didn't intend that, so I changed the output slightly.

As any Pandas expert will tell you, using apply comes with a 10x to 100x slowdown penalty. Beware.

That being said, the flexibility is useful. Your example almost works, except that you are providing improper metadata. By passing meta = ddf you are telling apply that the function produces a dataframe, when in fact I think your function was intended to produce a series. You can have Dask guess the meta information for you (although it will complain) or you can specify the dtype explicitly. Both options are shown in the example below:

In [1]: import pandas as pd
   ...: 
   ...: import dask.dataframe as dd
   ...: df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [0.2, None, 0.345, 0.40, 0.15]})
   ...: ddf = dd.from_pandas(df, npartitions=2)
   ...: 

In [2]: def func(row):
   ...:     if pd.isnull(row['y']):
   ...:         return row['x'] + 100
   ...:     else:
   ...:         return row['y']
   ...:     

In [3]: ddf['z'] = ddf.apply(func, axis=1)
/home/mrocklin/Software/anaconda/lib/python3.4/site-packages/dask/dataframe/core.py:2553: UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result
  warnings.warn(msg)

In [4]: ddf.compute()
Out[4]: 
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150

In [5]: ddf['z'] = ddf.apply(func, axis=1, meta=float)

In [6]: ddf.compute()
Out[6]: 
   x      y        z
0  1  0.200    0.200
1  2    NaN  102.000
2  3  0.345    0.345
3  4  0.400    0.400
4  5  0.150    0.150
