Vectorize conditional assignment in pandas dataframe
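Problem description
I have a dataframe df with column x and want to create column y based on the values of x, following this pseudocode:
if df['x'] < -2 then df['y'] = 1
else if df['x'] > 2 then df['y'] = -1
else df['y'] = 0
How would I achieve this? I assume np.where is the best way to do it, but I am not sure how to code it correctly.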
One simple method is to assign the default value first and then make two loc calls:
In [66]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [0, -3, 5, -1, 1]})
df
Out[66]:
   x
0  0
1 -3
2  5
3 -1
4  1
In [69]:
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
df
Out[69]:
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0
If you want to use np.where, you can nest one np.where call inside another:
In [77]:
df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
df
Out[77]:
   x  y
0  0  0
1 -3  1
2  5 -1
3 -1  0
4  1  0
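Here the outer np.where tests the first condition and returns 1 where x is less than -2. Everywhere else, the inner np.where tests the second condition, returning -1 where x is greater than 2 and 0 otherwise.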
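With more than two conditions the nesting can become hard to read. As a sketch of one alternative that is not part of the original answer, NumPy's np.select takes a list of conditions, a matching list of values, and a default. The helper name classify is hypothetical and used only for illustration:

```python
import numpy as np
import pandas as pd


def classify(x: pd.Series) -> np.ndarray:
    """Return 1 where x < -2, -1 where x > 2, and 0 elsewhere."""
    conditions = [x < -2, x > 2]  # evaluated in order; the first match wins
    choices = [1, -1]             # value paired with each condition
    return np.select(conditions, choices, default=0)


df = pd.DataFrame({'x': [0, -3, 5, -1, 1]})
df['y'] = classify(df['x'])
print(df['y'].tolist())  # [0, 1, -1, 0, 0]
```

The result matches the nested np.where output above; which form reads better is largely a matter of taste when there are only two conditions.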
Timings
In [79]:
%timeit df['y'] = np.where(df['x'] < -2 , 1, np.where(df['x'] > 2, -1, 0))
1000 loops, best of 3: 1.79 ms per loop
In [81]:
%%timeit
df['y'] = 0
df.loc[df['x'] < -2, 'y'] = 1
df.loc[df['x'] > 2, 'y'] = -1
100 loops, best of 3: 3.27 ms per loop
So for this sample dataset the np.where method is roughly 1.8 times as fast (1.79 ms versus 3.27 ms per loop).
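The ratio may well differ on larger frames, so it can be worth re-measuring on data closer to your own. The following is a minimal sketch using the standard timeit module outside IPython. The row count, the random seed, and the function names with_where and with_loc are assumptions made for illustration, and no particular result is implied:

```python
import timeit

import numpy as np
import pandas as pd

# Assumed size and seed, chosen only for illustration.
n_rows = 100_000
rng = np.random.default_rng(0)
big = pd.DataFrame({'x': rng.integers(-5, 6, size=n_rows)})


def with_where(frame: pd.DataFrame) -> pd.DataFrame:
    """Nested np.where approach, applied to a copy of the frame."""
    out = frame.copy()
    out['y'] = np.where(out['x'] < -2, 1, np.where(out['x'] > 2, -1, 0))
    return out


def with_loc(frame: pd.DataFrame) -> pd.DataFrame:
    """Default value followed by two loc assignments, applied to a copy."""
    out = frame.copy()
    out['y'] = 0
    out.loc[out['x'] < -2, 'y'] = 1
    out.loc[out['x'] > 2, 'y'] = -1
    return out


# Both approaches should agree before their speed is compared.
assert (with_where(big)['y'] == with_loc(big)['y']).all()

# Each measurement includes the cost of frame.copy(), which is the same for both.
for func in (with_where, with_loc):
    seconds = timeit.timeit(lambda: func(big), number=20)
    print(f"{func.__name__}: {seconds / 20 * 1e3:.3f} ms per call")
```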