Efficiently replace values from a column to another column in a Pandas DataFrame


Problem description


I have a Pandas DataFrame like the following one:

   col1 col2 col3
1   0.2  0.3  0.3
2   0.2  0.3  0.3
3     0  0.4  0.4
4     0    0  0.3
5     0    0    0
6   0.1  0.4  0.4

I want to replace the col1 values with the values in the second column (col2) only where col1 is equal to 0, and then, for the zero values that remain, do the same again with the third column (col3). The desired result is the following:

   col1 col2 col3
1   0.2  0.3  0.3
2   0.2  0.3  0.3
3   0.4  0.4  0.4
4   0.3    0  0.3
5     0    0    0
6   0.1  0.4  0.4

I did it using the replace function, but it seems too slow. I think there must be a faster way to accomplish that.

df.col1.replace(0,df.col2,inplace=True)
df.col1.replace(0,df.col3,inplace=True)

Is there a faster way to do that, using some other function instead of replace? Maybe slicing df.col1?

PS1. Additional info: the original goal of the problem is to replace the zero values (in col1) with the statistical mode of some groups of ids. That is why I have col2 (the best mode, the first option for replacement) and col3 (the last mode, still functional but not as desirable as col2).

Solution

Using np.where is faster. Following the same two-step pattern you used with replace:

df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])

However, using a nested np.where is slightly faster:

df['col1'] = np.where(df['col1'] == 0, 
                      np.where(df['col2'] == 0, df['col3'], df['col2']),
                      df['col1'])
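As a quick sanity check, here is a minimal, self-contained sketch that runs both variants on the sample data from the question and confirms they agree. The DataFrame construction and the names split and nested are my own, added only for illustration:

```python
import numpy as np
import pandas as pd

# Sample data from the question, rebuilt by hand (index 1..6).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)

# Two-step version: fall back to col2 first, then to col3 for remaining zeros.
split = df.copy()
split["col1"] = np.where(split["col1"] == 0, split["col2"], split["col1"])
split["col1"] = np.where(split["col1"] == 0, split["col3"], split["col1"])

# Nested version: pick the fallback (col2, or col3 if col2 is zero) in one pass.
nested = df.copy()
nested["col1"] = np.where(
    nested["col1"] == 0,
    np.where(nested["col2"] == 0, nested["col3"], nested["col2"]),
    nested["col1"],
)

print(nested)
```

Both copies should end up with the same col1, matching the desired result in the question; row 5 stays at 0 because all three columns are zero there.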

Timings

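I used the following setup to produce a larger sample DataFrame and the functions to time:

import numpy as np
import pandas as pd

# Repeat the 6-row example 10,000 times (60,000 rows).
df = pd.concat([df]*10**4, ignore_index=True)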

def root_nested(df):
    df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
    return df

def root_split(df):
    df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
    df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
    return df

def pir2(df):
    df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
    return df

def pir2_2(df):
    slc = (df.values != 0).argmax(axis=1)
    return df.values[np.arange(slc.shape[0]), slc]

def andrew(df):
    df.col1[df.col1 == 0] = df.col2
    df.col1[df.col1 == 0] = df.col3
    return df

def pablo(df):
    df['col1'] = df['col1'].replace(0,df['col2'])
    df['col1'] = df['col1'].replace(0,df['col3'])
    return df

I get the following timings:

%timeit root_nested(df.copy())
100 loops, best of 3: 2.25 ms per loop

%timeit root_split(df.copy())
100 loops, best of 3: 2.62 ms per loop

%timeit pir2(df.copy())
100 loops, best of 3: 6.25 ms per loop

%timeit pir2_2(df.copy())
1 loop, best of 3: 2.4 ms per loop

%timeit andrew(df.copy())
100 loops, best of 3: 8.55 ms per loop

I tried timing your method (pablo above) as well, but it ran for several minutes without completing. For comparison, timing your method on just the 6-row example DataFrame (not the much larger one tested above) took 12.8 ms.
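The %timeit lines above need IPython. If you want to rerun the comparison from a plain script, a rough equivalent with the standard timeit module might look like the sketch below. The names small and big, and the number and repeat values, are arbitrary choices of mine, and the figures you get will depend on your machine and your pandas/NumPy versions:

```python
import timeit

import numpy as np
import pandas as pd

# 6-row sample from the question, repeated 10,000 times (60,000 rows).
small = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)
big = pd.concat([small] * 10**4, ignore_index=True)


def root_nested(df):
    df["col1"] = np.where(
        df["col1"] == 0,
        np.where(df["col2"] == 0, df["col3"], df["col2"]),
        df["col1"],
    )
    return df


def root_split(df):
    df["col1"] = np.where(df["col1"] == 0, df["col2"], df["col1"])
    df["col1"] = np.where(df["col1"] == 0, df["col3"], df["col1"])
    return df


for func in (root_nested, root_split):
    # As with %timeit func(df.copy()), each run includes the cost of the copy.
    runs = timeit.repeat(lambda: func(big.copy()), number=20, repeat=3)
    best = min(runs) / 20
    print(f"{func.__name__}: {best * 1e3:.2f} ms per loop")
```

Taking the minimum over the repeats mirrors the "best of 3" figure that %timeit reported above.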
