Efficiently replace values from a column to another column Pandas DataFrame
I have a Pandas DataFrame like the following one:
       col1  col2  col3
    1   0.2   0.3   0.3
    2   0.2   0.3   0.3
    3     0   0.4   0.4
    4     0     0   0.3
    5     0     0     0
    6   0.1   0.4   0.4
I want to replace the col1 values with the values in the second column (col2) only if the col1 values are equal to 0, and afterwards (for the zero values remaining) do it again with the third column (col3). The desired result is the following:

       col1  col2  col3
    1   0.2   0.3   0.3
    2   0.2   0.3   0.3
    3   0.4   0.4   0.4
    4   0.3     0   0.3
    5     0     0     0
    6   0.1   0.4   0.4
I did it using the pd.replace function, but it seems too slow. I think there must be a faster way to accomplish that.

    df.col1.replace(0,df.col2,inplace=True)
    df.col1.replace(0,df.col3,inplace=True)
Is there a faster way to do that, using some other function instead of the pd.replace function? Maybe slicing df.col1?

PS1. Additional info: the original goal of the problem is to replace the zero values (in col1) with the statistical mode of some groups of ids. That is why I have col2 (best mode, first option of replacement) and col3 (last mode, still functional but not as desirable as col2).

Solution

Using np.where is faster. Using a similar pattern as you used with replace:

    df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
    df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
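As a minimal sketch, the two-step pattern can be run end to end on the six-row example from the question. The way the DataFrame is constructed here is an assumption (the question only shows the printed table); the two np.where lines are the ones from the answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table (index 1..6).
# The construction itself is an assumption; only the values come from the post.
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)

# Step 1: where col1 is 0, take col2; otherwise keep col1.
df["col1"] = np.where(df["col1"] == 0, df["col2"], df["col1"])
# Step 2: where col1 is still 0, take col3; otherwise keep col1.
df["col1"] = np.where(df["col1"] == 0, df["col3"], df["col1"])

print(df["col1"].tolist())  # [0.2, 0.2, 0.4, 0.3, 0.0, 0.1]
```

Row 5 stays at 0 because all three columns are 0 there, which matches the desired result in the question.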
However, using a nested np.where is slightly faster:

    df['col1'] = np.where(df['col1'] == 0,
                          np.where(df['col2'] == 0, df['col3'], df['col2']),
                          df['col1'])
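The nested form should give the same values as the two-step form, because after the first step col1 can only still be 0 where both col1 and col2 were 0. A quick check on made-up data (the random frame below is hypothetical and not from the original post) might look like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical test data: values drawn from {0, 0.1, ..., 0.4}, so zeros are common.
df = pd.DataFrame(
    rng.integers(0, 5, size=(1000, 3)) / 10,
    columns=["col1", "col2", "col3"],
)

# Two-step version: fill from col2, then fill what is still zero from col3.
split = np.where(df["col1"] == 0, df["col2"], df["col1"])
split = np.where(split == 0, df["col3"], split)

# Nested version: pick the fallback (col2, else col3) in a single pass.
nested = np.where(
    df["col1"] == 0,
    np.where(df["col2"] == 0, df["col3"], df["col2"]),
    df["col1"],
)

print(np.array_equal(split, nested))  # True
```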
Timings
Using the following setup to produce a larger sample DataFrame and timing functions:
    import numpy as np
    import pandas as pd

    # df is the 6-row example DataFrame from the question, repeated 10**4 times
    df = pd.concat([df]*10**4, ignore_index=True)

    def root_nested(df):
        df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
        return df

    def root_split(df):
        df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
        df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
        return df

    def pir2(df):
        df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
        return df

    def pir2_2(df):
        slc = (df.values != 0).argmax(axis=1)
        return df.values[np.arange(slc.shape[0]), slc]

    def andrew(df):
        df.col1[df.col1 == 0] = df.col2
        df.col1[df.col1 == 0] = df.col3
        return df

    def pablo(df):
        df['col1'] = df['col1'].replace(0,df['col2'])
        df['col1'] = df['col1'].replace(0,df['col3'])
        return df
I get the following timings:
    %timeit root_nested(df.copy())
    100 loops, best of 3: 2.25 ms per loop

    %timeit root_split(df.copy())
    100 loops, best of 3: 2.62 ms per loop

    %timeit pir2(df.copy())
    100 loops, best of 3: 6.25 ms per loop

    %timeit pir2_2(df.copy())
    1 loop, best of 3: 2.4 ms per loop

    %timeit andrew(df.copy())
    100 loops, best of 3: 8.55 ms per loop
I tried timing your method, but it's been running for multiple minutes without completing. As a comparison, timing your method on just the 6 row example DataFrame (not the much larger one tested above) took 12.8 ms.
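The %timeit magic only exists in IPython/Jupyter. Outside of it, a rough equivalent can be sketched with the standard-library timeit module. This is an illustration under assumptions, not the setup used for the numbers above: the DataFrame construction and the loop counts are chosen here, and only the two np.where functions are included. Absolute timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd


def root_nested(df):
    df["col1"] = np.where(
        df["col1"] == 0,
        np.where(df["col2"] == 0, df["col3"], df["col2"]),
        df["col1"],
    )
    return df


def root_split(df):
    df["col1"] = np.where(df["col1"] == 0, df["col2"], df["col1"])
    df["col1"] = np.where(df["col1"] == 0, df["col3"], df["col1"])
    return df


# 6-row example from the question (construction is an assumption),
# repeated 10**4 times as in the setup above.
small = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)
big = pd.concat([small] * 10**4, ignore_index=True)

number = 10  # calls per measurement (arbitrary choice)
for func in (root_nested, root_split):
    # Copy inside the timed call so every run starts from unmodified data.
    runs = timeit.repeat(lambda: func(big.copy()), number=number, repeat=3)
    print(f"{func.__name__}: {min(runs) / number * 1e3:.2f} ms per loop")
```

As in the original timings, the cost of df.copy() is included in each measurement, so the figures are an upper bound on the replacement step itself.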