Efficiently replace values from a column to another column Pandas DataFrame
Question
I have a Pandas DataFrame like this:
col1 col2 col3
1 0.2 0.3 0.3
2 0.2 0.3 0.3
3 0 0.4 0.4
4 0 0 0.3
5 0 0 0
6 0.1 0.4 0.4
I want to replace the col1 values with the values in the second column (col2) only where col1 is equal to 0, and afterwards, for the zero values that remain, do it again with the third column (col3). The desired result is:
col1 col2 col3
1 0.2 0.3 0.3
2 0.2 0.3 0.3
3 0.4 0.4 0.4
4 0.3 0 0.3
5 0 0 0
6 0.1 0.4 0.4
I did it using the pd.replace function, but it seems too slow. I think there must be a faster way to accomplish this.
df.col1.replace(0,df.col2,inplace=True)
df.col1.replace(0,df.col3,inplace=True)
Is there a faster way to do this, using some other function instead of pd.replace?
Recommended answer
Using np.where is faster. Following a pattern similar to the one you used with replace:
df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
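For reference, here is a minimal self-contained sketch of the two-step pattern above, run on the six-row sample from the question. The DataFrame construction is an assumption about how the sample was built (the question only shows the printed table).

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's printed table (index 1..6).
df = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    },
    index=range(1, 7),
)

# First pass: where col1 is 0, take the value from col2.
df["col1"] = np.where(df["col1"] == 0, df["col2"], df["col1"])
# Second pass: where col1 is still 0, take the value from col3.
df["col1"] = np.where(df["col1"] == 0, df["col3"], df["col1"])

print(df["col1"].tolist())  # [0.2, 0.2, 0.4, 0.3, 0.0, 0.1]
```

The printed column matches col1 in the desired result: row 5 stays 0 because all three columns are zero there.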
However, using a nested np.where is slightly faster:
df['col1'] = np.where(df['col1'] == 0,
                      np.where(df['col2'] == 0, df['col3'], df['col2']),
                      df['col1'])
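As a quick sanity check that the nested form gives the same result as the two-step form, the sketch below compares both on random data. The test DataFrame and the variable names (`split`, `fallback`, `nested`) are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical test data: values drawn from {0.0, 0.1, ..., 0.4} so zeros are frequent.
df = pd.DataFrame(
    rng.integers(0, 5, size=(1000, 3)) / 10,
    columns=["col1", "col2", "col3"],
)

# Two-step version: fill from col2 first, then from col3.
split = np.where(df["col1"] == 0, df["col2"], df["col1"])
split = np.where(split == 0, df["col3"], split)

# Nested version: build the fallback (col2, or col3 where col2 is 0) once,
# then use it wherever col1 is 0.
fallback = np.where(df["col2"] == 0, df["col3"], df["col2"])
nested = np.where(df["col1"] == 0, fallback, df["col1"])

print(np.array_equal(split, nested))  # True
```

The nested form only writes to col1 once, which is a plausible reason for the small speed difference reported below.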
Timings
Using the following setup to produce a larger sample DataFrame and timing functions:
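import numpy as np
import pandas as pd

# Repeat the 6-row sample 10,000 times (60,000 rows).
df = pd.concat([df]*10**4, ignore_index=True)

def root_nested(df):
    df['col1'] = np.where(df['col1'] == 0, np.where(df['col2'] == 0, df['col3'], df['col2']), df['col1'])
    return df

def root_split(df):
    df['col1'] = np.where(df['col1'] == 0, df['col2'], df['col1'])
    df['col1'] = np.where(df['col1'] == 0, df['col3'], df['col1'])
    return df

# Mask zeros as NaN, back-fill across columns, then restore 0 where every column was zero.
def pir2(df):
    df['col1'] = df.where(df.ne(0), np.nan).bfill(axis=1).col1.fillna(0)
    return df

# Pick the first non-zero value in each row; returns a NumPy array, not a DataFrame.
def pir2_2(df):
    slc = (df.values != 0).argmax(axis=1)
    return df.values[np.arange(slc.shape[0]), slc]

# Chained assignment as originally posted; newer pandas versions may warn about or disallow this pattern.
def andrew(df):
    df.col1[df.col1 == 0] = df.col2
    df.col1[df.col1 == 0] = df.col3
    return df

# The replace-based approach from the question.
def pablo(df):
    df['col1'] = df['col1'].replace(0,df['col2'])
    df['col1'] = df['col1'].replace(0,df['col3'])
    return df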
I get the following timings:
%timeit root_nested(df.copy())
100 loops, best of 3: 2.25 ms per loop
%timeit root_split(df.copy())
100 loops, best of 3: 2.62 ms per loop
%timeit pir2(df.copy())
100 loops, best of 3: 6.25 ms per loop
%timeit pir2_2(df.copy())
1 loop, best of 3: 2.4 ms per loop
%timeit andrew(df.copy())
100 loops, best of 3: 8.55 ms per loop
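The numbers above come from IPython's %timeit. Outside IPython, a similar comparison can be sketched with the standard-library timeit module. This is only an illustration under assumptions: the sample DataFrame is rebuilt from the question's table, only the two np.where variants are timed, and the `results` dict and the repeat counts are arbitrary choices. Actual timings depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Rebuild the question's 6-row sample, then repeat it 10,000 times (60,000 rows).
base = pd.DataFrame(
    {
        "col1": [0.2, 0.2, 0.0, 0.0, 0.0, 0.1],
        "col2": [0.3, 0.3, 0.4, 0.0, 0.0, 0.4],
        "col3": [0.3, 0.3, 0.4, 0.3, 0.0, 0.4],
    }
)
big = pd.concat([base] * 10**4, ignore_index=True)

def root_nested(df):
    df["col1"] = np.where(
        df["col1"] == 0,
        np.where(df["col2"] == 0, df["col3"], df["col2"]),
        df["col1"],
    )
    return df

def root_split(df):
    df["col1"] = np.where(df["col1"] == 0, df["col2"], df["col1"])
    df["col1"] = np.where(df["col1"] == 0, df["col3"], df["col1"])
    return df

results = {}
for func in (root_nested, root_split):
    # Time 10 calls per repeat on a fresh copy each call; keep the best of 3 repeats.
    runs = timeit.repeat(lambda: func(big.copy()), number=10, repeat=3)
    results[func.__name__] = min(runs) / 10
    print(f"{func.__name__}: {results[func.__name__] * 1e3:.2f} ms per loop")
```

As in the original timings, each call works on `big.copy()`, so the cost of the copy is included in every measurement.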
I tried timing your method, but it's been running for multiple minutes without completing. As a comparison, timing your method on just the 6 row example DataFrame (not the much larger one tested above) took 12.8 ms.