Efficiently replace part of value from one column with value from another column in pandas using regex?


Question

I have a pandas DataFrame df with dates stored as strings:

Date1        Date2
2017-08-31   1970-01-01 17:35:00
2017-10-31   1970-01-01 15:00:00
2017-11-30   1970-01-01 16:30:00
2017-10-31   1970-01-01 16:00:00
2017-10-31   1970-01-01 16:12:00

I want to replace the date part of each value in the Date2 column with the corresponding date from Date1, leaving the time untouched, so that the output is:

Date1        Date2
2017-08-31   2017-08-31 17:35:00
2017-10-31   2017-10-31 15:00:00
2017-11-30   2017-11-30 16:30:00
2017-10-31   2017-10-31 16:00:00
2017-10-31   2017-10-31 16:12:00

I have achieved this using pandas replace and a regex, like so:

import re
date_reg = re.compile(r"([0-9]{4}\-[0-9]{2}\-[0-9]{2})")
# Date2 is the column shown in the sample data above
df['Date2'].replace(to_replace=date_reg, value=df['Date1'], inplace=True)

But this method is very slow (more than 10 minutes) for a DataFrame with only 150k rows.

The solution from this post uses NumPy's np.where, which is much faster. How can I use np.where in this example, or is there another, more efficient way to perform this operation?

Recommended answer

One idea is a list comprehension that keeps only the time part of Date2 and prepends Date1:

df['Date3'] =  ['{} {}'.format(a, b.split()[1]) for a, b in zip(df['Date1'], df['Date2'])]

Or, with vectorized string methods:

df['Date3'] = df['Date1'] + ' ' + df['Date2'].str.split().str[1]
print (df)
        Date1                Date2                Date3
0  2017-08-31  1970-01-01 17:35:00  2017-08-31 17:35:00
1  2017-10-31  1970-01-01 15:00:00  2017-10-31 15:00:00
2  2017-11-30  1970-01-01 16:30:00  2017-11-30 16:30:00
3  2017-10-31  1970-01-01 16:00:00  2017-10-31 16:00:00
4  2017-10-31  1970-01-01 16:12:00  2017-10-31 16:12:00
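
To try the two string-based approaches outside the original session, here is a minimal, self-contained sketch. The sample frame is rebuilt from the rows shown in the question, and the variable names via_loop and via_str are only illustrative. It assumes every Date2 value has the form "date time" separated by whitespace.

```python
import pandas as pd

# Sample data copied from the question; both columns hold plain strings.
df = pd.DataFrame({
    "Date1": ["2017-08-31", "2017-10-31", "2017-11-30", "2017-10-31", "2017-10-31"],
    "Date2": ["1970-01-01 17:35:00", "1970-01-01 15:00:00", "1970-01-01 16:30:00",
              "1970-01-01 16:00:00", "1970-01-01 16:12:00"],
})

# Approach 1: list comprehension. split()[1] is the time token of Date2.
via_loop = ["{} {}".format(a, b.split()[1]) for a, b in zip(df["Date1"], df["Date2"])]

# Approach 2: vectorized string methods. .str[1] picks the second token per row.
via_str = df["Date1"] + " " + df["Date2"].str.split().str[1]

df["Date3"] = via_str
print(df)
```

Both approaches should produce the same strings; neither uses a regex, which is the main reason they avoid the cost of the replace call in the question.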

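Or, to get datetime values rather than strings:

import pandas as pd

# Parse Date1 as a date and add the time part of Date2 as a timedelta
df['Date3'] = pd.to_datetime(df['Date1']) + pd.to_timedelta(df['Date2'].str.split().str[1])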
print (df)
        Date1                Date2               Date3
0  2017-08-31  1970-01-01 17:35:00 2017-08-31 17:35:00
1  2017-10-31  1970-01-01 15:00:00 2017-10-31 15:00:00
2  2017-11-30  1970-01-01 16:30:00 2017-11-30 16:30:00
3  2017-10-31  1970-01-01 16:00:00 2017-10-31 16:00:00
4  2017-10-31  1970-01-01 16:12:00 2017-10-31 16:12:00
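
The datetime variant can be checked the same way. This is a sketch under the assumption that the time part is always in HH:MM:SS form, which pd.to_timedelta can parse; the name time_part is illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Date1": ["2017-08-31", "2017-10-31", "2017-11-30"],
    "Date2": ["1970-01-01 17:35:00", "1970-01-01 15:00:00", "1970-01-01 16:30:00"],
})

# Extract the time token ("17:35:00") from each Date2 string.
time_part = df["Date2"].str.split().str[1]

# Date (midnight) plus time-of-day offset gives a proper datetime column.
df["Date3"] = pd.to_datetime(df["Date1"]) + pd.to_timedelta(time_part)
print(df.dtypes)
```

Unlike the string versions, Date3 here is a datetime column, so it can be sorted, resampled, or compared without further parsing.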

Timings:

In [302]: %timeit df['Date3'] =  ['{} {}'.format(a, b.split()[1]) for a, b in zip(df['Date1'], df['Date2'])]
30.2 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [303]: %timeit df['Date3'] = df['Date1'] + ' ' + df['Date2'].str.split().str[1]
66.4 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
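
The timings above come from IPython's %timeit. A rough equivalent with the standard timeit module is sketched below. The answer does not state the size of the frame it timed, so the 150,000 rows used here are an assumption taken from the question, and the helper names list_comp and str_methods are illustrative. Absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Five sample rows from the question, repeated to 150,000 rows (assumed size).
base = pd.DataFrame({
    "Date1": ["2017-08-31", "2017-10-31", "2017-11-30", "2017-10-31", "2017-10-31"],
    "Date2": ["1970-01-01 17:35:00", "1970-01-01 15:00:00", "1970-01-01 16:30:00",
              "1970-01-01 16:00:00", "1970-01-01 16:12:00"],
})
df = pd.concat([base] * 30_000, ignore_index=True)


def list_comp():
    # Pure-Python loop over both columns.
    return ["{} {}".format(a, b.split()[1]) for a, b in zip(df["Date1"], df["Date2"])]


def str_methods():
    # pandas string accessor version.
    return df["Date1"] + " " + df["Date2"].str.split().str[1]


for fn in (list_comp, str_methods):
    seconds = timeit.timeit(fn, number=3) / 3  # average over 3 calls
    print(f"{fn.__name__}: {seconds * 1000:.1f} ms per call")
```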
