Efficiently convert timezones in pandas dataframe
Question
I have a large pandas dataframe (tens of millions of rows) which includes a column for UTC time and the time zone. I want to create a column which contains the local time for the row, based on these two other columns.
My original attempt used df.apply, which worked on the small sample I was testing on, but it is very slow and isn't good enough to work on the whole data:
df['LoginTimeLocal'] = \
    df.apply(lambda row: row.LoginTimeUtc.tz_localize('UTC').tz_convert(row.TimeZoneCode), axis=1)
This results in a new column being added which contains a datetime in local time, with timezone information.
I came across this answer which provides an efficient, vectorized way to do something similar. I re-purposed this code to do what I want, but it appears to only work if the new column only contains dates with the same time zone (or no time zone information). Here's my code:
# localize all utc dates
df['LoginTimeUtc'] = df['LoginTimeUtc'].dt.tz_localize('UTC')
# initialize LoginTimeLocal column (probably not necessary)
df['LoginTimeLocal'] = df['LoginTimeUtc']
# for every time zone in the data
for tz in df.TimeZoneCode.unique():
    mask = (df.TimeZoneCode == tz)
    # make entries in a new column with converted timezone
    df.loc[mask, 'LoginTimeLocal'] = \
        df.loc[mask, 'LoginTimeLocal'].dt.tz_convert(tz)
If I run this on a sample of the data that only contains dates from one timezone (i.e., len(df.TimeZoneCode.unique()) == 1), it works fine. As soon as there are two or more timezones in the dataframe, I get a ValueError: incompatible or non tz-aware value.
Can anyone see what is going wrong here?
Recommended answer
Demo:
Source DF:
In [11]: df
Out[11]:
             datetime         time_zone
0 2016-09-19 01:29:13    America/Bogota
1 2016-09-19 02:16:04  America/New_York
2 2016-09-19 01:57:54      Africa/Cairo
3 2016-09-19 11:00:00    America/Bogota
4 2016-09-19 12:00:00  America/New_York
5 2016-09-19 13:00:00      Africa/Cairo
Solution:
In [12]: df['new'] = df.groupby('time_zone')['datetime'] \
                       .transform(lambda x: x.dt.tz_localize(x.name))
In [13]: df
Out[13]:
             datetime         time_zone                 new
0 2016-09-19 01:29:13    America/Bogota 2016-09-19 06:29:13
1 2016-09-19 02:16:04  America/New_York 2016-09-19 06:16:04
2 2016-09-19 01:57:54      Africa/Cairo 2016-09-18 23:57:54
3 2016-09-19 11:00:00    America/Bogota 2016-09-19 16:00:00
4 2016-09-19 12:00:00  America/New_York 2016-09-19 16:00:00
5 2016-09-19 13:00:00      Africa/Cairo 2016-09-19 11:00:00
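Note that the demo above starts from naive local times and the 'new' column shows the corresponding UTC values. The question asks for the opposite direction (UTC to local), so here is a minimal sketch of that, under some assumptions: the sample data is made up, the column names are taken from the question, and the result is stored as naive local wall-clock time. That last choice is deliberate: a tz-aware datetime column in pandas carries a single time zone in its dtype, so mixing several zones in one column is likely what triggers the ValueError in the question.

```python
import pandas as pd

# Hypothetical sample data; column names follow the question.
df = pd.DataFrame({
    "LoginTimeUtc": pd.to_datetime([
        "2016-09-19 06:29:13",
        "2016-09-19 06:16:04",
        "2016-09-18 23:57:54",
    ]),
    "TimeZoneCode": ["America/Bogota", "America/New_York", "Africa/Cairo"],
})

# Mark the naive timestamps as UTC once, for the whole column.
utc = df["LoginTimeUtc"].dt.tz_localize("UTC")

# Convert one time zone at a time (vectorized within each group), then drop
# the tz info so every piece has the same naive datetime dtype.
parts = [
    utc.loc[idx].dt.tz_convert(tz).dt.tz_localize(None)
    for tz, idx in df.groupby("TimeZoneCode").groups.items()
]

# pd.concat keeps the original index labels, so assignment aligns row by row.
df["LoginTimeLocal"] = pd.concat(parts)
print(df)
```

The loop runs once per distinct time zone rather than once per row, which should scale much better than a row-wise df.apply. If the zone information has to stay attached to each value, the alternative is an object column of individual tz-aware Timestamps, at the cost of losing the .dt accessor and vectorized operations.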