根据当地时间计算24小时周期内每分钟的平均销售额(HH:MM) [英] calculate mean sales per minute accross 24-hour cycles as per local-time (HH:MM)

查看:101
本文介绍了根据当地时间计算24小时周期内每分钟的平均销售额(HH:MM)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在此示例中,我们以1分钟的分辨率采样了两天的数据,共进行了2880次测量.分别依次跨多个时区收集测量值:欧洲/伦敦的前240分钟,美国/洛杉矶"的其余2640个测量值.

In this example we have two days of data sampled at a resolution of 1min, giving us 2880 measurements. The measurements are collected across multiple timezones sequentially: the first 240 minutes in Europe/London and the remaining 2640 measurements in 'America/Los_Angeles'.

import pandas as pd
import numpy as np
df=pd.DataFrame(index=pd.DatetimeIndex(pd.date_range('2015-03-29 00:00','2015-03-30 23:59',freq='1min',tz='UTC')))
df.loc['2015-03-29 00:00':'2015-03-29 04:00','timezone']='Europe/London'
df.loc['2015-03-29 04:00':'2015-03-30 23:59','timezone']='America/Los_Angeles'
df['sales1']=np.random.random_integers(100,size=len(df))
df['sales2']=np.random.random_integers(10,size=len(df))

要计算多天的24小时周期内每分钟的平均销售额(按照UTC时间),以下方法可以很好地发挥作用:

To calculate mean sales per minute (as per UTC time) in a 24-hour cycle across multiple days the following approach works nicely:

utc_sales=df.groupby([df.index.hour,df.index.minute]).mean()
utc_sales.set_index(pd.date_range("00:00","23:59", freq="1min").time,inplace=True)

这种分组方法也可以用于基于其他两个时区之一(例如欧洲/伦敦")计算平均销售额.

This groupby approach can also be applied to calculate mean sales based on one of the other two timezones, say 'Europe/London'.

df['London']=df.index.tz_convert('Europe/London')
london_sales=df.groupby([df['London'].dt.hour,df['London'].dt.minute]).mean()
london_sales.set_index(pd.date_range("00:00","23:59", freq="1min").time,inplace=True)

但是,我正在努力提出一种有效的方法来计算24小时周期内每分钟的平均销售额(按当地时间).我从上面尝试了相同的方法,但是当同一系列中存在多个时区时,groupby会恢复为utc中的索引.

However I'm struggling to come up with an efficient way of calculating mean sales per minute -as per localtime- in a 24-hour cycle. I attempted the same approach from above, however when multiple timezones are present in the same series, groupby reverts back to the index which is in utc.

def calculate_localtime(x):
    return pd.to_datetime(x.name,unit='s').tz_convert(x['timezone'])
df['localtime']=df.apply(calculate_localtime,axis=1)
local_sales=df.groupby([df['localtime'].dt.hour,df['localtime'].dt.minute]).mean()
local_sales.set_index(pd.date_range("00:00","23:59",freq="1min").time,inplace=True)

我们可以验证local_sales与utc_sales相同,因此这种方法行不通.

We can verify that local_sales are identical to utc_sales, and therefore this approach does not work.

In [8]: np.unique(local_sales == utc_sales)
Out[8]: array([ True], dtype=bool)

有人可以推荐一种适用于大型数据集和多个时区的方法吗?

Could anyone recommend an approach suitable for large datatsets and multiple timezones?

推荐答案

这里是一种获取我认为想要的方法.这需要熊猫0.17.0

Here is an approach to get what I think you want. This requires pandas 0.17.0

一口气创建数据

import pandas as pd
import numpy as np

pd.options.display.max_rows=12
np.random.seed(1234)
df=pd.DataFrame(index=pd.DatetimeIndex(pd.date_range('2015-03-29 00:00','2015-03-30 23:59',freq='1min',tz='UTC')))
df.loc['2015-03-29 00:00':'2015-03-29 04:00','timezone']='Europe/London'
df.loc['2015-03-29 04:00':'2015-03-30 23:59','timezone']='America/Los_Angeles'
df['sales1']=np.random.random_integers(100,size=len(df))
df['sales2']=np.random.random_integers(10,size=len(df))

In [79]: df
Out[79]: 
                                      timezone  sales1  sales2
2015-03-29 00:00:00+00:00        Europe/London      48       6
2015-03-29 00:01:00+00:00        Europe/London      84       1
2015-03-29 00:02:00+00:00        Europe/London      39       1
2015-03-29 00:03:00+00:00        Europe/London      54      10
2015-03-29 00:04:00+00:00        Europe/London      77       5
2015-03-29 00:05:00+00:00        Europe/London      25       9
...                                        ...     ...     ...
2015-03-30 23:54:00+00:00  America/Los_Angeles      77       8
2015-03-30 23:55:00+00:00  America/Los_Angeles      16       4
2015-03-30 23:56:00+00:00  America/Los_Angeles      55       3
2015-03-30 23:57:00+00:00  America/Los_Angeles      18       1
2015-03-30 23:58:00+00:00  America/Los_Angeles       3       2
2015-03-30 23:59:00+00:00  America/Los_Angeles      52       2

[2880 rows x 3 columns]

根据时区进行透视;这会创建一个时区分开的多索引

Pivot according to the timezone; this creates a multi-index with the timezone separated

    x = pd.pivot_table(df.reset_index(),values=['sales1','sales2'],index='index',columns='timezone').swaplevel(0,1,axis=1)
    x.columns.names = ['timezone','sales']

In [82]: x
Out[82]: 
timezone                  America/Los_Angeles Europe/London America/Los_Angeles Europe/London
sales                                  sales1        sales1              sales2        sales2
index                                                                                        
2015-03-29 00:00:00+00:00                 NaN            48                 NaN             6
2015-03-29 00:01:00+00:00                 NaN            84                 NaN             1
2015-03-29 00:02:00+00:00                 NaN            39                 NaN             1
2015-03-29 00:03:00+00:00                 NaN            54                 NaN            10
2015-03-29 00:04:00+00:00                 NaN            77                 NaN             5
2015-03-29 00:05:00+00:00                 NaN            25                 NaN             9
...                                       ...           ...                 ...           ...
2015-03-30 23:54:00+00:00                  77           NaN                   8           NaN
2015-03-30 23:55:00+00:00                  16           NaN                   4           NaN
2015-03-30 23:56:00+00:00                  55           NaN                   3           NaN
2015-03-30 23:57:00+00:00                  18           NaN                   1           NaN
2015-03-30 23:58:00+00:00                   3           NaN                   2           NaN
2015-03-30 23:59:00+00:00                  52           NaN                   2           NaN

[2880 rows x 4 columns]

在本地区域中创建我们要使用的石斑鱼,即小时和分钟.我们将根据蒙版IOW填充它们.如果sales1/sales2都不为空,我们将使用该(本地)区域的小时/分钟

Create the groupers that we want to use, namely hours and minutes in the local zone. We are going to populate them according to the mask, IOW. where both sales1/sales2 are notnull, we will use the hours/minutes for that (local) zone

hours = pd.Series(index=x.index)
minutes = pd.Series(index=x.index)
for tz in ['America/Los_Angeles', 'Europe/London' ]:

   local = df.index.tz_convert(tz)
   x[(tz,'tz')] = local

   mask = x[(tz,'sales1')].notnull() & x[(tz,'sales2')].notnull()
   hours.iloc[mask.values] = local.hour[mask.values]
   minutes.iloc[mask.values] = local.minute[mask.values]

x = x.sortlevel(axis=1)

以上之后. (请注意,这可能有点简化,这意味着我们不需要实际记录本地时区,只需计算小时/分钟即可.)

After the above. (Note this could be a bit simplified, meaning that we don't need to actually record the local timezone, just compute hours/minutes).

Out[84]: 
timezone                  America/Los_Angeles                                  Europe/London                                 
sales                                  sales1 sales2                        tz        sales1 sales2                        tz
index                                                                                                                        
2015-03-29 00:00:00+00:00                 NaN    NaN 2015-03-28 17:00:00-07:00            48      6 2015-03-29 00:00:00+00:00
2015-03-29 00:01:00+00:00                 NaN    NaN 2015-03-28 17:01:00-07:00            84      1 2015-03-29 00:01:00+00:00
2015-03-29 00:02:00+00:00                 NaN    NaN 2015-03-28 17:02:00-07:00            39      1 2015-03-29 00:02:00+00:00
2015-03-29 00:03:00+00:00                 NaN    NaN 2015-03-28 17:03:00-07:00            54     10 2015-03-29 00:03:00+00:00
2015-03-29 00:04:00+00:00                 NaN    NaN 2015-03-28 17:04:00-07:00            77      5 2015-03-29 00:04:00+00:00
2015-03-29 00:05:00+00:00                 NaN    NaN 2015-03-28 17:05:00-07:00            25      9 2015-03-29 00:05:00+00:00
...                                       ...    ...                       ...           ...    ...                       ...
2015-03-30 23:54:00+00:00                  77      8 2015-03-30 16:54:00-07:00           NaN    NaN 2015-03-31 00:54:00+01:00
2015-03-30 23:55:00+00:00                  16      4 2015-03-30 16:55:00-07:00           NaN    NaN 2015-03-31 00:55:00+01:00
2015-03-30 23:56:00+00:00                  55      3 2015-03-30 16:56:00-07:00           NaN    NaN 2015-03-31 00:56:00+01:00
2015-03-30 23:57:00+00:00                  18      1 2015-03-30 16:57:00-07:00           NaN    NaN 2015-03-31 00:57:00+01:00
2015-03-30 23:58:00+00:00                   3      2 2015-03-30 16:58:00-07:00           NaN    NaN 2015-03-31 00:58:00+01:00
2015-03-30 23:59:00+00:00                  52      2 2015-03-30 16:59:00-07:00           NaN    NaN 2015-03-31 00:59:00+01:00

[2880 rows x 6 columns]

这使用时区(在0.17.0中)的新表示形式.

This uses the new representation for timezones (in 0.17.0).

In [85]: x.dtypes
Out[85]: 
timezone             sales 
America/Los_Angeles  sales1                                float64
                     sales2                                float64
                     tz        datetime64[ns, America/Los_Angeles]
Europe/London        sales1                                float64
                     sales2                                float64
                     tz              datetime64[ns, Europe/London]
dtype: object

结果

x.groupby([hours,minutes]).mean()

timezone America/Los_Angeles        Europe/London       
sales                 sales1 sales2        sales1 sales2
0  0                    62.5    5.5            48      6
   1                    52.0    7.0            84      1
   2                    89.0    3.5            39      1
   3                    67.5    6.5            54     10
   4                    41.0    5.5            77      5
   5                    81.0    5.5            25      9
...                      ...    ...           ...    ...
23 54                   76.5    4.5           NaN    NaN
   55                   37.5    5.0           NaN    NaN
   56                   60.5    8.0           NaN    NaN
   57                   87.5    7.0           NaN    NaN
   58                   77.5    6.0           NaN    NaN
   59                   31.0    5.5           NaN    NaN

[1440 rows x 4 columns]

这篇关于根据当地时间计算24小时周期内每分钟的平均销售额(HH:MM)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆