Any quick way to get correct aggregation output for time series data using pandas?


Question

I have used Redfin real-estate data that records monthly house sale prices over several years for each region in the Chicago area. First I want to calculate the yearly average house sale price change for the city; at the same time I want the yearly house sale price change for each region. Then I want to compare each region's yearly change with the city's average yearly change and introduce new columns with binary values (1, 0) for each region per year: 1 if the region's price change is greater than the city's average yearly price change, 0 otherwise.

For instance, between Feb 2012 and Feb 2013 the yearly house price change in the Austin region is 5% while the average yearly change for the Chicago area is 7%, so I would put the value 0 into the price_label column.
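The year-over-year comparison in this example can be sketched with `Series.pct_change`; a minimal sketch with made-up numbers (not the actual Redfin data):

```python
import pandas as pd

# hypothetical prices for one region at the two February readings
prices = pd.Series(
    [100.0, 105.0],
    index=pd.to_datetime(['2012-02-01', '2013-02-01']),
)

# year-over-year change between the two readings
yoy = prices.pct_change().iloc[-1]
print(f"{yoy:.0%}")  # 5%
```

If this 5% is below the city-wide average change (7% in the example), the region gets a 0 in `price_label`.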

How can I do this sort of aggregation easily for time series data? Is there any way to get this done?

I have posted my question several times and tried on my own in the meantime, but didn't get the correct output. Can anyone point me to the correct solution? Thanks.

Example data:

from pandas import Timestamp

dicts = {'Region': {0: 'Chicago, IL metro area',
  1: 'Chicago, IL',
  2: 'Chicago, IL - Albany Park',
  3: 'Chicago, IL - Andersonville'},
 Timestamp('2012-02-01 00:00:00'): {0: 88.4, 1: 95.1, 2: 76.8, 3: 193.4},
 Timestamp('2012-03-01 00:00:00'): {0: 93.3, 1: 103.6, 2: 77.9, 3: 169.2},
 Timestamp('2012-04-01 00:00:00'): {0: 97.6, 1: 120.4, 2: 80.9, 3: 157.3},
 Timestamp('2012-05-01 00:00:00'): {0: 102.0, 1: 130.6, 2: 98.4, 3: 156.8},
 Timestamp('2012-06-01 00:00:00'): {0: 110.7, 1: 150.8, 2: 109.8, 3: 175.4},
 Timestamp('2012-07-01 00:00:00'): {0: 109.3, 1: 133.6, 2: 102.6, 3: 188.8},
 Timestamp('2012-08-01 00:00:00'): {0: 106.9, 1: 140.5, 2: 89.0, 3: 194.8},
 Timestamp('2012-09-01 00:00:00'): {0: 103.4, 1: 137.5, 2: 87.5, 3: 206.9},
 Timestamp('2012-10-01 00:00:00'): {0: 98.5, 1: 121.4, 2: 98.7, 3: 209.2},
 Timestamp('2012-11-01 00:00:00'): {0: 97.8, 1: 125.0, 2: 94.1, 3: 211.5},
 Timestamp('2012-12-01 00:00:00'): {0: 97.1, 1: 120.9, 2: 93.3, 3: 183.8},
 Timestamp('2013-01-01 00:00:00'): {0: 94.4, 1: 110.7, 2: 89.4, 3: 181.4},
 Timestamp('2013-02-01 00:00:00'): {0: 91.1, 1: 104.8, 2: 95.1, 3: 177.2},
 Timestamp('2013-03-01 00:00:00'): {0: 94.7, 1: 123.0, 2: 94.9, 3: 180.6},
 Timestamp('2013-04-01 00:00:00'): {0: 100.9, 1: 126.8, 2: 101.4, 3: 203.0},
 Timestamp('2013-05-01 00:00:00'): {0: 109.3, 1: 156.1, 2: 127.9, 3: 218.0},
 Timestamp('2013-06-01 00:00:00'): {0: 116.8, 1: 165.2, 2: 125.0, 3: 218.0},
 Timestamp('2013-07-01 00:00:00'): {0: 120.8, 1: 168.2, 2: 120.8, 3: 220.3},
 Timestamp('2013-08-01 00:00:00'): {0: 119.8, 1: 164.7, 2: 113.6, 3: 208.3},
 Timestamp('2013-09-01 00:00:00'): {0: 114.2, 1: 158.5, 2: 115.3, 3: 209.7},
 Timestamp('2013-10-01 00:00:00'): {0: 116.0, 1: 156.9, 2: 127.9, 3: 205.4},
 Timestamp('2013-11-01 00:00:00'): {0: 110.0, 1: 135.3, 2: 130.5, 3: 215.0},
 Timestamp('2013-12-01 00:00:00'): {0: 112.6, 1: 146.0, 2: 126.6, 3: 212.5},
 Timestamp('2014-01-01 00:00:00'): {0: 105.2, 1: 127.9, 2: 112.3, 3: 205.7},
 Timestamp('2014-02-01 00:00:00'): {0: 104.2, 1: 126.9, 2: 106.7, 3: 202.9},
 Timestamp('2014-03-01 00:00:00'): {0: 107.1, 1: 138.5, 2: 114.3, 3: 200.0},
 Timestamp('2014-04-01 00:00:00'): {0: 114.8, 1: 155.9, 2: 119.3, 3: 210.9},
 Timestamp('2014-05-01 00:00:00'): {0: 120.1, 1: 179.4, 2: 134.5, 3: 215.4},
 Timestamp('2014-06-01 00:00:00'): {0: 126.4, 1: 186.8, 2: 141.5, 3: 225.5},
 Timestamp('2014-07-01 00:00:00'): {0: 127.2, 1: 187.5, 2: 152.1, 3: 225.5},
 Timestamp('2014-08-01 00:00:00'): {0: 128.8, 1: 186.1, 2: 156.9, 3: 222.1},
 Timestamp('2014-09-01 00:00:00'): {0: 122.2, 1: 183.3, 2: 145.1, 3: 213.2},
 Timestamp('2014-10-01 00:00:00'): {0: 120.8, 1: 161.6, 2: 147.7, 3: 228.8},
 Timestamp('2014-11-01 00:00:00'): {0: 116.7, 1: 151.3, 2: 144.4, 3: 226.3},
 Timestamp('2014-12-01 00:00:00'): {0: 117.2, 1: 154.0, 2: 145.1, 3: 238.8},
 Timestamp('2015-01-01 00:00:00'): {0: 113.4, 1: 145.8, 2: 137.2, 3: 221.6},
 Timestamp('2015-02-01 00:00:00'): {0: 108.7, 1: 139.8, 2: 140.7, 3: 232.0}}

The above is an example snippet of the time series data stored in a dictionary.

My attempt:

import numpy as np
import pandas as pd

df_ = pd.DataFrame(dicts)
df_ = df_.set_index('Region')
df_.columns = pd.to_datetime(df_.columns)

def ratio(df):
    return np.exp(np.log(df).diff()) - 1

keys = ['Region']
pd.concat([df_, df_.groupby('Region')[df_.columns].apply(ratio)],
          axis=1, keys=keys)

But the attempt above didn't return the expected aggregation result. What should I do? I have tried many ways but still don't get what I want. Can anyone point out how to make this right?

Update

Alternatively, I want to compare the monthly changes across the years with the yearly average changes for each region. Any idea how to make this aggregation happen? Thanks.

Expected output

I want a dataframe where the yearly house price percentage change of each individual district is added as a new column, plus a binary column: 1 if the district's price change is greater than the city's average yearly price change, 0 otherwise.

expected_output = pd.DataFrame({
    'Year': ['2012', '2013', '2014', '2015'] * 3,
    'Area': ['Chicago, IL metro area'] * 4
            + ['Chicago, IL'] * 4
            + ['Chicago, IL - Albany Park'] * 4,
    'yearly_price_change': ['5%', '10%', '7%', '21%',
                            '15%', '12%', '2%', '21%',
                            '10%', '11%', '12%', '6%'],
    'price_label': [0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0],
})

Any idea how to get this done? How can I get the correct aggregation, like my expected dataframe? Any thoughts? Thanks.

Answer

Here's my take:

# prepare the data frame
df = pd.DataFrame(dicts).set_index('Region')
df.columns = pd.to_datetime(df.columns)

df = df.stack().reset_index()
df.columns = ['Region', 'date', 'price']
df.head()

#    Region                  date                   price
#--  ----------------------  -------------------  -------
# 0  Chicago, IL metro area  2012-02-01 00:00:00     88.4
# 1  Chicago, IL metro area  2012-03-01 00:00:00     93.3
# 2  Chicago, IL metro area  2012-04-01 00:00:00     97.6
# 3  Chicago, IL metro area  2012-05-01 00:00:00    102
# 4  Chicago, IL metro area  2012-06-01 00:00:00    110.7
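An equivalent way to reach the same long format is `DataFrame.melt` instead of `stack`; a minimal sketch on a tiny two-column frame (not the full Redfin data):

```python
import pandas as pd

# tiny wide frame in the same shape: Region index, one column per month
wide = pd.DataFrame(
    {pd.Timestamp('2012-02-01'): [88.4], pd.Timestamp('2012-03-01'): [93.3]},
    index=pd.Index(['Chicago, IL metro area'], name='Region'),
)

# melt goes wide -> long in one call, naming the columns as it does so
long = (wide.reset_index()
            .melt(id_vars='Region', var_name='date', value_name='price'))
print(long)
```

Either route ends with one row per (Region, date) pair, which is what the groupby steps below need.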

# get the month-over-month price change within each region,
# as I understand it from the question
df['price_change'] = df.groupby('Region').price.transform(lambda x: x.diff().abs() / x)

# compute the mean change per region and year
new_df = df.groupby(['Region', df.date.dt.year])[['price_change']].mean()
new_df

#                                   price_change
# Region                      date
# Chicago, IL                 2012      0.082864
#                             2013      0.074208
#                             2014      0.063483
#                             2015      0.049580
# Chicago, IL - Albany Park   2012      0.074394
#                             2013      0.055192
#                             2014      0.056466
#                             2015      0.041228
# Chicago, IL - Andersonville 2012      0.066074
#                             2013      0.032249
#                             2014      0.030612
#                             2015      0.061222
# Chicago, IL metro area      2012      0.035153
#                             2013      0.040750
#                             2014      0.032648
#                             2015      0.038374

# here we compute the average across the years for each region:
# groupby(level=0) gathers all records of the same region (level 0 of the index).
# If you want the average across the regions for each year instead,
# change to groupby(level=1), i.e. gather all records of the same year.
new_df['price_label'] = (new_df.groupby(level=0)['price_change']
                               .transform(lambda x: (x > x.mean()).astype(int)))

new_df

Output:

+------------------------------+-------+---------------+-------------+
|                              |       | price_change  | price_label |
+------------------------------+-------+---------------+-------------+
| Region                       | date  |               |             |
+------------------------------+-------+---------------+-------------+
| Chicago, IL                  | 2012  | 0.082864      |           1 |
|                              | 2013  | 0.074208      |           1 |
|                              | 2014  | 0.063483      |           0 |
|                              | 2015  | 0.049580      |           0 |
| Chicago, IL - Albany Park    | 2012  | 0.074394      |           1 |
|                              | 2013  | 0.055192      |           0 |
|                              | 2014  | 0.056466      |           0 |
|                              | 2015  | 0.041228      |           0 |
| Chicago, IL - Andersonville  | 2012  | 0.066074      |           1 |
|                              | 2013  | 0.032249      |           0 |
|                              | 2014  | 0.030612      |           0 |
|                              | 2015  | 0.061222      |           1 |
| Chicago, IL metro area       | 2012  | 0.035153      |           0 |
|                              | 2013  | 0.040750      |           1 |
|                              | 2014  | 0.032648      |           0 |
|                              | 2015  | 0.038374      |           1 |
+------------------------------+-------+---------------+-------------+
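To reshape this MultiIndexed result into the flat layout of `expected_output` (`Year`, `Area`, `yearly_price_change` as a percentage string, `price_label`), `reset_index` plus a formatting step works; a minimal sketch using a two-row stand-in for `new_df`:

```python
import pandas as pd

# small stand-in for new_df: (Region, year) index with price_change / price_label
idx = pd.MultiIndex.from_tuples(
    [('Chicago, IL', 2012), ('Chicago, IL', 2013)],
    names=['Region', 'date'],
)
new_df = pd.DataFrame(
    {'price_change': [0.082864, 0.074208], 'price_label': [1, 1]}, index=idx
)

# flatten the index and format the change as a percent string
out = (new_df.reset_index()
             .rename(columns={'Region': 'Area', 'date': 'Year'}))
out['yearly_price_change'] = (out['price_change'] * 100).round(1).astype(str) + '%'
out = out[['Year', 'Area', 'yearly_price_change', 'price_label']]
print(out)
```

The rounding precision and column names here simply mirror the asker's expected frame; adjust to taste.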

I may have misunderstood something, but that's the gist :-).
