使用 pandas 将月度数据表转换为季节性时间序列 [英] Convert monthly data table to seasonal time series using pandas
问题描述
我有一些表格格式的数据,其中行是年,列是月.我想将其转换为 Pandas 中的时间序列格式,然后按季节平均值对数据进行分组(我特别想将冬季定义为 11 月至 3 月,并丢弃其他所有内容).这是我将数据读入表格格式的代码,因此您可以看到发生了什么:
将pandas导入为pdheaders = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']df = pd.read_csv('https://www.esrl.noaa.gov/psd/gcos_wgsp/Timeseries/Data/nino34.long.anom.data',delimiter='\s+', header=0, names=headers, skiprows=1, index_col=0)df = df.drop(df.tail(7).index)df
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec1871 -0.25 -0.58 -0.43 -0.50 -0.70 -0.53 -0.60 -0.33 -0.24 -0.33 -0.31 -0.581872 -0.72 -0.62 -0.50 -0.77 -0.62 -0.52 -0.32 -0.85 -1.02 -0.94 -0.79 -0.881873 -0.78 -1.01 -1.31 -0.67 -0.53 -0.48 -0.58 -0.39 -0.34 -0.78 -0.77 -0.701874 -0.93 -1.06 -1.40 -0.94 -0.86 -0.72 -1.00 -1.05 -1.13 -1.25 -1.33 -1.141875 -0.71 -0.37 -0.59 -0.87 -1.09 -0.76 -0.85 -0.81 -0.91 -0.83 -0.64 -0.751876 -0.95 -1.20 -1.13 -1.18 -1.08 -0.43 -0.34 -0.16 -0.02 0.11 0.15 0.231877 0.35 0.46 0.52 0.50 0.76 0.98 1.42 1.54 1.75 1.95 2.08 2.491878 2.41 2.43 1.31 0.92 0.82 0.92 0.25 -0.11 -0.32 -0.53 -0.70 -0.751879 -0.55 -0.18 -0.24 -0.37 -0.83 -0.67 -0.77 -0.69 -0.83 -0.93 -1.14 -1.021880 -1.00 -0.73 -0.62 -0.57 -0.71 -0.61 -0.53 -0.24 -0.03 0.17 0.24 0.181881 0.29 0.23 0.32 0.41 0.16 0.23 -0.26 -0.17 -0.33 -0.43 -0.59 -0.371882 -0.45 -0.55 -0.53 0.10 0.01 -0.40 -0.57 -0.33 -0.51 -0.65 -0.92 -0.75
我想要的结果是每个冬季 11 月至 3 月值的平均值的时间序列(例如,2018 年 11 月、2018 年 12 月、2019 年 1 月、2019 年 2 月和 2019 年 3 月的平均值).我猜这个过程包括首先将这些数据分类到一个时间序列中(我不确定如何做),然后按季节(11 月至 3 月)分组,然后丢弃其他月份.
给定如图所示的数据
- 年份是索引
- 查看
I have some data in table format, where the rows are years, and the columns are months. I would like to convert this to a time series format in pandas, and then group the data by seasonal mean (I would specifically like to define the winter season as November-March, and discard everything else). Here's the code I have that reads the data into the table format, so you can see what's going on:
import pandas as pd headers = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'] df = pd.read_csv('https://www.esrl.noaa.gov/psd/gcos_wgsp/Timeseries/Data/nino34.long.anom.data', delimiter='\s+', header=0, names=headers, skiprows=1, index_col=0) df = df.drop(df.tail(7).index) df
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 1871 -0.25 -0.58 -0.43 -0.50 -0.70 -0.53 -0.60 -0.33 -0.24 -0.33 -0.31 -0.58 1872 -0.72 -0.62 -0.50 -0.77 -0.62 -0.52 -0.32 -0.85 -1.02 -0.94 -0.79 -0.88 1873 -0.78 -1.01 -1.31 -0.67 -0.53 -0.48 -0.58 -0.39 -0.34 -0.78 -0.77 -0.70 1874 -0.93 -1.06 -1.40 -0.94 -0.86 -0.72 -1.00 -1.05 -1.13 -1.25 -1.33 -1.14 1875 -0.71 -0.37 -0.59 -0.87 -1.09 -0.76 -0.85 -0.81 -0.91 -0.83 -0.64 -0.75 1876 -0.95 -1.20 -1.13 -1.18 -1.08 -0.43 -0.34 -0.16 -0.02 0.11 0.15 0.23 1877 0.35 0.46 0.52 0.50 0.76 0.98 1.42 1.54 1.75 1.95 2.08 2.49 1878 2.41 2.43 1.31 0.92 0.82 0.92 0.25 -0.11 -0.32 -0.53 -0.70 -0.75 1879 -0.55 -0.18 -0.24 -0.37 -0.83 -0.67 -0.77 -0.69 -0.83 -0.93 -1.14 -1.02 1880 -1.00 -0.73 -0.62 -0.57 -0.71 -0.61 -0.53 -0.24 -0.03 0.17 0.24 0.18 1881 0.29 0.23 0.32 0.41 0.16 0.23 -0.26 -0.17 -0.33 -0.43 -0.59 -0.37 1882 -0.45 -0.55 -0.53 0.10 0.01 -0.40 -0.57 -0.33 -0.51 -0.65 -0.92 -0.75
The result I'm going for is a time series of the mean of the November-March values of each winter season (for example, mean of Nov 2018, Dec 2018, Jan 2019, Feb 2019 & March 2019). I'm guessing the process involves first sorting this data into a timeseries (which I am unsure about how to do) and then grouping by season (Nov-March), and dropping the other months.
解决方案Given the data as shown
- year is the index
- review
pandas.DataFrame.rolling
to specify the window
import pandas as pd import matplotlib.pyplot as plt import seaborn as sns # reset index df.reset_index(inplace=True) # rename the column to year df.rename(columns={'index': 'year'}, inplace=True) # melt df into a long format df2 = df.melt(id_vars='year', var_name='month') # create a datatime column and set it as index df2['date'] = pd.to_datetime(df2['year'].astype(str) + '-' + df2['month']) df2.set_index('date', inplace=True) # define a winter dataframe winter = df2[df2['month'].isin(['Nov', 'Dec', 'Jan', 'Feb', 'Mar'])].copy() # sort the index winter.sort_index(inplace=True)
winter dataframe
year month value date 1871-01-01 1871 Jan -0.25 1871-02-01 1871 Feb -0.58 1871-03-01 1871 Mar -0.43 1871-11-01 1871 Nov -0.31 1871-12-01 1871 Dec -0.58 1872-01-01 1872 Jan -0.72 1872-02-01 1872 Feb -0.62 1872-03-01 1872 Mar -0.50 1872-11-01 1872 Nov -0.79 1872-12-01 1872 Dec -0.88 1873-01-01 1873 Jan -0.78 1873-02-01 1873 Feb -1.01 1873-03-01 1873 Mar -1.31 1873-11-01 1873 Nov -0.77 1873-12-01 1873 Dec -0.70 1874-01-01 1874 Jan -0.93 1874-02-01 1874 Feb -1.06 1874-03-01 1874 Mar -1.40 1874-11-01 1874 Nov -1.33 1874-12-01 1874 Dec -1.14 1875-01-01 1875 Jan -0.71 1875-02-01 1875 Feb -0.37 1875-03-01 1875 Mar -0.59 1875-11-01 1875 Nov -0.64 1875-12-01 1875 Dec -0.75 1876-01-01 1876 Jan -0.95 1876-02-01 1876 Feb -1.20 1876-03-01 1876 Mar -1.13 1876-11-01 1876 Nov 0.15 1876-12-01 1876 Dec 0.23 1877-01-01 1877 Jan 0.35 1877-02-01 1877 Feb 0.46 1877-03-01 1877 Mar 0.52 1877-11-01 1877 Nov 2.08 1877-12-01 1877 Dec 2.49 1878-01-01 1878 Jan 2.41 1878-02-01 1878 Feb 2.43 1878-03-01 1878 Mar 1.31 1878-11-01 1878 Nov -0.70 1878-12-01 1878 Dec -0.75 1879-01-01 1879 Jan -0.55 1879-02-01 1879 Feb -0.18 1879-03-01 1879 Mar -0.24 1879-11-01 1879 Nov -1.14 1879-12-01 1879 Dec -1.02 1880-01-01 1880 Jan -1.00 1880-02-01 1880 Feb -0.73 1880-03-01 1880 Mar -0.62 1880-11-01 1880 Nov 0.24 1880-12-01 1880 Dec 0.18 1881-01-01 1881 Jan 0.29 1881-02-01 1881 Feb 0.23 1881-03-01 1881 Mar 0.32 1881-11-01 1881 Nov -0.59 1881-12-01 1881 Dec -0.37 1882-01-01 1882 Jan -0.45 1882-02-01 1882 Feb -0.55 1882-03-01 1882 Mar -0.53 1882-11-01 1882 Nov -0.92 1882-12-01 1882 Dec -0.75
- In this example, the winter of 1870 and 1882 are incomplete, comprised of 3 and 2 months, respectively.
- Those 5 dates will be dropped
- Compute the average for each season
# drop incomplete season at the beginning and end of the winter dataframe winter.drop([pd.Timestamp('1871-01-01'), pd.Timestamp('1871-02-01'), pd.Timestamp('1871-03-01'), pd.Timestamp('1882-11-01'), pd.Timestamp('1882-12-01')], inplace=True) # compute mean of season, every 5 rows yearly_avg = winter.groupby(winter.reset_index().index // 5)['value'].mean() # create label for season start data winter_year = pd.date_range(start='1871-11-01', periods=11, freq='12MS') # create winter_avg dataframe winter_avg = pd.DataFrame({'winter_start': winter_year, 'season_avg': yearly_avg}) winter_avg.set_index('winter_start', inplace=True) season_avg winter_start 1871-11-01 -0.546 1872-11-01 -0.954 1873-11-01 -0.972 1874-11-01 -0.828 1875-11-01 -0.934 1876-11-01 0.342 1877-11-01 2.144 1878-11-01 -0.484 1879-11-01 -0.902 1880-11-01 0.252 1881-11-01 -0.498
Plot the data:
# prevent future warning from pandas.plotting import register_matplotlib_converters register_matplotlib_converters() sns.lineplot(winter.index, winter['value'], label='all values') sns.scatterplot(winter_avg.index, winter_avg['season_avg'], label='season avg', color='orange') plt.ylabel('value') plt.xlabel('year') plt.show()
这篇关于使用 pandas 将月度数据表转换为季节性时间序列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!