缺少季度数据的 pandas 数据框中的滞后值和差异 [英] Lag values and differences in pandas dataframe with missing quarterly data
问题描述
尽管Pandas具有时序功能,但我仍在努力处理时序数据不完整的数据框.
Though Pandas has time series functionality, I am still struggling with dataframes that have incomplete time series data.
参见下图,下图具有完整的数据,上图具有间隙.两张图片都显示正确的值.红色是我要使用黑色数据计算的列. Cumm_Issd
列显示了当年累计发行的股票,MV
是市值.
See the pictures below, the lower picture has complete data, the upper has gaps. Both pics show correct values. In red are the columns that I want to calculate using the data in black. Column Cumm_Issd
shows the accumulated issued shares during the year, MV
is market value.
我想计算每个季度的已发行股数(IssdQtr
),市值的季度变化(D_MV_Q
)和去年的MV
(L_MV_Y
).
I want to calculate the issued shares per quarter (IssdQtr
), the quarterly change in Market Value (D_MV_Q
) and the MV
of last year (L_MV_Y
).
请参见此链接以获取基本的cvs数据,以获取全部数据和此信息链接.有两个公司1020180
和1020201
.
See for underlying cvs data this link for the full data and this link for the gapped data. There are two firms 1020180
and 1020201
.
但是,当我尝试使用熊猫shift
方法时,如果存在间隙,它将失败,请尝试使用csv文件和下面的代码进行尝试.所有列(DiffEq, Dif1MV, Lag4MV
)与IssdQtr, D_MV_Q, L_MV_Y
有所不同-分别有几个季度.
However, when I try Pandas shift
method it fails when there are gaps, try yourself using the csv files and the code below. All columns (DiffEq, Dif1MV, Lag4MV
) differ - for some quarters - from IssdQtr, D_MV_Q, L_MV_Y
, respectively.
有没有办法使用熊猫来处理数据中的空白?
Are there ways to deal with gaps in data using Pandas?
import pandas as pd
import numpy as np
import os
dfg = pd.read_csv('example_soverflow_gaps.csv',low_memory=False)
dfg['date'] = pd.to_datetime(dfg['Period'], format='%Y%m%d')
dfg['Q'] = pd.DatetimeIndex(dfg['date']).to_period('Q')
dfg['year'] = dfg['date'].dt.year
dfg['DiffEq'] = dfg.sort_values(['Q']).groupby(['Firm','year'])['Cumm_Issd'].diff()
dfg['Dif1MV'] = dfg.groupby(['Firm'])['MV'].diff(1)
dfg['Lag4MV'] = dfg.groupby(['Firm'])['MV'].shift(4)
差距数据:
完整数据:
推荐答案
通过合并解决了基本问题.首先,创建一个显示滞后日期或季度的变量.在这里,我们要获取去年的MV(返回4个季度):
Solved the basic problem by using a merge. First, create a variable that shows the lagged date or quarter. Here we want last year's MV (4 quarters back):
from pandas.tseries.offsets import QuarterEnd
dfg['lagQ'] = dfg['date'] + QuarterEnd(-4)
然后使用键(固定和日期)和相关变量(此处为MV)创建一个数据框.
Then create a data-frame with the keys (Firm and date) and the relevant variable (here MV).
lagset=dfg[['Firm','date', 'MV']].copy()
lagset.rename(columns={'MV':'Lag_MV', 'date':'lagQ'}, inplace=True)
最后,将新框架合并到现有框架中:
Lastly, merge the new frame into the existing one:
dfg=pd.merge(dfg, lagset, on=['Firm', 'lagQ'], how='left')
这篇关于缺少季度数据的 pandas 数据框中的滞后值和差异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!