Pandas row analysis for consecutive dates


Question





Following a "chain" of rows and counting the consecutive months from a CSV file.

Currently I am reading a CSV file with 5 columns of interest (based on insurance policies):

CONTRACT_ID   START_DATE           END_DATE        CANCEL_FLAG    OLD_CON_ID
123456        2015-05-30           2016-05-30       0             8788
123457        2014-03-20           2015-03-20       0             12000
123458        2009-12-20           2010-12-20       0             NaN
...

I want to count the number of consecutive months a Contract chain goes for.

Example: take the START_DATE from the contract at the "front" of the chain (the oldest contract) and the END_DATE from the contract at the end of the chain (the newest contract). The oldest contract is defined as either the one before a cancelled contract in a chain or the one that has no OLD_CON_ID value.

Each row represents a contract, and OLD_CON_ID points to the previous contract's ID. If that column is empty, the row is the first contract in its chain. The desired output is the number of months a contract chain goes back until a gap (i.e. a period during which the customer had no contract).

CANCEL_FLAG should also cut the chain because a value of 1 designates that the contract was cancelled.
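To make the chain rules concrete, here is a minimal sketch that walks OLD_CON_ID links backwards for one contract. The sample data, the `chain_root` helper, and the reading that a cancelled predecessor ends the chain are assumptions for illustration, not part of the original post. Date gaps are not checked here.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
contracts = pd.DataFrame({
    "CONTRACT_ID": [1, 2, 3, 4, 5],
    "START_DATE": pd.to_datetime(
        ["2012-01-01", "2013-01-01", "2014-01-01", "2014-06-01", "2015-06-01"]),
    "END_DATE": pd.to_datetime(
        ["2013-01-01", "2014-01-01", "2015-01-01", "2015-06-01", "2016-06-01"]),
    "CANCEL_FLAG": [0, 0, 0, 1, 0],
    "OLD_CON_ID": [None, 1, 2, None, 4],
})

# Plain dictionaries: contract ID -> previous contract ID (or None), and -> cancelled?
parent = {
    cid: (None if pd.isna(old) else int(old))
    for cid, old in zip(contracts["CONTRACT_ID"], contracts["OLD_CON_ID"])
}
cancelled = dict(zip(contracts["CONTRACT_ID"], contracts["CANCEL_FLAG"] == 1))


def chain_root(contract_id, parent, cancelled):
    """Follow OLD_CON_ID links back until the chain is cut; return (root_id, length)."""
    current, length = contract_id, 1
    seen = {current}
    while True:
        prev = parent.get(current)
        # Stop at: no predecessor, a predecessor missing from the file,
        # a cancelled predecessor (assumed to cut the chain), or a cycle.
        if prev is None or prev not in parent or cancelled[prev] or prev in seen:
            return current, length
        current = prev
        seen.add(current)
        length += 1


print(chain_root(3, parent, cancelled))
print(chain_root(5, parent, cancelled))
```

Contract 3 resolves back through 2 to 1, while contract 5 stays on its own because its predecessor 4 is cancelled.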

Current code counts the number of active contracts for each year by editing the dataframe like so:

# Keep contracts that were active on 2015-05-31 and have not been cancelled.
snapshot = pd.to_datetime('2015-05-31')
df_contract = df_contract[
    (df_contract['START_DATE'] <= snapshot) &
    (df_contract['END_DATE'] >= snapshot) &
    (df_contract['CANCEL_FLAG'] == 0)
]
activecount = df_contract.count()
print(activecount['CONTRACT_ID'])
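Because the filter above overwrites df_contract, repeating it for another year needs a fresh copy of the data. One possible alternative, sketched here with a hypothetical `active_count` helper and made-up sample rows, is to build a boolean mask per snapshot date and leave the dataframe untouched.

```python
import pandas as pd


def active_count(df, as_of):
    """Number of non-cancelled contracts whose term covers the date `as_of`."""
    as_of = pd.to_datetime(as_of)
    mask = (
        (df["START_DATE"] <= as_of)
        & (df["END_DATE"] >= as_of)
        & (df["CANCEL_FLAG"] == 0)
    )
    return int(mask.sum())


# Hypothetical sample data using the question's column names.
sample = pd.DataFrame({
    "CONTRACT_ID": [1, 2, 3, 4],
    "START_DATE": pd.to_datetime(["2013-06-01", "2014-06-01", "2014-09-01", "2014-12-01"]),
    "END_DATE": pd.to_datetime(["2014-06-01", "2015-06-01", "2015-09-01", "2015-12-01"]),
    "CANCEL_FLAG": [0, 0, 1, 0],
})

# One count per year, each measured on 31 May as in the question.
counts = {year: active_count(sample, f"{year}-05-31") for year in (2014, 2015)}
print(counts)
```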

Here are the first 6 lines of code in which I create the dataframes and adjust the datetime values:

import pandas as pd

file_name = 'EXAMPLENAME.csv'
df = pd.read_csv(file_name)
df_contract = pd.read_csv(file_name)
df_CUSTOMERS = pd.read_csv(file_name)

# Convert the date columns from strings to datetime64 values.
df_contract['START_DATE'] = pd.to_datetime(df_contract['START_DATE'])
df_contract['END_DATE'] = pd.to_datetime(df_contract['END_DATE'])
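As a side note, the date conversion can also happen while the file is read. The sketch below assumes a comma-separated file with the column names from the question and uses an in-memory string in place of EXAMPLENAME.csv. The nullable `Int64` dtype for OLD_CON_ID is an optional choice that keeps the IDs as integers even when some are missing.

```python
import io

import pandas as pd

# Stand-in for the real CSV file; the layout is an assumption.
csv_text = """CONTRACT_ID,START_DATE,END_DATE,CANCEL_FLAG,OLD_CON_ID
123456,2015-05-30,2016-05-30,0,8788
123457,2014-03-20,2015-03-20,0,12000
123458,2009-12-20,2010-12-20,0,
"""

df_contract = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["START_DATE", "END_DATE"],  # parse both columns as datetimes
    dtype={"OLD_CON_ID": "Int64"},           # nullable integer, missing stays <NA>
)
print(df_contract.dtypes)
```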

Ideal output is something like:

FIRST_CONTRACT_ID       CHAIN_LENGTH       CON_MONTHS
1234567                 5                  60
1500001                 1                  4
800                     10                 180

Those data points would then be graphed.
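For the CON_MONTHS column, one way to count whole calendar months between two dates is sketched below. The `whole_months` helper and the sample dates are hypothetical, and the rule of dropping an incomplete final month is an assumption about how months should be counted.

```python
import pandas as pd


def whole_months(start, end):
    """Whole calendar months from `start` to `end` (both datetime Series)."""
    months = (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)
    # Do not count the final month if the end day falls before the start day.
    return months - (end.dt.day < start.dt.day).astype(int)


start = pd.to_datetime(pd.Series(["2010-05-30", "2015-01-15"]))
end = pd.to_datetime(pd.Series(["2015-05-30", "2015-05-15"]))
print(whole_months(start, end).tolist())
```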

EDIT2: CSV file changed, might be easier now. Question updated.

Solution

After a lot of trial and error I got it working!

This finds the time difference between the first and last contracts in the chain and finds the length of the chain.

Not the cleanest code by far, but it works:

import pandas as pd

test = 'START_DATE'

# Assumption: df_policy is the contracts dataframe with parsed date columns.
# PID = previous (parent) contract ID, CID = this contract's ID.
df_short = df_policy[['OLD_CON_ID', test, 'CONTRACT_ID']].rename(
    columns={'OLD_CON_ID': 'PID', 'CONTRACT_ID': 'CID'})
# PED = end date of each contract, merged back in at the end.
df_test = df_policy[['CONTRACT_ID', 'END_DATE']].rename(
    columns={'CONTRACT_ID': 'CID', 'END_DATE': 'PED'})

df_copy1 = df_short.copy()
# Re-key a copy so each contract can be looked up as another contract's parent.
df_copy2 = df_short.rename(columns={'PID': 'PPID', 'CID': 'PID'})

df_merge1 = pd.merge(df_short, df_copy2,
    how='left',
    on=['PID'])

# Contracts without a parent fall back to their own start date.
df_merge1['START_DATE_y'] = df_merge1['START_DATE_y'].fillna(df_merge1['START_DATE_x'])
df_merge1 = df_merge1.rename(columns={'START_DATE_x': '1_EFF', 'START_DATE_y': '2_EFF'})

The copy, merge, fillna, and rename code is repeated for 5 merged dataframes then:

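df_merged = pd.merge(df_merge5, df_test,
    how='right',
    on=['CID'])

# Months from the earliest start date found (6_EFF) to this contract's end (PED).
# 30.436875 days is the average month (365.2425 / 12), the factor that
# np.timedelta64(1, 'M') stood for; recent pandas versions reject that unit.
df_merged['TOTAL_MONTHS'] = (df_merged['PED'] - df_merged['6_EFF']).dt.days / 30.436875

# Keep only rows whose contract is still running on 2015-07-06.
df_merged4 = df_merged[
    (df_merged['PED'] >= pd.to_datetime('2015-07-06'))].copy()
# Count the distinct values left in each row once the date columns are dropped;
# the -3 offset is taken unchanged from the original code.
df_merged4['CHAIN_LENGTH'] = df_merged4.drop(
    ['PED', '1_EFF', '2_EFF', '3_EFF', '4_EFF', '5_EFF'], axis=1
).apply(lambda row: len(pd.unique(row)), axis=1) - 3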
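The five hand-written merges cap the chain depth at whatever was typed out. As a possible alternative (not the code from the answer above), the same look-up can run in a loop until no contract moves any further back. The `chain_summary` function, its `max_depth` guard, and the sample data are assumptions for illustration. The sketch treats a cancelled predecessor as the end of a chain and does not detect date gaps.

```python
import pandas as pd


def chain_summary(df, max_depth=100):
    """Resolve each contract's chain start by repeated OLD_CON_ID look-ups."""
    by_id = df.set_index("CONTRACT_ID")  # assumes CONTRACT_ID values are unique
    first = pd.Series(by_id.index, index=by_id.index)  # current chain start
    length = pd.Series(1, index=by_id.index)
    for _ in range(max_depth):  # guard against cycles in the data
        parent = first.map(by_id["OLD_CON_ID"])
        has_parent = parent.isin(by_id.index)
        # Where there is no usable parent, point back at the contract itself.
        parent_id = parent.where(has_parent, first).astype("int64")
        parent_cancelled = parent_id.map(by_id["CANCEL_FLAG"]) == 1
        step = has_parent & ~parent_cancelled
        if not step.any():
            break
        first = first.where(~step, parent_id)
        length = length + step.astype("int64")
    start = first.map(by_id["START_DATE"])
    # Average month length of 30.436875 days (365.2425 / 12).
    months = (by_id["END_DATE"] - start).dt.days / 30.436875
    return pd.DataFrame({
        "FIRST_CONTRACT_ID": first,
        "CHAIN_LENGTH": length,
        "CON_MONTHS": months.round(1),
    })


# Hypothetical sample data using the question's column names.
contracts = pd.DataFrame({
    "CONTRACT_ID": [1, 2, 3, 4, 5],
    "START_DATE": pd.to_datetime(
        ["2012-01-01", "2013-01-01", "2014-01-01", "2014-06-01", "2015-06-01"]),
    "END_DATE": pd.to_datetime(
        ["2013-01-01", "2014-01-01", "2015-01-01", "2015-06-01", "2016-06-01"]),
    "CANCEL_FLAG": [0, 0, 0, 1, 0],
    "OLD_CON_ID": [float("nan"), 1, 2, float("nan"), 4],
})

summary = chain_summary(contracts)
print(summary)
```

The result has one row per contract, indexed by CONTRACT_ID. Filtering it down to the newest contract of each chain would give a table shaped like the ideal output in the question.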

Hopefully the code is understandable and will help someone in the future.
