如何跟踪 pandas 数据框中的前一个日期记录列? [英] How to keep track of previous date record column in pandas dataframe?
问题描述
此问题是从我想对pandas Dataframe进行一些数据分析.我有一个如下数据框:
I want to perform some data analysis on pandas Dataframe. I have one dataframe like below:
derived_symbol sport_name person_name city \
0 football.RAM.mumbai.ram_count football RAM mumbai
1 football.RAM.mumbai.mum_count football RAM mumbai
2 football.RAM.delhi.mum_count football RAM delhi
3 football.RAM.delhi.ram_count football RAM delhi
4 football.RAM.mumbai.ram_count football RAM mumbai
5 football.RAM.mumbai.mum_count football RAM mumbai
6 football.RAM.delhi.mum_count football RAM delhi
7 football.RAM.delhi.ram_count football RAM delhi
8 basketball.MAH.pune.mah_count basketball MAH pune
9 basketball.MAH.nagpur.mah_count basketball MAH nagpur
10 basketball.MAH.TOTAL.mah_count basketball MAH No Entry
11 basketball.MAH.TOTAL.nagpur_count basketball MAH nagpur
12 basketball.MAH.TOTAL.pune_count basketball MAH pune
13 football.RAM.TOTAL.delhi_count football RAM delhi
14 football.RAM.TOTAL.delhi_count football RAM delhi
15 football.RAM.TOTAL.mum_count football RAM No Entry
16 football.RAM.TOTAL.mum_count football RAM No Entry
17 football.RAM.TOTAL.mumbai_count football RAM mumbai
18 football.RAM.TOTAL.mumbai_count football RAM mumbai
19 football.RAM.TOTAL.ram_count football RAM No Entry
20 football.RAM.TOTAL.ram_count football RAM No Entry
person_symbol month sir person_count
0 ram 2017-01-23 a 10
1 mum 2017-01-23 a 14
2 mum 2017-01-23 a 25
3 ram 2017-01-23 a 20
4 ram 2017-02-22 b 34
5 mum 2017-02-22 b 23
6 mum 2017-02-22 b 43
7 ram 2017-02-22 b 34
8 mah 2017-03-03 c 10
9 mah 2017-03-03 c 20
10 mah 2017-03-03 c 30
11 No Entry 2017-03-03 c 20
12 No Entry 2017-03-03 c 10
13 No Entry 2017-01-23 a 45
14 No Entry 2017-02-22 b 77
15 mum 2017-01-23 a 39
16 mum 2017-02-22 b 66
17 No Entry 2017-01-23 a 24
18 No Entry 2017-02-22 b 57
19 ram 2017-01-23 a 30
20 ram 2017-02-22 b 68
我想向此数据框添加previous_person_count列.此数据框的月"列包含"yyyy-mm-dd"格式的日期.因此,我们需要查看月份,即"mm"字段,以确定它是哪个月份.
I want to add previous_person_count column to this Dataframe. "month" column of this dataframe contains date in the format of "yyyy-mm-dd". So we need to look at the month i.e. "mm" field to identify which month it is.
看到这个月,我们需要将"person_count"值放入下个月的"previous_person_count"值.
By seeing this month we need to put "person_count" value into the next month's "previous_person_count" value.
超量输出:
derived_symbol sport_name person_name city \
0 football.RAM.mumbai.ram_count football RAM mumbai
1 football.RAM.mumbai.mum_count football RAM mumbai
2 football.RAM.delhi.mum_count football RAM delhi
3 football.RAM.delhi.ram_count football RAM delhi
4 football.RAM.mumbai.ram_count football RAM mumbai
5 football.RAM.mumbai.mum_count football RAM mumbai
6 football.RAM.delhi.mum_count football RAM delhi
7 football.RAM.delhi.ram_count football RAM delhi
8 basketball.MAH.pune.mah_count basketball MAH pune
9 basketball.MAH.nagpur.mah_count basketball MAH nagpur
10 basketball.MAH.TOTAL.mah_count basketball MAH No Entry
11 basketball.MAH.TOTAL.nagpur_count basketball MAH nagpur
12 basketball.MAH.TOTAL.pune_count basketball MAH pune
13 football.RAM.TOTAL.delhi_count football RAM delhi
14 football.RAM.TOTAL.delhi_count football RAM delhi
15 football.RAM.TOTAL.mum_count football RAM No Entry
16 football.RAM.TOTAL.mum_count football RAM No Entry
17 football.RAM.TOTAL.mumbai_count football RAM mumbai
18 football.RAM.TOTAL.mumbai_count football RAM mumbai
19 football.RAM.TOTAL.ram_count football RAM No Entry
20 football.RAM.TOTAL.ram_count football RAM No Entry
person_symbol month sir person_count previous_person_count
0 ram 2017-01-23 a 10 0
1 mum 2017-01-23 a 14 0
2 mum 2017-01-23 a 25 0
3 ram 2017-01-23 a 20 0
4 ram 2017-02-22 b 34 10
5 mum 2017-02-22 b 23 14
6 mum 2017-02-22 b 43 25
7 ram 2017-02-22 b 34 20
8 mah 2017-03-03 c 10 0
9 mah 2017-03-03 c 20 0
10 mah 2017-03-03 c 30 0
11 No Entry 2017-03-03 c 20 0
12 No Entry 2017-03-03 c 10 0
13 No Entry 2017-01-23 a 45 0
14 No Entry 2017-02-22 b 77 45
15 mum 2017-01-23 a 39 0
16 mum 2017-02-22 b 66 39
17 No Entry 2017-01-23 a 24 0
18 No Entry 2017-02-22 b 57 24
19 ram 2017-01-23 a 30 0
20 ram 2017-02-22 b 68 30
编辑参考代码:
df = pd.DataFrame({'sport_name': ['football','football','football','football','football','football','football','football','basketball','basketball'],
'person_name': ['ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','ramesh','mahesh','mahesh'],
'city': ['mumbai', 'mumbai','delhi','delhi','mumbai', 'mumbai','delhi','delhi','pune','nagpur'],
'person_symbol': ['ram','mum','mum','ram','ram','mum','mum','ram','mah','mah'],
'person_count': ['10','14','25','20','34','23','43','34','10','20'],
'month': ['2017-01-23','2017-01-23','2017-01-23','2017-01-23','2017-02-22','2017-02-22','2017-02-22','2017-02-22','2017-03-03','2017-03-03'],
'sir': ['a','a','a','a','b','b','b','b','c','c']})
df = df[['sport_name','person_name','city','person_symbol','person_count','month','sir']]
df['person_name'] = df['person_name'].apply(symbology)
df['person_count'] = df['person_count'].astype(int)
print df
df1=df.set_index(['sport_name','person_name','person_count','month','sir']).stack().reset_index(name='val')
df1['derived_symbol'] = df1['sport_name'] + '.' + df1['person_name'] + '.TOTAL.' + df1['val'] + '_count'
df2 = df1.groupby(['derived_symbol','month','sir','sport_name','person_name','level_5','val'])['person_count'].sum().reset_index(name='person_count')
df3 = df2.set_index(['derived_symbol','month','sir','sport_name','person_name','person_count','level_5'])['val'].unstack().fillna('No Entry').rename_axis(None, 1).reset_index()
df['derived_symbol'] = df['sport_name'] + '.' + df['person_name'] + '.' + df['city'] + "."+ df['person_symbol'] + '_count'
df4 = pd.concat([df, df3]).reset_index(None)
print df3
del df4['index']
df4 = df4[['derived_symbol','sport_name','person_name','city','person_symbol','month','sir','person_count']]
print df4
便利:
Convenience:
d = {'city': {0: 'mumbai',
1: 'mumbai',
2: 'delhi',
3: 'delhi',
4: 'mumbai',
5: 'mumbai',
6: 'delhi',
7: 'delhi',
8: 'pune',
9: 'nagpur',
10: 'No Entry',
11: 'nagpur',
12: 'pune',
13: 'delhi',
14: 'delhi',
15: 'No Entry',
16: 'No Entry',
17: 'mumbai',
18: 'mumbai',
19: 'No Entry',
20: 'No Entry'},
'derived_symbol': {0: 'football.RAM.mumbai.ram_count',
1: 'football.RAM.mumbai.mum_count',
2: 'football.RAM.delhi.mum_count',
3: 'football.RAM.delhi.ram_count',
4: 'football.RAM.mumbai.ram_count',
5: 'football.RAM.mumbai.mum_count',
6: 'football.RAM.delhi.mum_count',
7: 'football.RAM.delhi.ram_count',
8: 'basketball.MAH.pune.mah_count',
9: 'basketball.MAH.nagpur.mah_count',
10: 'basketball.MAH.TOTAL.mah_count',
11: 'basketball.MAH.TOTAL.nagpur_count',
12: 'basketball.MAH.TOTAL.pune_count',
13: 'football.RAM.TOTAL.delhi_count',
14: 'football.RAM.TOTAL.delhi_count',
15: 'football.RAM.TOTAL.mum_count',
16: 'football.RAM.TOTAL.mum_count',
17: 'football.RAM.TOTAL.mumbai_count',
18: 'football.RAM.TOTAL.mumbai_count',
19: 'football.RAM.TOTAL.ram_count',
20: 'football.RAM.TOTAL.ram_count'},
'month': {0: '2017-01-23',
1: '2017-01-23',
2: '2017-01-23',
3: '2017-01-23',
4: '2017-02-22',
5: '2017-02-22',
6: '2017-02-22',
7: '2017-02-22',
8: '2017-03-03',
9: '2017-03-03',
10: '2017-03-03',
11: '2017-03-03',
12: '2017-03-03',
13: '2017-01-23',
14: '2017-02-22',
15: '2017-01-23',
16: '2017-02-22',
17: '2017-01-23',
18: '2017-02-22',
19: '2017-01-23',
20: '2017-02-22'},
'person_count': {0: 10,
1: 14,
2: 25,
3: 20,
4: 34,
5: 23,
6: 43,
7: 34,
8: 10,
9: 20,
10: 30,
11: 20,
12: 10,
13: 45,
14: 77,
15: 39,
16: 66,
17: 24,
18: 57,
19: 30,
20: 68},
'person_name': {0: 'RAM',
1: 'RAM',
2: 'RAM',
3: 'RAM',
4: 'RAM',
5: 'RAM',
6: 'RAM',
7: 'RAM',
8: 'MAH',
9: 'MAH',
10: 'MAH',
11: 'MAH',
12: 'MAH',
13: 'RAM',
14: 'RAM',
15: 'RAM',
16: 'RAM',
17: 'RAM',
18: 'RAM',
19: 'RAM',
20: 'RAM'},
'person_symbol': {0: 'ram',
1: 'mum',
2: 'mum',
3: 'ram',
4: 'ram',
5: 'mum',
6: 'mum',
7: 'ram',
8: 'mah',
9: 'mah',
10: 'mah',
11: 'No Entry',
12: 'No Entry',
13: 'No Entry',
14: 'No Entry',
15: 'mum',
16: 'mum',
17: 'No Entry',
18: 'No Entry',
19: 'ram',
20: 'ram'},
'sir': {0: 'a',
1: 'a',
2: 'a',
3: 'a',
4: 'b',
5: 'b',
6: 'b',
7: 'b',
8: 'c',
9: 'c',
10: 'c',
11: 'c',
12: 'c',
13: 'a',
14: 'b',
15: 'a',
16: 'b',
17: 'a',
18: 'b',
19: 'a',
20: 'b'},
'sport_name': {0: 'football',
1: 'football',
2: 'football',
3: 'football',
4: 'football',
5: 'football',
6: 'football',
7: 'football',
8: 'basketball',
9: 'basketball',
10: 'basketball',
11: 'basketball',
12: 'basketball',
13: 'football',
14: 'football',
15: 'football',
16: 'football',
17: 'football',
18: 'football',
19: 'football',
20: 'football'}}
推荐答案
What you can do is merge
the dataframe into itself, after computing the month number (from the date), and the previous one as well.
让我们开始计算这两个值.为了方便起见,我首先将原始的month
字符串值转换为datetime,这使我可以使用
Let's start with computing those 2 values. For convenience purposes, I firstly converted the raw month
string value to datetime, which allowed me to use relativedelta
to compute the previous month. This ensures behaviour is correct, even after a change of year.
In [7]: df['month'] = pd.to_datetime(df['month'])
In [8]: df['month_num'] = df['month'].apply(lambda x: x.strftime('%Y-%m'))
In [9]: from dateutil.relativedelta import relativedelta
In [10]: df['previous_month_num'] = df['month'].apply(lambda x: (x + relativedelta(months=-1)).strftime('%Y-%m'))
In [11]: df
Out[11]:
city month person_count person_name person_symbol sir sport_name \
0 mumbai 2017-01-23 10 ramesh ram a football
1 mumbai 2017-01-23 14 ramesh mum a football
2 delhi 2017-01-23 25 ramesh mum a football
3 delhi 2017-01-23 20 ramesh ram a football
4 mumbai 2017-02-22 34 ramesh ram b football
5 mumbai 2017-02-22 23 ramesh mum b football
6 delhi 2017-02-22 43 ramesh mum b football
7 delhi 2017-02-22 34 ramesh ram b football
8 pune 2017-03-03 10 mahesh mah c basketball
9 nagpur 2017-03-03 20 mahesh mah c basketball
month_num previous_month_num
0 2017-01 2016-12
1 2017-01 2016-12
2 2017-01 2016-12
3 2017-01 2016-12
4 2017-02 2017-01
5 2017-02 2017-01
6 2017-02 2017-01
7 2017-02 2017-01
8 2017-03 2017-02
9 2017-03 2017-02
然后,我们可以使用计算的月份值作为合并键,将数据框合并到自身中:
We can then merge the dataframe into itself, using the computed month values as merging keys:
In [12]: relevant_columns = ['city', 'person_symbol', 'sport_name']
In [13]: pd.merge(df, df, left_on=relevant_columns + ['previous_month_num'], right_on=rele
...: vant_columns + ['month_num'], how='left', suffixes=('', '_previous'))[list(df.col
...: umns) + ['person_count_previous']].fillna(0).drop(['month_num', 'previous_month_n
...: um'], axis=1)
Out[13]:
city month person_count person_name person_symbol sir sport_name \
0 mumbai 2017-01-23 10 ramesh ram a football
1 mumbai 2017-01-23 14 ramesh mum a football
2 delhi 2017-01-23 25 ramesh mum a football
3 delhi 2017-01-23 20 ramesh ram a football
4 mumbai 2017-02-22 34 ramesh ram b football
5 mumbai 2017-02-22 23 ramesh mum b football
6 delhi 2017-02-22 43 ramesh mum b football
7 delhi 2017-02-22 34 ramesh ram b football
8 pune 2017-03-03 10 mahesh mah c basketball
9 nagpur 2017-03-03 20 mahesh mah c basketball
person_count_previous
0 0
1 0
2 0
3 0
4 10
5 14
6 25
7 20
8 0
9 0
一些评论:
- 我使用
['city', 'person_symbol', 'sport_name']
作为参考列,但可以随意添加更多内容,具体取决于您要实现的目标. - 新列名为
person_count_previous
,但是您可以fillna
. - 我使用
drop
,但随时保留它们.
- I used
['city', 'person_symbol', 'sport_name']
as the reference columns, but feel free to add some more, depending on what exactly you want to achieve. - The new column is named
person_count_previous
, but you canrename
it, should it be best for you. - By default, when there is no match for the previous count, the column will be NaN. I replaced the values with 0, thanks to
fillna
. - I removed the "temporary" columns using
drop
, but feel free to keep them.
这篇关于如何跟踪 pandas 数据框中的前一个日期记录列?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!