Pandas: filling missing values iterating through a groupby object

Problem description

I have the following dataset:

import numpy as np
import pandas as pd

d = {'player': ['1', '1', '1', '1', '1', '1', '1', '1', '1', '2', '2',
                '2', '2', '2', '2', '3', '3', '3', '3', '3'],
     'session': ['a', 'a', 'b', np.nan, 'b', 'c', 'c', 'c', 'c', 'd', 'd',
                 'e', 'e', np.nan, 'e', 'f', 'f', 'g', np.nan, 'g'],
     'date': ['2018-01-01 00:19:05', '2018-01-01 00:21:07',
              '2018-01-01 00:22:07', '2018-01-01 00:22:15', '2018-01-01 00:25:09',
              '2018-01-01 00:25:11', '2018-01-01 00:27:28', '2018-01-01 00:29:29',
              '2018-01-01 00:30:35', '2018-01-01 00:21:16', '2018-01-01 00:35:22',
              '2018-01-01 00:38:16', '2018-01-01 00:38:20', '2018-01-01 00:40:35',
              '2018-01-01 01:31:16', '2018-01-03 00:55:22', '2018-01-03 00:58:16',
              '2018-01-03 00:58:21', '2018-03-01 01:00:35', '2018-03-01 01:31:16']
     }

# create the dataframe
df = pd.DataFrame(data=d)
# convert the date strings to datetime
df['date'] = pd.to_datetime(df['date'])

df.head()

     player session        date
0       1       a 2018-01-01 00:19:05
1       1       a 2018-01-01 00:21:07
2       1       b 2018-01-01 00:22:07
3       1     NaN 2018-01-01 00:22:15
4       1       b 2018-01-01 00:25:09

So, these are my three columns:

  1. 'player' - with three players (1,2,3) - dtype = object
  2. 'session' (object). Each session id groups together a set of actions (i.e. the rows in the dataset) that the players have implemented online.
  3. 'date' (datetime object) tells us the time at which each action was implemented.

The problem with this dataset is that I have a timestamp for each action, but some of the actions are missing their session id. What I want to do is the following: for each player, assign a session id to the rows where it is missing, based on the timeline. An action without an id can be labeled if it falls within the temporal range (first action to last action) of a certain session.

Let's say I group by player and session, and compute the time range of each session:

my_agg = df.groupby(['player', 'session'])['date'].agg(['min', 'max'])
my_agg

                           min                 max
player session                                        
1      a       2018-01-01 00:19:05 2018-01-01 00:21:07
       b       2018-01-01 00:22:07 2018-01-01 00:25:09
       c       2018-01-01 00:25:11 2018-01-01 00:30:35
2      d       2018-01-01 00:21:16 2018-01-01 00:35:22
       e       2018-01-01 00:38:16 2018-01-01 01:31:16
3      f       2018-01-03 00:55:22 2018-01-03 00:58:16
       g       2018-01-03 00:58:21 2018-03-01 01:31:16

At this point I would like to iterate through every player and compare the timestamps of my NaN values, session by session, to see where they belong.

Desired output: in the example, the first NaN should be labeled 'b', the second one 'e' and the last one 'g'.
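The per-player comparison described above could be sketched roughly as follows. This is only an illustration, not code from the original post: the helper name `label_by_range` is hypothetical, the frame is a subset of the question's rows, and the rule "label only when exactly one session range contains the timestamp" is an assumption about how ambiguous cases should be handled.

```python
import numpy as np
import pandas as pd

# A subset of the question's rows: two players, one missing session id each.
df = pd.DataFrame({
    "player": ["1", "1", "1", "1", "2", "2", "2", "2"],
    "session": ["a", "b", np.nan, "b", "d", "e", np.nan, "e"],
    "date": pd.to_datetime([
        "2018-01-01 00:19:05", "2018-01-01 00:22:07",
        "2018-01-01 00:22:15", "2018-01-01 00:25:09",
        "2018-01-01 00:21:16", "2018-01-01 00:38:16",
        "2018-01-01 00:40:35", "2018-01-01 01:31:16",
    ]),
})


def label_by_range(group):
    """Fill missing session ids for one player from the session time ranges."""
    group = group.copy()
    # First and last timestamp of every known session of this player.
    known = group.dropna(subset=["session"])
    ranges = known.groupby("session")["date"].agg(["min", "max"])
    for idx in group.index[group["session"].isna()]:
        ts = group.at[idx, "date"]
        hits = ranges.index[(ranges["min"] <= ts) & (ranges["max"] >= ts)]
        # Assumption: label only when exactly one range contains the timestamp.
        if len(hits) == 1:
            group.at[idx, "session"] = hits[0]
    return group


# Iterate through the players, label each group, and restore the row order.
filled = pd.concat([label_by_range(g) for _, g in df.groupby("player")])
filled = filled.sort_index()
print(filled)
```

Rows whose timestamp falls outside every session range, or inside more than one, are left as NaN, so they remain visible for manual inspection.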

Disclaimer: I asked a similar question a few days ago (see here) and received a very good answer, but this time I must take another variable into account and I am stuck again. The first steps in Python are exciting but very challenging.

Recommended answer

Your example is already sorted, but this should produce your desired result even if your inputs are not sorted. If this answer does not satisfy your requirements, please post an additional (or modified) sample dataframe, with the expected output, for a case where it fails.

df.sort_values(['player', 'date']).ffill()

Yields:

   player session                date
0       1       a 2018-01-01 00:19:05
1       1       a 2018-01-01 00:21:07
2       1       b 2018-01-01 00:22:07
3       1       b 2018-01-01 00:22:15
4       1       b 2018-01-01 00:25:09
5       1       c 2018-01-01 00:25:11
6       1       c 2018-01-01 00:27:28
7       1       c 2018-01-01 00:29:29
8       1       c 2018-01-01 00:30:35
9       2       d 2018-01-01 00:21:16
10      2       d 2018-01-01 00:35:22
11      2       e 2018-01-01 00:38:16
12      2       e 2018-01-01 00:38:20
13      2       e 2018-01-01 00:40:35
14      2       e 2018-01-01 01:31:16
15      3       f 2018-01-03 00:55:22
16      3       f 2018-01-03 00:58:16
17      3       g 2018-01-03 00:58:21
18      3       g 2018-03-01 01:00:35
19      3       g 2018-03-01 01:31:16
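The forward fill above runs over the whole sorted frame, so a missing id on a player's first row would inherit the previous player's last session. A possible variant, shown here as a sketch on a small hypothetical frame (these rows are not from the question), forward-fills within each player only:

```python
import numpy as np
import pandas as pd

# Hypothetical data: player '2' starts with a row whose session id is missing.
df = pd.DataFrame({
    "player": ["1", "1", "2", "2", "2"],
    "session": ["a", "a", np.nan, "d", np.nan],
    "date": pd.to_datetime([
        "2018-01-01 00:19:05", "2018-01-01 00:21:07",
        "2018-01-01 00:20:00", "2018-01-01 00:21:16",
        "2018-01-01 00:35:22",
    ]),
})

ordered = df.sort_values(["player", "date"])

# Plain forward fill: row 2 picks up session 'a' from player '1'.
plain = ordered.ffill()

# Grouped forward fill: values never cross a player boundary.
grouped = ordered.copy()
grouped["session"] = ordered.groupby("player")["session"].ffill()

print(plain)
print(grouped)
```

In the grouped version the first row of player '2' stays NaN because no earlier action of that player carries a session id, while the last row is filled with 'd'.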
