Pandas: fill missing values of a column based on the datetime values of another column


Question

Python newbie here, this is my first question. I tried to find a solution in several similar SO questions, but I think my problem is different.

Here's my situation: I have a fairly large dataset with two columns: Date (datetime object) and session_id (integer). Each timestamp is the moment when a certain action occurred during an online session.

My problem is that I have all the dates, but some of the corresponding session_id values are missing. I would like to fill these missing values using the date column:

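  1. If the action occurred between the first and last date of a single session, I want to fill the missing value with the id of that session.
  2. If the action occurred outside the time range of every session, I want to mark it as "0".
  3. If the action cannot be associated with a single session, because it occurred within the time range of several different sessions, I want to mark it as "-99".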

To give an example of my problem, consider the toy dataset below, which has just three sessions: a, b, c. Sessions a and b each registered three events, and session c registered two. In addition, three id values are missing.

   |       DATE          |sess_id|
----------------------------------
 0 | 2018-01-01 00:19:01 | a     |
 1 | 2018-01-01 00:19:05 | b     |
 2 | 2018-01-01 00:21:07 | a     |
 3 | 2018-01-01 00:22:07 | b     |
 4 | 2018-01-01 00:25:09 | c     |
 5 | 2018-01-01 00:25:11 | NaN   |
 6 | 2018-01-01 00:27:28 | c     |
 7 | 2018-01-01 00:29:29 | a     |
 8 | 2018-01-01 00:30:35 | NaN   |
 9 | 2018-01-01 00:31:16 | b     |
10 | 2018-01-01 00:35:22 | NaN   |
...
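
For anyone who wants to experiment, the toy dataset above can be rebuilt roughly as follows. This is only a sketch: the frame name foo and the column names DATE and sess_id are taken from the snippets in this thread, and storing sess_id with object dtype is an assumption made here so that the column can later hold both string ids and the integer markers 0 and -99.

```python
import numpy as np
import pandas as pd

# Toy data from the table above; NaN marks a missing session id.
foo = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2018-01-01 00:19:01", "2018-01-01 00:19:05", "2018-01-01 00:21:07",
        "2018-01-01 00:22:07", "2018-01-01 00:25:09", "2018-01-01 00:25:11",
        "2018-01-01 00:27:28", "2018-01-01 00:29:29", "2018-01-01 00:30:35",
        "2018-01-01 00:31:16", "2018-01-01 00:35:22",
    ]),
    # dtype=object is an assumption: it lets strings and integers share the column.
    "sess_id": pd.Series(
        ["a", "b", "a", "b", "c", np.nan, "c", "a", np.nan, "b", np.nan],
        dtype=object,
    ),
})

# Number of events whose session id is missing.
n_missing = int(foo["sess_id"].isna().sum())
print(n_missing)
```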

[Image_Timeline example][1]

This is what I would like to obtain:

   |       DATE          |sess_id|
----------------------------------
 0 | 2018-01-01 00:19:01 | a    | 
 1 | 2018-01-01 00:19:05 | b    | 
 2 | 2018-01-01 00:21:07 | a    |
 3 | 2018-01-01 00:22:07 | b    | 
 4 | 2018-01-01 00:25:09 | c    |         
 5 | 2018-01-01 00:25:11 | -99  |
 6 | 2018-01-01 00:27:28 | c    | 
 7 | 2018-01-01 00:29:29 | a    | 
 8 | 2018-01-01 00:30:35 | b    | 
 9 | 2018-01-01 00:31:16 | b    | 
10 | 2018-01-01 00:35:22 | 0    | 
...

In this way I would be able to recover at least some of the events that have no session code. I think that maybe the first thing to do is to compute two new columns holding the first and last time value for each session, something like this:

foo['last'] = foo.groupby('sess_id')['DATE'].transform('max')
foo['first'] = foo.groupby('sess_id')['DATE'].transform('min')

And then use the first and last time values to check whether each event whose session id is unknown falls within that range.

Recommended answer

Your intuition seems fine to me, but per-row first and last columns will not help directly: they only describe rows that already have a sess_id, whereas the rows you need to fill have none. If you want those columns anyway, you could also build them by mapping the per-session aggregates back onto sess_id, like this:

foo['last'] = foo.sess_id.map(foo.groupby('sess_id').DATE.max())
foo['first'] = foo.sess_id.map(foo.groupby('sess_id').DATE.min())

But I don't think it's necessary; you can just use the aggregated groupby result directly.
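
As a rough illustration of what "using the groupby result directly" means, the sketch below aggregates the toy data from the question into one row per session, with its first and last timestamp. The data are re-created here so the snippet runs on its own; the name my_agg matches the snippet further down, and the object dtype for sess_id is an assumption of this example.

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2018-01-01 00:19:01", "2018-01-01 00:19:05", "2018-01-01 00:21:07",
        "2018-01-01 00:22:07", "2018-01-01 00:25:09", "2018-01-01 00:25:11",
        "2018-01-01 00:27:28", "2018-01-01 00:29:29", "2018-01-01 00:30:35",
        "2018-01-01 00:31:16", "2018-01-01 00:35:22",
    ]),
    "sess_id": pd.Series(
        ["a", "b", "a", "b", "c", np.nan, "c", "a", np.nan, "b", np.nan],
        dtype=object,
    ),
})

# One row per session id; rows with a missing id are left out of the groups.
my_agg = foo.groupby("sess_id")["DATE"].agg(["min", "max"])
print(my_agg)
```

Each missing event then only has to be compared against this small table rather than against the full event log.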

One way to solve your problem is to look for the missing values in the sess_id column and apply a custom function to the corresponding dates:

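# Per-session first ('min') and last ('max') timestamp, indexed by sess_id.
my_agg = foo.groupby('sess_id').DATE.agg(['min', 'max'])

def my_custom_function(time):
    # Sessions whose time range strictly contains this timestamp.
    current_sessions = my_agg.loc[(my_agg['min'] < time) & (my_agg['max'] > time)]
    count = len(current_sessions)
    if count == 0:
        return 0      # outside every session
    if count > 1:
        return -99    # ambiguous: inside several sessions
    return current_sessions.index[0]

# Fill only the rows whose sess_id is missing.
missing = foo.sess_id.isnull()
foo.loc[missing, 'sess_id'] = foo.loc[missing, 'DATE'].apply(my_custom_function)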

Output:

    DATE                    sess_id
0   2018-01-01 00:19:01     a
1   2018-01-01 00:19:05     b
2   2018-01-01 00:21:07     a
3   2018-01-01 00:22:07     b
4   2018-01-01 00:25:09     c
5   2018-01-01 00:25:11     -99
6   2018-01-01 00:27:28     c
7   2018-01-01 00:29:29     a
8   2018-01-01 00:30:35     b
9   2018-01-01 00:31:16     b
10  2018-01-01 00:35:22     0

I think it does what you are looking for, though the output you posted in your question seems to contain typos.
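
Since the question mentions a fairly large dataset, a row-by-row apply may become slow. The sketch below is an alternative, not the method from the answer above: it applies the same three rules with NumPy broadcasting. The function name fill_sessions and its parameters are hypothetical, it assumes at least one session id is known, and it builds a (missing rows x sessions) boolean matrix, which may itself be too large when there are very many sessions.

```python
import numpy as np
import pandas as pd

def fill_sessions(df, date_col="DATE", id_col="sess_id"):
    """Return a copy of df with missing ids filled (0 = no session, -99 = ambiguous)."""
    out = df.copy()
    # Object dtype so string ids and the integer markers can share the column.
    out[id_col] = out[id_col].astype(object)
    bounds = out.groupby(id_col)[date_col].agg(["min", "max"])
    missing = out[id_col].isna().to_numpy()
    if not missing.any():
        return out
    # Column vector of timestamps, compared against every session range at once.
    times = out.loc[missing, date_col].to_numpy()[:, None]
    inside = (bounds["min"].to_numpy() < times) & (times < bounds["max"].to_numpy())
    n_hits = inside.sum(axis=1)
    # First matching session per row; only used where exactly one session matches.
    candidate = bounds.index.to_numpy(dtype=object)[inside.argmax(axis=1)]
    filled = np.where(n_hits == 0, 0, np.where(n_hits > 1, -99, candidate))
    out.loc[missing, id_col] = filled
    return out

# Usage on the toy data from the question.
foo = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2018-01-01 00:19:01", "2018-01-01 00:19:05", "2018-01-01 00:21:07",
        "2018-01-01 00:22:07", "2018-01-01 00:25:09", "2018-01-01 00:25:11",
        "2018-01-01 00:27:28", "2018-01-01 00:29:29", "2018-01-01 00:30:35",
        "2018-01-01 00:31:16", "2018-01-01 00:35:22",
    ]),
    "sess_id": pd.Series(
        ["a", "b", "a", "b", "c", np.nan, "c", "a", np.nan, "b", np.nan],
        dtype=object,
    ),
})
result = fill_sessions(foo)
print(result)
```

Like the apply version, this uses strict inequalities, so an unlabeled event that falls exactly on a session's first or last timestamp is treated as outside that session.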
