大 pandas 填充NA,但并不都是基于最近的记录 [英] pandas fill NA but not all based on recent past record

查看:16
本文介绍了大 pandas 填充NA,但并不都是基于最近的记录的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有如下所示的数据帧

stud_name   act_qtr year    yr_qty  qtr mov_avg_full    mov_avg_2qtr_min_period
0   ABC Q2  2014    2014Q2  NaN NaN NaN
1   ABC Q1  2016    2016Q1  Q1  13.0    14.5
2   ABC Q4  2016    2016Q4  NaN NaN NaN
3   ABC Q4  2017    2017Q4  NaN NaN NaN
4   ABC Q4  2020    2020Q4  NaN NaN NaN

OP = pd.read_clipboard()

stud_name   qtr year    t_score p_score yr_qty  mov_avg_full    mov_avg_2qtr_min_period
0   ABC Q1  2014    10  11  2014Q1  10.000000   10.0
1   ABC Q1  2015    11  32  2015Q1  10.500000   10.5
2   ABC Q2  2015    13  45  2015Q2  11.333333   12.0
3   ABC Q3  2015    15  32  2015Q3  12.250000   14.0
4   ABC Q4  2015    17  21  2015Q4  13.200000   16.0
5   ABC Q1  2016    12  56  2016Q1  13.000000   14.5
6   ABC Q2  2017    312 87  2017Q2  55.714286   162.0
7   ABC Q3  2018    24  90  2018Q3  51.750000   168.0

df = pd.read_clipboard()

我想根据以下逻辑填充na()

例如:让我们取stud_name = ABC。他有多项犯罪记录。让我们将他的NA作为2020Q4。为了填补这一空白,我们从df中选取stud_name=ABC之前的最新记录(即2018Q3)。同样,如果我们取stud_name = ABC。他的另一张NA唱片是2014Q2。我们从df中选取stud_name=ABC之前(即2014Q1)的最新(先前)记录。我们需要根据yearqty值进行排序,以正确获取最新(先前)记录

我们需要为每个stud_name和大型数据集执行此操作

因此,我们填充mov_avg_fullmov_avg_2qtr_min_period

如果在df数据帧中没有要查看的以前的记录,则将NA保留为原样

我正在尝试下面这样的方法,但它不起作用,而且不正确

Filled = OP.merge(df,on=['stud_name'],how='left')
filled.sort_values(['year','Qty'],inplace=True)
filled['mov_avg_full'].fillna(Filled.groupby('stud_name']['mov_avg_full'].shift())
filled['mov_avg_2qtr_min_period'].fillna(Filled .groupby('stud_name']['mov_avg_2qtr_min_period'].shift())

我希望我的输出如下所示

推荐答案

在这种情况下,您可能希望使用append而不是merge。换句话说,您希望垂直连接,而不是水平连接。然后在按stud_nameyr_qtr对DataFrame进行排序后,可以对其使用groupbyfillna方法。

代码:

import pandas as pd

# Create the sample dataframes
import numpy as np
op = pd.DataFrame({'stud_name': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC'}, 'act_qtr': {0: 'Q2', 1: 'Q1', 2: 'Q4', 3: 'Q4', 4: 'Q4'}, 'year': {0: 2014, 1: 2016, 2: 2016, 3: 2017, 4: 2020}, 'yr_qty': {0: '2014Q2', 1: '2016Q1', 2: '2016Q4', 3: '2017Q4', 4: '2020Q4'}, 'qtr': {0: np.NaN, 1: 'Q1', 2: np.NaN, 3: np.NaN, 4: np.NaN}, 'mov_avg_full': {0: np.NaN, 1: 13.0, 2: np.NaN, 3: np.NaN, 4: np.NaN}, 'mov_avg_2qtr_min_period': {0: np.NaN, 1: 14.5, 2: np.NaN, 3: np.NaN, 4: np.NaN}})
df = pd.DataFrame({'stud_name': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'ABC', 6: 'ABC', 7: 'ABC'}, 'qtr': {0: 'Q1', 1: 'Q1', 2: 'Q2', 3: 'Q3', 4: 'Q4', 5: 'Q1', 6: 'Q2', 7: 'Q3'}, 'year': {0: 2014, 1: 2015, 2: 2015, 3: 2015, 4: 2015, 5: 2016, 6: 2017, 7: 2018}, 't_score': {0: 10, 1: 11, 2: 13, 3: 15, 4: 17, 5: 12, 6: 312, 7: 24}, 'p_score': {0: 11, 1: 32, 2: 45, 3: 32, 4: 21, 5: 56, 6: 87, 7: 90}, 'yr_qty': {0: '2014Q1', 1: '2015Q1', 2: '2015Q2', 3: '2015Q3', 4: '2015Q4', 5: '2016Q1', 6: '2017Q2', 7: '2018Q3'}, 'mov_avg_full': {0: 10.0, 1: 10.5, 2: 11.333333, 3: 12.25, 4: 13.2, 5: 13.0, 6: 55.714286, 7: 51.75}, 'mov_avg_2qtr_min_period': {0: 10.0, 1: 10.5, 2: 12.0, 3: 14.0, 4: 16.0, 5: 14.5, 6: 162.0, 7: 168.0}})

# Append df to op
dfa = op.append(df[['stud_name', 'yr_qty', 'mov_avg_full', 'mov_avg_2qtr_min_period']])

# Sort before applying fillna
dfa = dfa.sort_values(['stud_name', 'yr_qty'])

# Group by stud_name and apply ffill
dfa[['mov_avg_full', 'mov_avg_2qtr_min_period']] = dfa.groupby('stud_name')[['mov_avg_full', 'mov_avg_2qtr_min_period']].fillna(method='ffill')

# Extract the orginal rows from op and deal with columns
dfa = dfa[dfa.act_qtr.notna()].drop('qtr', axis=1)

print(dfa)

输出:

stud_name act_qtr 年份 yr_qty mov_avg_Full mov_avg_2qtr_min_Period
ABC 第二季度 2014年 2014Q2 10 10
ABC 第一季度 2016 2016Q1 13 14.5
ABC 第四季度 2016 2016Q4 13 14.5
ABC 第四季度 2017 2017Q4 55.7143 162
ABC 第四季度 2020 2020Q4 51.75 168

这篇关于大 pandas 填充NA,但并不都是基于最近的记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆