Used groupby to select most recent data, want to append a column that returns the date of the data


Problem description

I originally had a dataframe that looked like this:

                                  industry    population %of rural land
        country       date        
        Australia     2017-01-01  NaN         NaN        NaN
                      2016-01-01  24.327571   18.898304  12
                      2015-01-01  25.396251   18.835267  12
                      2014-01-01  27.277007   18.834835  13
        United States 2017-01-01  NaN         NaN        NaN
                      2016-01-01  NaN         19.028231  NaN
                      2015-01-01  20.027274   19.212860  NaN
                      2014-01-01  20.867359   19.379071  NaN

I applied the following code, which pulls the most recent non-null value of each column for each country and produces the following dataset:

df = df.groupby(level=0).first()

               industry  population  %of rural land
country                             
Australia      24.327571   18.898304 12
United States  20.027274   19.028231 NaN

Is there any way to add a column that shows the year of the data as well? When the most recent values for a country come from different years, the new data frame should show the oldest of those years. For Australia that would be 2016, and for the US it would be 2015. Ideally, the data frame would look like this:

               year      industry  population  %of rural land
country                             
Australia      2016      24.327571   18.898304 12
United States  2015      20.027274   19.028231 NaN
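
To reproduce the question, the sample frame can be rebuilt roughly as follows. This is only a sketch: the index level names `country` and `date` are taken from the printed frame, the values are typed in by hand from it, and the name `latest` is illustrative.

```python
import numpy as np
import pandas as pd

# Two index levels: country (outer) and date (inner), newest date first
index = pd.MultiIndex.from_product(
    [
        ["Australia", "United States"],
        pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"]),
    ],
    names=["country", "date"],
)

df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12, 12, 13,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)

# first() returns the first non-null value of each column within each group
latest = df.groupby(level=0).first()
print(latest)
```

Because `first()` skips missing values column by column, the values in one result row can come from different dates, which is why the year has to be computed separately.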

Solution

I think you need the year of the first row without NaNs for each country. Create a helper Series with dropna, then insert it as a new column:

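# year of the first row per country that has no NaNs
# (the most recent one, since dates are sorted descending)
s = df.dropna().reset_index(level=1)['date'].dt.year.groupby(level=0).first()
# first non-null value of each column per country
df1 = df.groupby(level=0).first()
# map the country index to its year and insert it as the first column
df1.insert(0, 'year', df1.rename(s).index)
#alternative
#df1.insert(0, 'year', df1.index.to_series().map(s))
print (df1)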
               year   industry  population
country                                   
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
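
The one-liner that builds `s` can be unpacked step by step. The sketch below assumes a frame with only the `industry` and `population` columns, matching the printed result above; the intermediate names `complete` and `dates` are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        ["Australia", "United States"],
        pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"]),
    ],
    names=["country", "date"],
)
# Assumption: only the two columns shown in the printed result
df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
    },
    index=index,
)

# 1. keep only the rows in which every column has a value
complete = df.dropna()
# 2. move the date level into a regular column; the index is now just country
dates = complete.reset_index(level="date")["date"]
# 3. take the year of the first remaining row per country
s = dates.dt.year.groupby(level=0).first()

df1 = df.groupby(level=0).first()
df1.insert(0, "year", df1.index.to_series().map(s))
print(df1)
```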

Another solution replaces the date with a missing value in every row that contains a NaN, takes the first value per group, and finally extracts the year with dt.year:

df1 = (df.reset_index(level=1)
        # blank out the date in rows that contain any NaN
        .assign(date=lambda x: x['date'].where(df.notnull().all(axis=1).values))
        .groupby(level=0).first()
        .assign(date=lambda x: x['date'].dt.year)
        .rename(columns={'date':'year'}))
print (df1)
               year   industry  population
country                                   
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
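
Both variants rely on each country having at least one row with no missing values. The sketch below assumes the three-column frame from the question and illustrates what happens when a column is entirely NaN for one country: `dropna` discards every United States row, so no year is found for it. The name `year` is illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        ["Australia", "United States"],
        pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"]),
    ],
    names=["country", "date"],
)
df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        # entirely NaN for the United States
        "%of rural land": [np.nan, 12, 12, 13,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)

# Every United States row has a NaN in '%of rural land', so dropna removes them all
s = df.dropna().reset_index(level=1)["date"].dt.year.groupby(level=0).first()
print(s)

# Mapping the countries to s leaves the United States without a year
year = df.groupby(level=0).first().index.to_series().map(s)
print(year)
```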

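EDIT: If a column is entirely NaN for one country (as %of rural land is for the United States), no row of that country is free of NaNs, so the solutions above find no year for it. A custom function that ignores all-NaN columns within each group handles this case: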

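def f(x):
    #boolean mask of missing values
    m = x.isnull()
    #drop columns that are entirely NaN in this group
    m = m.loc[:, ~m.all()]
    #year of the first row without NaNs; index entries are (country, date)
    #axis is passed by keyword, which works across pandas versions
    m = m[~m.any(axis=1)].index[0][1].year
    return m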

s = df.groupby(level=0).apply(f)
print (s)
country
Australia        2016
United States    2015
dtype: int64

df1 = df.groupby(level=0).first()
df1.insert(0, 'year', df1.rename(s).index)
#alternative
#df1.insert(0, 'year', df1.index.to_series().map(s))
print (df1)
               year   industry  population  %of rural land
country                                                   
Australia      2016  24.327571   18.898304            12.0
United States  2015  20.027274   19.028231             NaN
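
The same idea can also be written with `dropna` instead of boolean masks. This is a sketch rather than the answer's own code: the helper name `first_complete_year` is hypothetical, and it assumes every country has at least one row that is complete once its all-NaN columns are ignored.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [
        ["Australia", "United States"],
        pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"]),
    ],
    names=["country", "date"],
)
df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12, 12, 13,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)


def first_complete_year(group):
    """Year of the first row with no NaNs, ignoring all-NaN columns."""
    # drop columns that hold no data at all for this country
    usable = group.dropna(axis=1, how="all")
    # keep rows where every remaining column has a value
    complete = usable.dropna()
    # raises IndexError if no complete row exists (assumed not to happen here)
    return complete.index.get_level_values("date")[0].year


s = df.groupby(level="country").apply(first_complete_year)
df1 = df.groupby(level="country").first()
# s is indexed by country, so insert aligns it with df1's index
df1.insert(0, "year", s)
print(df1)
```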
