Used groupby to select most recent data, want to append a column that returns the date of the data

Views: 123 Posted: 2018/5/30 14:15:17 Tags: python group-by pandas-groupby


Problem description

 
I originally had a dataframe that looked like this:
                                  industry    population %of rural land
        country       date        
        Australia     2017-01-01  NaN         NaN        NaN
                      2016-01-01  24.327571   18.898304  12
                      2015-01-01  25.396251   18.835267  12
                      2014-01-01  27.277007   18.834835  13
        United States 2017-01-01  NaN         NaN        NaN
                      2016-01-01  NaN         19.028231  NaN
                      2015-01-01  20.027274   19.212860  NaN
                      2014-01-01  20.867359   19.379071  NaN
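For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the frame above. The values are copied from the table; the construction itself (`MultiIndex.from_product`, the variable names `dates` and `index`) is my assumption, since the question does not show how the frame was created.

```python
import numpy as np
import pandas as pd

# Dates in the same descending order as the table in the question.
dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product(
    [["Australia", "United States"], dates], names=["country", "date"]
)

# Values copied from the question; the first four rows are Australia,
# the last four are United States.
df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12, 12, 13,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)
print(df)
```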
I applied the following code, which pulls the most recent non-null value in each column for each country and produces the following dataset:
df = df.groupby(level=0).first()

               industry  population  %of rural land
country                             
Australia      24.327571   18.898304 12
United States  20.027274   19.028231 NaN
Is there any way to add a column that shows the year of the data as well? And where the values for a country come from different years, can the new dataframe return the oldest of those years? So for Australia that would be 2016, and for the US it would be 2015. Ideally, the dataframe would look like this:
               year      industry  population  %of rural land
country                             
Australia      2016      24.327571   18.898304 12
United States  2015      20.027274   19.028231 NaN

Solution
I think you need the year of the first row without NaNs for each country. Create a helper Series with dropna and then:
s = df.dropna().reset_index(level=1)['date'].dt.year.groupby(level=0).first()
df1 = df.groupby(level=0).first()
df1.insert(0, 'year', df1.rename(s).index)
#alternative
#df1.insert(0, 'year', df1.index.to_series().map(s))
print (df1)
               year   industry  population
country                                   
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
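The output above only has the industry and population columns, so here is a self-contained sketch of the same approach on those two columns. The frame construction is my assumption; the values come from the question. The last few lines show what seems to be the reason the EDIT below was needed: once a column is entirely NaN for one country, `dropna()` discards every row of that country.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product(
    [["Australia", "United States"], dates], names=["country", "date"]
)
# Only the two columns shown in the output above.
df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
    },
    index=index,
)

# Helper Series: year of the first fully populated row per country.
s = df.dropna().reset_index(level=1)["date"].dt.year.groupby(level=0).first()

# First non-null value per column and country, plus the year looked up by country.
df1 = df.groupby(level=0).first()
df1.insert(0, "year", df1.index.to_series().map(s))
print(df1)

# With a column that is all NaN for United States, dropna() removes
# every United States row, so that country is missing from the helper.
df_all_nan = df.assign(**{"%of rural land": [np.nan, 12, 12, 13] + [np.nan] * 4})
s_missing = (
    df_all_nan.dropna().reset_index(level=1)["date"].dt.year.groupby(level=0).first()
)
print(s_missing)
```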
Another solution: set the date to NaN (NaT) on rows that contain any NaN, take the first row per country, and finally extract the year with dt.year:
df1 = (df.reset_index(level=1)
        .assign(date=lambda x: x['date'].where(df.notnull().all(1).values))
        .groupby(level=0).first()
        .assign(date=lambda x: x['date'].dt.year)
        .rename(columns={'date':'year'}))
print (df1)
               year   industry  population
country                                   
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
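If the chained version is hard to follow, here is the same idea unrolled step by step so the masked date column can be inspected. This is a sketch on the two-column data; the names `complete` and `flat` are mine, not from the answer.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product(
    [["Australia", "United States"], dates], names=["country", "date"]
)
df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
    },
    index=index,
)

# True only for rows where every column has a value.
complete = df.notna().all(axis=1)

# Move the date level into a regular column; the index is now just country.
flat = df.reset_index(level=1)
# Replace the date with NaT on incomplete rows so first() skips it.
flat["date"] = flat["date"].where(complete.values)
print(flat)

# first() takes the first non-null entry per column, dates included.
df1 = flat.groupby(level=0).first()
df1["date"] = df1["date"].dt.year
df1 = df1.rename(columns={"date": "year"})
print(df1)
```

Because the date is just another column here, `first()` treats it like the data columns and picks its first non-null entry per country.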
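EDIT: If a column is entirely NaN for a country (as %of rural land is for United States), ignore that column when looking for the first complete row: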
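def f(x):
    # mask of missing values in this country's rows
    m = x.isnull()
    # drop columns that are entirely NaN for this country
    m = m.loc[:, ~m.all()]
    # year of the first row with no NaNs in the remaining columns
    # (pass axis by keyword: DataFrame.any is keyword-only in pandas 2.0+)
    return m[~m.any(axis=1)].index[0][1].year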

s = df.groupby(level=0).apply(f)
print (s)
country
Australia        2016
United States    2015
dtype: int64

df1 = df.groupby(level=0).first()
df1.insert(0, 'year', df1.rename(s).index)
#alternative
#df1.insert(0, 'year', df1.index.to_series().map(s))
print (df1)
               year   industry  population  %of rural land
country                                                   
Australia      2016  24.327571   18.898304            12.0
United States  2015  20.027274   19.028231             NaN
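A possible alternative, which is not part of the answer above: work out, for each column, which date `first()` actually took its value from, then keep the oldest of those years. This is only a sketch under my own naming (`date_col`, `source_dates`, `year`), and it assumes every country has at least one non-null value, otherwise the final `astype(int)` would fail.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
index = pd.MultiIndex.from_product(
    [["Australia", "United States"], dates], names=["country", "date"]
)
df = pd.DataFrame(
    {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12, 12, 13,
                           np.nan, np.nan, np.nan, np.nan],
    },
    index=index,
)

# Date of every row as a Series aligned with df's MultiIndex.
date_col = df.index.get_level_values("date").to_series(index=df.index)

# For each column, keep the date only where that column has a value,
# then take the first such date per country (the row first() draws from).
source_dates = (
    df.notna()
    .apply(lambda present: date_col.where(present))
    .groupby(level="country")
    .first()
)

# Oldest of the per-column years; a column with no data at all is NaN and skipped.
year = source_dates.apply(lambda col: col.dt.year).min(axis=1).astype(int)

df1 = df.groupby(level="country").first()
df1.insert(0, "year", year)
print(df1)
```

On the sample data this gives the same years as the EDIT above. The two can differ in general: `f` looks for the first row that is complete across all usable columns, while this version follows each column separately, so it reports the oldest year that any of the returned values actually comes from.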


                        