Used groupby to select most recent data, want to append a column that returns the date of the data
Problem description
I originally had a dataframe that looked like this:
                          industry  population  %of rural land
country       date
Australia     2017-01-01       NaN         NaN             NaN
              2016-01-01 24.327571   18.898304              12
              2015-01-01 25.396251   18.835267              12
              2014-01-01 27.277007   18.834835              13
United States 2017-01-01       NaN         NaN             NaN
              2016-01-01       NaN   19.028231             NaN
              2015-01-01 20.027274   19.212860             NaN
              2014-01-01 20.867359   19.379071             NaN
I applied the following code which pulled the most recent data for each of the columns for each of the countries and resulted in the following dataset:
df = df.groupby(level=0).first()
               industry  population  %of rural land
country
Australia     24.327571   18.898304              12
United States 20.027274   19.028231             NaN
Is there a way to also add a column that shows the year of the data? And where the most recent values for a country come from different years, can the new dataframe report the oldest of those years? For Australia that would be 2016, and for the United States 2015. Ideally, the dataframe would look like this:
              year   industry  population  %of rural land
country
Australia     2016  24.327571   18.898304              12
United States 2015  20.027274   19.028231             NaN
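For anyone who wants to reproduce this, here is a minimal sketch of the setup. It assumes the index levels are named country and date and that the date level holds real datetimes; the helper name make_sample is only for illustration.

```python
import numpy as np
import pandas as pd


def make_sample():
    """Rebuild the example frame from the question (hypothetical helper)."""
    countries = ["Australia", "United States"]
    dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
    # Assumption: a two-level index named country/date, newest date first.
    index = pd.MultiIndex.from_product([countries, dates], names=["country", "date"])
    data = {
        "industry": [np.nan, 24.327571, 25.396251, 27.277007,
                     np.nan, np.nan, 20.027274, 20.867359],
        "population": [np.nan, 18.898304, 18.835267, 18.834835,
                       np.nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [np.nan, 12, 12, 13,
                           np.nan, np.nan, np.nan, np.nan],
    }
    return pd.DataFrame(data, index=index)


df = make_sample()
# first() takes the first non-null value of each column within each country,
# so different columns can come from different dates.
latest = df.groupby(level=0).first()
print(latest)
```

Because first() skips NaN column by column, the United States row mixes a 2015 industry value with a 2016 population value, which is why a separate year column is needed.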
Solution
I think you need the year of the first row per country that contains no NaNs. Create a helper Series with dropna and then:
# year of the first row per country that has no NaN in any column
s = df.dropna().reset_index(level=1)['date'].dt.year.groupby(level=0).first()
df1 = df.groupby(level=0).first()
# rename maps each country label to its year, so the renamed index holds the years
df1.insert(0, 'year', df1.rename(s).index)
# alternative
# df1.insert(0, 'year', df1.index.to_series().map(s))
print(df1)
               year   industry  population
country
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
Another solution sets the date column to NaN for every row that contains a NaN, takes the first value per group, and finally extracts the year with dt.year:
df1 = (df.reset_index(level=1)
         # keep the date only where the row has no NaN (axis must be a keyword in pandas 2.x)
         .assign(date=lambda x: x['date'].where(df.notnull().all(axis=1).values))
         .groupby(level=0).first()
         .assign(date=lambda x: x['date'].dt.year)
         .rename(columns={'date': 'year'}))
print(df1)
               year   industry  population
country
Australia      2016  24.327571   18.898304
United States  2015  20.027274   19.028231
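EDIT: If a column is entirely NaN for a country (as %of rural land is for the United States), dropna() removes every row of that country, so remove the all-NaN columns per group first: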
def f(x):
    # check NaNs
    m = x.isnull()
    # remove columns that are all NaN within this group
    m = m.loc[:, ~m.all()]
    # year from the first index value (country, date) of the rows without NaNs
    return m[~m.any(axis=1)].index[0][1].year

s = df.groupby(level=0).apply(f)
print(s)
country
Australia        2016
United States    2015
dtype: int64
df1 = df.groupby(level=0).first()
df1.insert(0, 'year', df1.rename(s).index)
# alternative
# df1.insert(0, 'year', df1.index.to_series().map(s))
print(df1)
               year   industry  population  %of rural land
country
Australia      2016  24.327571   18.898304            12.0
United States  2015  20.027274   19.028231             NaN
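The per-group logic above can also be written with dropna alone. This is a sketch rather than the answerer's code: the name first_complete_year is made up for illustration, and it assumes the index levels are named country and date with rows sorted newest first inside each country.

```python
import numpy as np
import pandas as pd

nan = np.nan
dates = pd.to_datetime(["2017-01-01", "2016-01-01", "2015-01-01", "2014-01-01"])
# Assumption: index levels named country/date, newest date first per country.
index = pd.MultiIndex.from_product(
    [["Australia", "United States"], dates], names=["country", "date"]
)
df = pd.DataFrame(
    {
        "industry": [nan, 24.327571, 25.396251, 27.277007,
                     nan, nan, 20.027274, 20.867359],
        "population": [nan, 18.898304, 18.835267, 18.834835,
                       nan, 19.028231, 19.212860, 19.379071],
        "%of rural land": [nan, 12, 12, 13, nan, nan, nan, nan],
    },
    index=index,
)


def first_complete_year(group):
    """Year of the first row with no NaN, ignoring columns that are all NaN."""
    usable = group.dropna(axis=1, how="all")     # drop columns with no data at all
    complete = usable.dropna(axis=0, how="any")  # keep rows with every remaining value
    return complete.index.get_level_values("date")[0].year


# Iterate the groups explicitly instead of using apply; the key is the country.
years = {country: first_complete_year(group) for country, group in df.groupby(level=0)}

result = df.groupby(level=0).first()
result.insert(0, "year", [years[country] for country in result.index])
print(result)
```

If a country has no row that is complete across its usable columns, the [0] lookup raises IndexError, so real data may need a fallback for that case.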