Sort a pandas dataframe series by month name
Problem description
I have a Series object that has:
date price
dec 12
may 15
apr 13
..
Problem statement: I want to group it by month, compute the mean price for each month, and present the result sorted in calendar order.
Desired Output:
month mean_price
Jan XXX
Feb XXX
Mar XXX
I thought of making a list and passing it in a sort function:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
but sort_values doesn't support that for a Series.
One big problem I have is that even though df.sort_values(by='date', ascending=True, inplace=True) works on the initial df, the result of a subsequent groupby doesn't keep the order of the sorted df.
To sum up, I only need these two columns from the initial data frame. I sorted the datetime column, but grouping by month name (dt.strftime('%B')) messed up the order. Now I have to sort the result by month name.
My code:
df  # has 5 columns, though I only need the columns 'date' and 'price'
df.sort_values(by='date', inplace=True)  # at this point it is sorted by date, great
total = df.groupby(df['date'].dt.strftime('%B'))['price'].mean()  # but now the months appear alphabetically instead of in date order
Thanks @Brad Solomon for offering a faster way to capitalize strings!
Note 1 @Brad Solomon's answer using pd.Categorical should use fewer resources than my answer. He shows how to assign an order to your categorical data. You should not miss it :P
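To make Note 1 concrete, here is a minimal sketch of what an ordered categorical could look like. This is not @Brad Solomon's code; the sample data and the variable names are assumptions that mirror the question.

```python
import pandas as pd

# Hypothetical sample data, using lowercase abbreviations as in the question.
df = pd.DataFrame(
    {"Month": ["dec", "jan", "mar", "aug", "aug", "jan", "jan"],
     "Price": [12, 40, 11, 21, 11, 11, 1]}
)

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Attach calendar order to the month labels via an ordered categorical.
df["Month"] = pd.Categorical(df["Month"].str.capitalize(),
                             categories=months, ordered=True)

# groupby sorts by category order; observed=True keeps only the months
# that actually occur in the data.
total = df.groupby("Month", observed=True)["Price"].mean()
print(total)
```

The month labels stay readable in the result, and the order comes from the categories list rather than from the alphabet.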
Alternatively, you can convert the month names to month numbers and sort on those:
import pandas as pd

df = pd.DataFrame([["dec", 12], ["jan", 40], ["mar", 11], ["aug", 21],
                   ["aug", 11], ["jan", 11], ["jan", 1]],
                  columns=["Month", "Price"])
# Preprocessing: capitalize `jan`, `dec` to `Jan` and `Dec`
df["Month"] = df["Month"].str.capitalize()
# Now the dataset should look like
# Month Price
# -----------
# Dec XX
# Jan XX
# Mar XX
# Convert the month to a datetime and keep only the month number so that we can sort on it.
# Use %b because the data use the abbreviated month name.
df["Month"] = pd.to_datetime(df["Month"], format='%b', errors='coerce').dt.month
df = df.sort_values(by="Month")
total = df.groupby(df["Month"])["Price"].mean()
# total
# Month
# 1     17.333333
# 3     11.000000
# 8     16.000000
# 12    12.000000
Note 2
groupby sorts the group keys by default. Make sure to use the same key for sorting and for grouping, as in df = df.sort_values(by=SAME_KEY) and total = (df.groupby(df[SAME_KEY])['Price'].mean()). Otherwise, you may get unintended behavior. See "Groupby preserve order among groups? In which way?" for more information.
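A small sketch of the behavior described in Note 2, assuming the months are already month numbers. The sample data is hypothetical; sort is a documented parameter of groupby.

```python
import pandas as pd

# Hypothetical data: month numbers in arbitrary row order.
df = pd.DataFrame({"Month": [12, 1, 3, 8, 8, 1, 1],
                   "Price": [12, 40, 11, 21, 11, 11, 1]})

# Default (sort=True): group keys come back sorted, whatever the row order was.
sorted_keys = df.groupby("Month")["Price"].mean()

# sort=False: groups appear in order of first appearance in the frame.
first_seen = df.groupby("Month", sort=False)["Price"].mean()

print(list(sorted_keys.index))  # [1, 3, 8, 12]
print(list(first_seen.index))   # [12, 1, 3, 8]
```

This is also why grouping by month name comes back alphabetical: the string keys are sorted, not the rows.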
Note 3
A more computationally efficient way is to compute the mean first and then sort by month. That way, you only sort 12 items rather than the whole df. This reduces the computational cost if you don't need df itself to be sorted.
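One way Note 3 could be sketched: aggregate first, then put the (at most 12) results into calendar order with reindex. The sample data and the months list are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with capitalized month abbreviations.
df = pd.DataFrame(
    {"Month": ["Dec", "Jan", "Mar", "Aug", "Aug", "Jan", "Jan"],
     "Price": [12, 40, 11, 21, 11, 11, 1]}
)

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Aggregate first; df itself is never sorted.
means = df.groupby("Month", sort=False)["Price"].mean()

# Reorder the small result into calendar order and drop months with no data.
total = means.reindex(months).dropna()
print(total)
```

Only the aggregated Series is reordered here, so the cost of the ordering step does not grow with the number of rows in df.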
Note 4 For people who already have month as the index and wonder how to make it categorical, take a look at pandas.CategoricalIndex. @jezrael has a working example of making an ordered categorical index in "Pandas series sort by month index".
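For Note 4, a minimal sketch of an ordered pandas.CategoricalIndex. This is not @jezrael's code; the Series values are made up for illustration.

```python
import pandas as pd

# Hypothetical Series that already has month abbreviations as its index.
s = pd.Series([12.0, 17.0, 11.0, 16.0], index=["Dec", "Jan", "Mar", "Aug"])

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Replace the plain index with an ordered categorical index.
s.index = pd.CategoricalIndex(s.index, categories=months, ordered=True)

# sort_index now follows the category order instead of the alphabet.
s = s.sort_index()
print(s)
```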