Sorting a pandas dataframe with German umlauts
Problem description
I have a dataframe that I want to sort on one column with sort_values.
The problem is that some of the words begin with a German umlaut.
For example: Österreich, Zürich.
These sort as Zürich, Österreich. The order should be Österreich, Zürich.
Ö should come between N and O.
I have found out how to do this for Python lists using locale and strxfrm. Can I do this directly in the pandas dataframe somehow?
Edit: Thank you. Stef's example worked quite well, but my real-life dataframe contains numbers, for which his version did not work, so I used Alexey's idea. I did the following; it can probably be shortened:
import pandas as pd

df = pd.DataFrame({'location': ['Österreich', 'Zürich', 'Bern', 254345], 'code': ['ö', 'z', 'b', 'v']})
# keep the original index as a column for joining later
df = df.reset_index(drop=False)
# convert int to str so that every value can be normalized
df['location'] = df['location'].astype(str)
# sort by the NFD-normalized location so umlauts sort next to their base letters
df_sort_index = df['location'].str.normalize('NFD').sort_values(ascending=True).reset_index(drop=False)
# drop location so we don't have it in both tables
df = df.drop('location', axis=1)
# inner join on the saved index; the left table fixes the row order
new_df = pd.merge(df_sort_index, df, how='inner', on='index')
# drop the helper index column
# (note: 'location' in new_df now holds the NFD-normalized strings)
new_df = new_df.drop('index', axis=1)
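The steps above can probably be condensed. The following is a minimal sketch, assuming pandas 1.1 or later, where sort_values accepts a key argument; the key is used only for comparison, so the location column keeps its original strings.

```python
import pandas as pd

df = pd.DataFrame({'location': ['Österreich', 'Zürich', 'Bern', 254345],
                   'code': ['ö', 'z', 'b', 'v']})

# Convert mixed values (e.g. int) to str so the .str accessor works.
df['location'] = df['location'].astype(str)

# Sort by the NFD-normalized strings without changing the stored values.
new_df = (
    df.sort_values('location', key=lambda s: s.str.normalize('NFD'))
      .reset_index(drop=True)
)
print(new_df)
```

On older pandas versions, df.loc[df['location'].str.normalize('NFD').sort_values().index] should give the same row order.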
Recommended answer
You can use the Unicode NFD normal form:
>>> names = pd.Series(['Österreich', 'Ost', 'S', 'N'])
>>> names.str.normalize('NFD').sort_values()
3 N
1 Ost
0 Österreich
2 S
dtype: object
# use the sorted index to rearrange a dataframe that shares this index
>>> df.loc[names.str.normalize('NFD').sort_values().index]
This is not quite what you wanted: with NFD, Ö sorts after every word that starts with a plain O followed by an ASCII letter, so Ozean comes before Öl. Proper ordering requires language knowledge (such as the locale you mentioned).
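As an illustration of what such language knowledge could look like without relying on an installed locale, here is a rough sketch. german_sort_key is a hypothetical helper, not part of pandas or of the original answer; it assumes the dictionary-style convention in which umlauts sort like their base vowels and ß sorts like ss.

```python
import unicodedata

import pandas as pd


def german_sort_key(text: str) -> tuple:
    """Hypothetical helper: approximate German dictionary order."""
    decomposed = unicodedata.normalize('NFD', text)
    # Drop combining marks such as U+0308 so that 'Ö' compares like 'O'.
    base = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() lowercases and also maps 'ß' to 'ss'.
    # The original string is only a tie-breaker, e.g. for 'Ol' vs 'Öl'.
    return (base.casefold(), text)


names = pd.Series(['Zürich', 'Ozean', 'Öl', 'Österreich', 'Ost', 'Bern'])

# Sort the index labels by the key of their values, then reorder the Series.
order = sorted(names.index, key=lambda i: german_sort_key(names.loc[i]))
result = names.loc[order]
print(result.tolist())
```

Where a German locale is installed, locale.strxfrm, as mentioned in the question, is the more complete option, since it follows the system's collation rules rather than this approximation.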
NFD represents an umlaut with two code points: Ö becomes O followed by the combining diaeresis U+0308, which is O\xcc\x88 in UTF-8 (you can see the difference with names.str.normalize('NFD').str.encode('utf-8')).
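The decomposition can be checked with the standard library alone. This small sketch compares the composed (NFC) and decomposed (NFD) forms of Ö; the variable names are illustrative only.

```python
import unicodedata

composed = unicodedata.normalize('NFC', '\u00d6')    # 'Ö' as a single code point
decomposed = unicodedata.normalize('NFD', '\u00d6')  # 'O' + combining diaeresis

print(len(composed), composed.encode('utf-8'))      # 1 b'\xc3\x96'
print(len(decomposed), decomposed.encode('utf-8'))  # 2 b'O\xcc\x88'

# The decomposed form starts with a plain 'O', which is why it sorts among the O-words.
print(decomposed[0] == 'O')                         # True
```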