Sorting pandas dataframe with German Umlaute

Problem description

I have a dataframe which I want to sort with sort_values on one column.

The problem is that some of the words start with a German umlaut.

For example: Österreich, Zürich.

With a plain sort these come out as Zürich, Österreich. The order should be Österreich, Zürich.

Ö should sort between N and O.

I have found out how to do this with Python lists using locale and strxfrm. Can I do this directly on the pandas dataframe?

Edit: Thank you. Stef's example worked quite well, but my real-life dataframe also contains numbers, and his version did not work with those, so I used alexey's idea. I did the following; it can probably be shortened:


import pandas as pd

df = pd.DataFrame({'location': ['Österreich','Zürich','Bern', 254345],'code':['ö','z','b', 'v']})

#create index as column for joining later
df = df.reset_index(drop=False)

#convert int to str
df['location']=df['location'].astype(str)

#sort by location with umlaute
df_sort_index = df['location'].str.normalize('NFD').sort_values(ascending=True).reset_index(drop=False)

#drop location so we dont have it in both tables
df = df.drop('location', axis=1)

#inner join on index
new_df = pd.merge(df_sort_index, df, how='inner', on='index')

#drop index as column
new_df = new_df.drop('index', axis=1)

Recommended answer

You can use the Unicode NFD normal form:

>>> names = pd.Series(['Österreich', 'Ost', 'S', 'N'])
>>> names.str.normalize('NFD').sort_values()
3              N
1            Ost
0    Österreich
2              S
dtype: object

# use the sorted index to reorder the rows of a dataframe
>>> df.loc[names.str.normalize('NFD').sort_values().index]
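
The same reordering can probably be done in one step with the key argument of sort_values, which is available in pandas 1.1 and later. The following is only a sketch under that assumption: the helper name nfd_key is made up for illustration, and the sample data is taken from the question. Casting to str inside the key lets the mixed column with the number sort without changing the stored values.

```python
import pandas as pd

# Sample data from the question; escapes keep the umlauts in composed form.
df = pd.DataFrame({
    "location": ["\u00d6sterreich", "Z\u00fcrich", "Bern", 254345],
    "code": ["\u00f6", "z", "b", "v"],
})


def nfd_key(column: pd.Series) -> pd.Series:
    """Hypothetical helper: build the sort key without touching the data."""
    # Cast to str so the integer entry can be compared with the strings,
    # then decompose umlauts into base letter + combining mark.
    return column.astype(str).str.normalize("NFD")


# The key is only used for ordering; the original values stay in the frame.
sorted_df = df.sort_values("location", key=nfd_key)
print(sorted_df)
```

Because the key function only influences the ordering, no reset_index, merge, or drop is needed afterwards.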

It's not quite what you wanted, but for proper ordering you need language knowledge (like the locale you mentioned).
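
For locale-aware ordering, one possible sketch combines locale.strxfrm with the key argument of sort_values (pandas 1.1 or later). It assumes a German locale named de_DE.UTF-8 is installed on the system, which is not guaranteed; the function name sort_by_locale is hypothetical. If the locale is missing, the sketch silently falls back to whatever collation is currently active, so the result may then match a plain sort.

```python
import locale

import pandas as pd


def sort_by_locale(df, column, locale_name="de_DE.UTF-8"):
    """Hypothetical helper: sort df by column using locale collation."""
    # Remember the current collation setting so it can be restored.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # Assumed locale is not installed: keep the current collation.
        pass
    try:
        # strxfrm turns each string into a key that compares in locale order.
        return df.sort_values(
            column, key=lambda s: s.astype(str).map(locale.strxfrm)
        )
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


df = pd.DataFrame({"location": ["Z\u00fcrich", "\u00d6sterreich", "Bern"]})
result = sort_by_locale(df, "location")
print(result)
```

Changing the locale affects the whole process, which is why the sketch restores the previous setting before returning.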

NFD represents an umlaut with two code points, e.g. Ö becomes O followed by a combining diaeresis, which is O\xcc\x88 in UTF-8 (you can see the difference with names.str.normalize('NFD').str.encode('utf-8')).
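
The decomposition can be checked outside pandas with the standard library's unicodedata module. This is a small illustration, not part of the original answer; the word list is made up from the examples in this thread.

```python
import unicodedata

composed = "\u00d6"  # Ö as a single code point
decomposed = unicodedata.normalize("NFD", composed)

# One code point before normalization, two afterwards.
print(len(composed), len(decomposed))

# UTF-8 bytes: the letter O followed by the combining diaeresis (U+0308).
print(decomposed.encode("utf-8"))

# Sorting on the NFD form places the umlaut word among the O words,
# although still after every word that starts with a plain "O" + ASCII letter.
words = ["Z\u00fcrich", "\u00d6sterreich", "Ost"]
ordered = sorted(words, key=lambda w: unicodedata.normalize("NFD", w))
print(ordered)
```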
