Custom sorting in pyspark dataframes
Question
Are there any recommended methods for implementing custom sort ordering for categorical data in pyspark? Ideally, I'm looking for the functionality the pandas categorical data type offers.
So, given a dataset with a Speed column, the possible options are ["Super Fast", "Fast", "Medium", "Slow"]. I want to implement a custom sort order that fits the context.
If I use the default sorting, the categories are sorted alphabetically. Pandas allows changing the column data type to categorical, and part of that definition specifies a custom sort order: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Categorical.html
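For reference, this is the pandas behavior the question describes: declaring an ordered Categorical makes sorting follow the declared category order instead of alphabetical order. (The sample values here are taken from the question.)

```python
import pandas as pd

# An ordered Categorical sorts by the declared category order,
# not alphabetically.
speeds = pd.Series(["Medium", "Slow", "Super Fast", "Fast"])
ordered = speeds.astype(
    pd.CategoricalDtype(
        categories=["Super Fast", "Fast", "Medium", "Slow"], ordered=True
    )
)
print(ordered.sort_values().tolist())
# ['Super Fast', 'Fast', 'Medium', 'Slow']
```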
Answer
You can use orderBy and define your custom ordering using when:
from pyspark.sql.functions import col, when
df.orderBy(when(col("Speed") == "Super Fast", 1)
.when(col("Speed") == "Fast", 2)
.when(col("Speed") == "Medium", 3)
.when(col("Speed") == "Slow", 4)
)