Sorting in Dask


Question

I want to find an alternative to the pandas.DataFrame.sort_values function in Dask.
I came across set_index, but it sorts on a single column only.

How can I sort a Dask DataFrame by multiple columns?

Recommended answer

At the time of writing, Dask does not seem to support sorting by multiple columns. However, creating a new column that concatenates the values of the columns to sort by may be a usable workaround.

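# Build a composite string key from the two columns, row by row.
d['new_column'] = d.apply(lambda r: str([r.col1, r.col2]), axis=1,
                          meta=('new_column', 'object'))
# set_index sorts the data across partitions by the composite key.
d = d.set_index('new_column')
d = d.map_partitions(lambda x: x.sort_index())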
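
To see why a composite string key can stand in for a multi-column sort, and where it stops working, here is a minimal plain-Python sketch. It does not use Dask; the sample rows and the helper name composite_key are made up for illustration.

```python
# Plain-Python illustration of the composite-key idea (no Dask needed).
# The rows and the helper name are hypothetical examples.
rows = [("pear", "x"), ("apple", "z"), ("apple", "b"), ("fig", "a")]


def composite_key(row):
    # Same construction as in the answer: str([col1, col2]).
    return str([row[0], row[1]])


by_composite = sorted(rows, key=composite_key)
by_tuple = sorted(rows)  # true two-column ordering

# For simple alphanumeric strings both orderings agree.
print(by_composite == by_tuple)

# Caveat: characters that sort below the quote character (such as a space)
# can change the order, because the key compares the closing quote of the
# shorter value against the next character of the longer one.
tricky = [("a b", "x"), ("a", "y")]
print(sorted(tricky, key=composite_key) == sorted(tricky))
```

If the column values may contain spaces or punctuation, a fixed-width or delimiter-safe encoding of each column is a safer basis for the key.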

Edit:
The above works if you want to sort by two strings. For other types, I recommend creating integer (or bytes) columns and then using struct.pack to build a new composite bytes column. For example, if col1_dt is a datetime and col2 is an integer:

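import struct

import numpy as np

# Convert the datetime to whole seconds since the earliest timestamp.
# This assumes one-second resolution is fine enough for the data.
d['col1_int'] = ((d['col1_dt'] -
                  d['col1_dt'].min())/np.timedelta64(1,'s')
                ).astype(int)

# Pack both integers big-endian (">qq") so that byte order matches numeric
# order; this assumes col2 is non-negative. axis=1 applies the function per row.
d['new_column'] = d.apply(
    lambda r: struct.pack(">qq", int(r.col1_int), int(r.col2)),
    axis=1, meta=('new_column', 'object'))
d = d.set_index('new_column')
d = d.map_partitions(lambda x: x.sort_index())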
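
Byte strings compare lexicographically, so a packed key only reproduces numeric order when the integers are packed big-endian and are non-negative. The sketch below checks this with the standard library alone; the sample pairs and helper names are made up for illustration.

```python
import struct

# Hypothetical (col1_int, col2) pairs, all non-negative.
pairs = [(300, 1), (2, 70000), (2, 5), (256, 0), (1, 999)]


def pack_big_endian(pair):
    # ">qq": big-endian, two signed 64-bit integers, always 16 bytes.
    return struct.pack(">qq", *pair)


def pack_little_endian(pair):
    # "<qq": same layout but little-endian, shown only for contrast.
    return struct.pack("<qq", *pair)


by_tuple = sorted(pairs)                       # true two-column ordering
by_big = sorted(pairs, key=pack_big_endian)    # sort by packed bytes
by_little = sorted(pairs, key=pack_little_endian)

print(by_big == by_tuple)     # big-endian keys preserve the numeric order
print(by_little == by_tuple)  # little-endian keys do not, for this data
```

A format string without a byte-order prefix, such as "ll", uses the machine's native byte order and size, which is little-endian on most common hardware, so an explicit ">" prefix is the safer choice for sort keys.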
