Sorting in pandas for large datasets

Question

I would like to sort my data by a given column, specifically p-values. However, I am not able to load my entire dataset into memory. Thus, the following doesn't work, or rather works only for small datasets.

data = data.sort_values(by=["P_VALUE"], ascending=True, axis=0)

Is there a quick way to sort my data by a given column that only takes chunks into account and doesn't require loading entire datasets in memory?

Recommended answer

In the past, I've used Linux's venerable pair of sort and split utilities to sort massive files that choked pandas.

I don't want to disparage the other answer on this page. However, since your data is in text format (as you indicated in the comments), I think it's a tremendous complication to start transferring it into other formats (HDF, SQL, etc.) for something that GNU/Linux utilities have been solving very efficiently for the last 30-40 years.

Say your file is called stuff.csv, and looks like this:

4.9,3.0,1.4,0.6
4.8,2.8,1.3,1.2

Then the following command will sort it by the 3rd column:

sort --parallel=8 -t , -nrk3,3 stuff.csv

Note that the number of threads here is set to 8.
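
For comparison, here is a minimal Python sketch of what such a command is meant to compute for a comma-separated file that fits in memory: compare the third field numerically and put the largest values first. The function name `sort_csv_by_column` is hypothetical, and the sketch assumes there is no header line and that every row has a numeric value in the chosen column.

```python
import csv
import io


def sort_csv_by_column(lines, column=2, descending=True):
    """Return CSV rows sorted numerically by a zero-based column index."""
    rows = list(csv.reader(lines))
    # float() plays the role of sort's -n for plain decimals;
    # reverse=True plays the role of -r.
    return sorted(rows, key=lambda row: float(row[column]), reverse=descending)


# The two sample rows from stuff.csv, ordered by their third field.
sample = io.StringIO("4.9,3.0,1.4,0.6\n4.8,2.8,1.3,1.2\n")
for row in sort_csv_by_column(sample):
    print(",".join(row))
```

This only illustrates the ordering. It loads every row at once, so it has the same memory limit as the pandas call in the question.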

The above will work with files that fit into main memory. When your file is too large, you would first split it into a number of parts. So

split -l 100000 stuff.csv stuff

would split the file into files of length at most 100000 lines.
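
The splitting step can also be sketched in Python for readers who want to see what it does. This is an illustration only, not a replacement for split: the helper name `split_file` and the numbered output names (`stuff_0000`, `stuff_0001`, ...) are assumptions made for this example, and they differ from the `stuffaa`, `stuffab`, ... names that split generates.

```python
import itertools
import os
import tempfile


def split_file(path, lines_per_chunk, prefix):
    """Write consecutive pieces of `path`, each at most `lines_per_chunk` lines."""
    chunk_paths = []
    with open(path) as source:
        for index in itertools.count():
            # islice pulls the next block of lines without reading the whole file.
            chunk = list(itertools.islice(source, lines_per_chunk))
            if not chunk:
                break
            chunk_path = f"{prefix}{index:04d}"
            with open(chunk_path, "w") as out:
                out.writelines(chunk)
            chunk_paths.append(chunk_path)
    return chunk_paths


# Demo on a throwaway 5-line file, split into pieces of at most 2 lines.
workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "stuff.csv")
with open(demo_path, "w") as f:
    f.writelines(f"{i},{i * 2}\n" for i in range(5))
pieces = split_file(demo_path, 2, os.path.join(workdir, "stuff_"))
print([os.path.basename(p) for p in pieces])
```

Only one chunk is held in memory at a time, so `lines_per_chunk` is the knob that bounds memory use.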

Now you would sort each file individually, as above. Finally, you would use mergesort, again through (wait for it...) sort:

sort -m -t , -nrk3,3 sorted_stuff_* > final_sorted_stuff.csv
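
To make the sort-then-merge idea concrete, here is a hedged Python sketch of the same two steps using only the standard library. The helper names (`sort_key`, `sort_chunk`, `merge_sorted`) and the demo file names are assumptions for illustration. It assumes comma-separated lines with a numeric third field, sorted largest first, and that every line ends with a newline.

```python
import heapq
import os
import tempfile


def sort_key(line, column=2):
    """Numeric value of one comma-separated column (zero-based index)."""
    return float(line.split(",")[column])


def sort_chunk(path, column=2):
    """Sort one chunk in memory and write it next to the input as sorted_<name>."""
    with open(path) as f:
        lines = sorted(f, key=lambda line: sort_key(line, column), reverse=True)
    directory, name = os.path.split(path)
    sorted_path = os.path.join(directory, "sorted_" + name)
    with open(sorted_path, "w") as out:
        out.writelines(lines)
    return sorted_path


def merge_sorted(paths, out_path, column=2):
    """Stream-merge chunks that are already sorted in the same order."""
    files = [open(p) for p in paths]
    try:
        with open(out_path, "w") as out:
            # heapq.merge is lazy: it keeps roughly one pending line per chunk.
            merged = heapq.merge(
                *files, key=lambda line: sort_key(line, column), reverse=True
            )
            out.writelines(merged)
    finally:
        for f in files:
            f.close()


# Demo with two tiny hypothetical chunks.
workdir = tempfile.mkdtemp()
chunks = {"stuff_aa": "1,1,0.5,x\n1,1,2.5,x\n", "stuff_ab": "1,1,1.5,x\n1,1,3.5,x\n"}
sorted_paths = []
for name, text in chunks.items():
    path = os.path.join(workdir, name)
    with open(path, "w") as f:
        f.write(text)
    sorted_paths.append(sort_chunk(path))
final_path = os.path.join(workdir, "final_sorted_stuff.csv")
merge_sorted(sorted_paths, final_path)
with open(final_path) as f:
    print(f.read(), end="")
```

The merge must use the same key and direction as the per-chunk sort, otherwise the output is not globally ordered. With very many chunks, the number of simultaneously open files can become the limiting factor.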


Finally, if your file is not in CSV (say it is a tgz file), then you should find a way to pipe a CSV version of it into split.
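
As one possible way to do that from Python, the sketch below streams the lines of a CSV member out of a gzip-compressed tar archive without unpacking it to disk. The function name `iter_csv_lines_from_tgz` and the member name `stuff.csv` are assumptions for this example, and UTF-8 text is assumed. The lines it yields could then be written out in chunks or fed to another process.

```python
import io
import os
import tarfile
import tempfile


def iter_csv_lines_from_tgz(tgz_path, member_name):
    """Yield the text lines of one archive member, read as a stream."""
    with tarfile.open(tgz_path, "r:gz") as archive:
        member = archive.extractfile(member_name)
        if member is None:
            raise ValueError(f"{member_name} is not a regular file in the archive")
        with io.TextIOWrapper(member, encoding="utf-8") as text:
            yield from text


# Demo: build a small throwaway archive, then stream it back.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "stuff.csv")
with open(csv_path, "w") as f:
    f.write("4.9,3.0,1.4,0.6\n4.8,2.8,1.3,1.2\n")
tgz_path = os.path.join(workdir, "stuff.tgz")
with tarfile.open(tgz_path, "w:gz") as archive:
    archive.add(csv_path, arcname="stuff.csv")

lines = list(iter_csv_lines_from_tgz(tgz_path, "stuff.csv"))
print(len(lines), "lines read")
```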
