Can dask be used to groupby and recode out of core?


Problem description

I have 8GB CSV files and 8GB of RAM. Each file has two strings per row, in this form:

a,c
c,a
f,g
a,c
c,a
b,f
c,a

For smaller files I remove duplicates, counting how many copies of each row there were in the first two columns, and then recode the strings to integers as follows:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
# The file has no header; prefix="ID_" names the columns ID_0 and ID_1.
df = pd.read_csv("file.txt", header=None, prefix="ID_")

# Perform the groupby (before converting letters to digits).
df = df.groupby(['ID_0', 'ID_1']).size().rename('count').reset_index()

# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df[['ID_0', 'ID_1']].values.flat)

# Convert to digits.
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)

This gives:

   ID_0  ID_1  count
0     0     1      2
1     1     0      3
2     2     4      1
3     4     3      1

which is exactly what I need for this toy example.

For the larger file I can't take these steps because of lack of RAM.

I can imagine it is possible to combine unix sort with a bespoke Python solution that makes multiple passes over the data (something like the sketch below). But someone suggested dask might be suitable; having read the docs I am still not clear.
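
For illustration, a minimal sketch of that sort-then-count idea might look like the following. It assumes the rows have already been sorted out of core (e.g. with the unix sort command) into a hypothetical file_sorted.txt, so identical rows are adjacent and a single streaming pass suffices:

import itertools

# Assumes the rows were already sorted out of core, e.g.
#   sort file.txt > file_sorted.txt   (file name is just an example)
# so identical rows are adjacent and one pass can count each run.
with open("file_sorted.txt") as fp:
    for row, group in itertools.groupby(line.rstrip() for line in fp):
        print(row, sum(1 for _ in group))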

Can dask be used to do this sort of out-of-core processing, or is there some other out-of-core pandas solution?

Recommended answer

Assuming that the grouped dataframe fits in memory, the changes you would have to make to your code should be pretty minor. Here's my attempt:

import pandas as pd
from dask import dataframe as dd
from sklearn.preprocessing import LabelEncoder

# Import the data as a dask dataframe, ~100 MB per partition.
# Note that at this point no data is read yet; dask will only read the
# file once compute() is called.
df = dd.read_csv("file.txt", header=None, prefix="ID_", blocksize=100000000)

# Perform the groupby (before converting letters to digits).
# For better understanding, let's split this into two parts:
#     (i) define the (lazy) groupby/size operation on the dask dataframe
#     (ii) call compute(), which executes it and returns an ordinary pandas
#          Series that we can use for further analysis
pandas_df = df.groupby(['ID_0', 'ID_1']).size().compute()
pandas_df = pandas_df.rename('count').reset_index()

# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(pandas_df[['ID_0', 'ID_1']].values.flat)

# Convert to digits.
pandas_df[['ID_0', 'ID_1']] = pandas_df[['ID_0', 'ID_1']].apply(le.transform)

One possible solution in pandas would be to read the file in chunks (passing the chunksize argument to read_csv), run the groupby on each chunk, and combine the results.
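
A minimal sketch of that chunked approach might look like this (the chunk size of one million rows and the summing of per-chunk counts are assumptions; only the distinct (ID_0, ID_1) pairs and their counts are held in memory, not the whole file):

import pandas as pd

# Read the csv in chunks, count pairs within each chunk, then combine
# the partial counts by summing them per (ID_0, ID_1) pair.
chunks = pd.read_csv("file.txt", header=None, prefix="ID_", chunksize=10**6)
partial = [chunk.groupby(['ID_0', 'ID_1']).size() for chunk in chunks]
counts = (pd.concat(partial)
            .groupby(level=['ID_0', 'ID_1']).sum()
            .rename('count')
            .reset_index())

The label-encoding step shown above can then be applied to counts unchanged.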

Here's how you can solve the problem in pure Python:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stream the file line by line, counting each (id1, id2) pair.
counts = {}
with open('data') as fp:
    for line in fp:
        id1, id2 = line.rstrip().split(',')
        counts[(id1, id2)] = 1 + counts.get((id1, id2), 0)

df = pd.DataFrame(data=[(k[0], k[1], v) for k, v in counts.items()],
                  columns=['ID_0', 'ID_1', 'count'])
# apply label encoding etc.
le = LabelEncoder()
le.fit(df[['ID_0', 'ID_1']].values.flat)

# Convert to digits.
df[['ID_0', 'ID_1']] = df[['ID_0', 'ID_1']].apply(le.transform)
