Dask 2.1.0, KeyError: 'Column not found: 0'
Problem description
I am reading in large CSV data files using dask and attempting to perform a groupby on the resulting dataframe. However, I keep receiving
KeyError: 'Column not found: 0'
on the resulting dask dataframe.
I have replicated the problem on both Dask 1.2.2 and 2.1.0. I do not see the problem with Pandas on the same dataframe. I am using Python 3.6 in all cases.
To help illustrate the problem, I have simplified the code and replicated the problem on a much simpler dataset.
import pandas as pd
from dask import dataframe as dd
from dask import multiprocessing
from dask.distributed import Client
client = Client(processes=False)
data = {
'col1': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'col2': ['apple','bananna','orange','apple','bananna','orange','apple','bananna','orange'],
'col3': [34, 12, 1, 36, 22, 6, 22, 16, 4]
}
pdf = pd.DataFrame(data=data)
print('************* Pandas DataFrame')
print(pdf.head(5))
print('')
print('Performing groupby on Pandas DataFrame')
pgroup = pdf.groupby(by='col2')
for name, group in pgroup:
    print('')
    print(f'Group: {name}')
    print(group.head(5))
print(' ')
print(' ')
ddf = dd.from_pandas(data=pdf, npartitions=1)
print('************* Dask DataFrame')
print(ddf.head(5))
print('')
print('Performing groupby on Dask DataFrame')
dgroup = ddf.groupby(by='col2')
for name, group in dgroup:
    print('')
    print(f'Group: {name}')
    print(group.head(5))
I expected the dask dataframe to produce the same groupby result as Pandas. However, I received the following output and error:
************* Pandas DataFrame
col1 col2 col3
0 1 apple 34
1 1 bananna 12
2 1 orange 1
3 2 apple 36
4 2 bananna 22
Performing groupby on Pandas DataFrame
Group: apple
col1 col2 col3
0 1 apple 34
3 2 apple 36
6 3 apple 22
Group: bananna
col1 col2 col3
1 1 bananna 12
4 2 bananna 22
7 3 bananna 16
Group: orange
col1 col2 col3
2 1 orange 1
5 2 orange 6
8 3 orange 4
************* Dask DataFrame
col1 col2 col3
0 1 apple 34
1 1 bananna 12
2 1 orange 1
3 2 apple 36
4 2 bananna 22
Performing groupby on Dask DataFrame
Traceback (most recent call last):
File "C:\Users\Craig\source\repos\cevans3098\MarketData_preProcessor\module1.py", line 37, in <module>
for name, group in dgroup:
File "F:\anaconda3\lib\site-packages\dask\dataframe\groupby.py", line 1525, in __getitem__
g._meta = g._meta[key]
File "F:\anaconda3\lib\site-packages\pandas\core\base.py", line 275, in __getitem__
raise KeyError("Column not found: {key}".format(key=key))
KeyError: 'Column not found: 0'
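Recommended answer
DataFrameGroupBy.__iter__ isn't implemented for Dask Dataframe yet: https://github.com/dask/dask/issues/5124
The traceback is consistent with this: when an object defines __getitem__ but not __iter__, a Python for loop falls back to calling __getitem__ with 0, 1, 2, ..., so the groupby object is asked for a column named 0.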