ValueError: The columns in the computed data do not match the columns in the provided metadata
Question
I am working on a dataset with 5.5 million rows in a Kaggle competition. Reading the .csv and processing it takes hours in pandas.
This is where Dask comes in. Dask is fast, but I am running into many errors.
This is the code snippet:
#drop some columns
df = df.drop(['dropoff_latitude', 'dropoff_longitude','pickup_latitude', 'pickup_longitude', 'pickup_datetime' ], axis=1)
# In[ ]:
#one-hot-encode cat columns
df = dd.get_dummies(df.categorize())
# In[ ]:
#split train and test and export as csv
test_df = df[df['fare_amount'] == -9999]
train_df = df[df['fare_amount'] != -9999]
test_df.to_csv('df_test.csv')
train_df.to_csv('df_train.csv')
Running the lines
test_df.to_csv('df_test.csv')
train_df.to_csv('df_train.csv')
produces the error
ValueError: The columns in the computed data do not match the columns
in the provided metadata
What could cause this, and how can I stop it?
N.B. This is my first time using Dask.
Recommended answer
The docstring describes how this situation can arise when reading from CSV. Most likely, if you had run len(dd.read_csv(...)), you would have seen the error already, without the drop, get_dummies, and train/test split steps. The error message probably tells you exactly which column(s) are the problem, and what type was expected versus what was found.
What happens is that Dask guesses the dtypes of the dataframe from the first block of the first file. Sometimes this does not reflect the types throughout the whole dataset: for example, if a column happens to have no values in the first block, its type will be float64, because pandas uses nan as a NULL placeholder. In such cases, you want to determine the correct dtypes and supply them to read_csv using the dtype= keyword. See the pandas documentation for the typical use of dtype= and other arguments for data parsing/conversion that might help at load time.
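The effect described above can be reproduced with plain pandas by reading a small CSV in chunks, which roughly mimics how each block is parsed on its own. This is only a sketch under assumptions: the column names and values are made up for illustration, and chunked pandas reading stands in for Dask's block-wise reading. The same dtype= keyword is accepted by dd.read_csv.

```python
import io

import pandas as pd

# Hypothetical data: passenger_count is empty in the first three rows only.
CSV_TEXT = """fare_amount,passenger_count
4.5,
7.0,
5.5,
9.0,2
6.5,3
8.0,1
"""


def chunk_dtypes(dtype=None, chunksize=3):
    """Return the dtype pandas assigns to passenger_count in each chunk."""
    reader = pd.read_csv(io.StringIO(CSV_TEXT), dtype=dtype, chunksize=chunksize)
    return [str(chunk["passenger_count"].dtype) for chunk in reader]


# Without dtype=, each chunk is inferred independently, so the types disagree.
inferred = chunk_dtypes()

# With an explicit dtype, every chunk is parsed the same way.
fixed = chunk_dtypes(dtype={"passenger_count": "float64"})

print("inferred per chunk:", inferred)
print("with dtype=:       ", fixed)
```

Here float64 is chosen because it can hold both the integers and the missing values; which dtype is right for a real column depends on the data.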