ValueError: The columns in the computed data do not match the columns in the provided metadata


Question

I am working on a dataset with 5.5 million rows in a Kaggle competition. Reading the .csv and processing it takes hours in pandas.

This is where Dask comes in. Dask is fast, but I run into many errors.

Here is the code snippet:

import dask.dataframe as dd

# df is assumed to be a dask DataFrame loaded earlier (the loading step is not shown)

# drop some columns
df = df.drop(['dropoff_latitude', 'dropoff_longitude', 'pickup_latitude', 'pickup_longitude', 'pickup_datetime'], axis=1)

# one-hot-encode categorical columns
df = dd.get_dummies(df.categorize())

# split train and test and export as csv
test_df = df[df['fare_amount'] == -9999]
train_df = df[df['fare_amount'] != -9999]

test_df.to_csv('df_test.csv')
train_df.to_csv('df_train.csv')

Running the lines

test_df.to_csv('df_test.csv')
train_df.to_csv('df_train.csv')

produces the error

ValueError: The columns in the computed data do not match the columns
in the provided metadata

What could cause this, and how can I stop it?

N.B. This is my first time using Dask.

Recommended answer

The docstring describes how this situation can arise when reading from CSV. If you had run len(dd.read_csv(...)), you would likely have seen the error already, without the drop, the dummies, or the train/test split. The error message probably tells you exactly which column(s) are the problem, and what type was expected versus what was found.

What happens is that dask guesses the dtypes of the dataframe from the first block of the first file. Sometimes this does not reflect the types throughout the whole dataset: for example, if a column happens to have no values in the first block, its type will be float64, because pandas uses NaN as a NULL placeholder. In such cases, you want to determine the correct dtypes and supply them to read_csv using the dtype= keyword. See the pandas documentation for the typical use of dtype= and other arguments for data parsing and conversion that might help at load time.
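The mechanism can be reproduced on a small scale with plain pandas. The sketch below is only an illustration, not your actual data: the `note` column and the CSV text are made up, and reading with `chunksize=2` stands in for the way each block has its types inferred separately. The column is empty in the first block and holds text in the second, so the inferred dtypes disagree until an explicit `dtype=` is given.

```python
import io

import pandas as pd

# Hypothetical data: 'note' is empty in the first two rows and text afterwards.
csv_text = "fare_amount,note\n1.0,\n2.0,\n3.0,cash\n4.0,card\n"

# Read two rows at a time; each block gets its own dtype inference.
inferred = [
    chunk.dtypes["note"]
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]
# The first block sees only missing values and falls back to float64,
# while the second block sees text, so the two blocks disagree.

# Supplying the dtype explicitly makes every block agree.
explicit = [
    chunk.dtypes["note"]
    for chunk in pd.read_csv(
        io.StringIO(csv_text), chunksize=2, dtype={"note": "object"}
    )
]

print("inferred per block:", inferred)
print("explicit per block:", explicit)
```

The same `dtype=` mapping is what you would pass to `dd.read_csv`; the column names to list there are the ones named in your error message.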
