Reading csv with separator in python dask


Problem description

I am trying to create a DataFrame by reading a csv file separated by '#####' (5 hashes).

The code is:

import dask.dataframe as dd

# Raw string so the backslash in the Windows path is not treated
# as an escape ('\t' would otherwise become a tab character)
df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python')
res = df.compute()

The error is:

dask.async.ValueError:
Dask dataframe inspected the first 1,000 rows of your csv file to guess the
data types of your columns.  These first 1,000 rows led us to an incorrect
guess.

For example a column may have had integers in the first 1000
rows followed by a float or missing value in the 1,001-st row.

You will need to specify some dtype information explicitly using the
``dtype=`` keyword argument for the right column names and dtypes.

    df = dd.read_csv(..., dtype={'my-column': float})

Pandas has given us the following error when trying to parse the file:

  "The 'dtype' option is not supported with the 'python' engine"

Traceback
 ---------
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/dataframe/io.py", line 69, in _read_csv
raise ValueError(msg)

So how can I get rid of this error?

If I follow the error message, I would have to give a dtype for every column, but with 100+ columns that is impractical.
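One way to avoid typing out a dtype for 100+ columns by hand is to read only the header and build the mapping programmatically. A minimal sketch, using an in-memory string as a stand-in for the real file (the sample data and column names here are made up):

```python
import io
import pandas as pd

# In-memory stand-in for the real '#####'-separated CSV file
raw = "a#####b#####c\n1#####2#####3\n"

# nrows=0 reads just the header row, giving the column names
cols = pd.read_csv(io.StringIO(raw), sep="#####", engine="python", nrows=0).columns

# Build one dtype entry per column instead of writing them by hand
dtypes = {c: "object" for c in cols}
```

The resulting dict can then be passed as `dtype=dtypes` to `read_csv`.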

And if I read without the separator, then everything goes fine, but there is ##### everywhere. So after computing it to a pandas DataFrame, is there a way to get rid of that?
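For reference, that post-hoc cleanup can be sketched as follows: when the file is read without the separator, each line lands in a single string column, which can then be split on the 5 hashes. A minimal pandas sketch with made-up sample data:

```python
import io
import pandas as pd

# Made-up sample: reading without a separator leaves one string column
raw = "a#####b#####c\n1#####2#####3\n4#####5#####6\n"
df = pd.read_csv(io.StringIO(raw), header=None)

# Split the single column on the 5-hash separator into real columns
split = df[0].str.split("#####", expand=True)

# Promote the first row (the original header line) to column names
split.columns = split.iloc[0]
split = split.iloc[1:].reset_index(drop=True)
```

Note that all values come back as strings, so a dtype conversion step would still be needed afterwards.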

Any help on this would be appreciated.

Recommended answer

Read the entire file in with dtype='object', meaning all columns will be interpreted as type object. This should read in correctly, getting rid of the ##### in each row. From there you can turn it into a pandas frame using the compute() method. Once the data is in a pandas frame, you can use the pandas infer_objects method to update the types without having to hard-code them.

import dask.dataframe as dd

# Raw string avoids the '\t' escape in the Windows path; dtype='object'
# skips dtype guessing, and the multi-character separator needs the
# 'python' parser engine
df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python', dtype='object').compute()
res = df.infer_objects()
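The same pattern can be shown end to end with plain pandas, since after compute() the data is an ordinary pandas frame anyway. A minimal sketch with made-up in-memory data standing in for D:\temp.csv:

```python
import io
import pandas as pd

# Made-up sample with the 5-hash separator
raw = "a#####b#####c\n1#####2#####x\n3#####4#####y\n"

# dtype='object' disables dtype guessing entirely; the multi-character
# separator forces the 'python' parser engine
df = pd.read_csv(io.StringIO(raw), sep="#####", engine="python", dtype="object")

# infer_objects() then soft-converts object columns to better dtypes
# where the underlying values allow it
res = df.infer_objects()
```

One caveat on this design: infer_objects only upgrades object columns whose values are already the right Python type; columns that contain number-like strings stay object, so pd.to_numeric may still be needed for those.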

