Reading csv with separator in python dask
Question
I am trying to create a DataFrame by reading a csv file separated by '#####' (5 hashes).
The code is:
import dask.dataframe as dd
# Use a raw string so '\t' in the path is not read as a tab character
df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python')
res = df.compute()
The error is:
dask.async.ValueError:
Dask dataframe inspected the first 1,000 rows of your csv file to guess the
data types of your columns. These first 1,000 rows led us to an incorrect
guess.
For example a column may have had integers in the first 1000
rows followed by a float or missing value in the 1,001-st row.
You will need to specify some dtype information explicitly using the
``dtype=`` keyword argument for the right column names and dtypes.
df = dd.read_csv(..., dtype={'my-column': float})
Pandas has given us the following error when trying to parse the file:
"The 'dtype' option is not supported with the 'python' engine"
Traceback
---------
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/dataframe/io.py", line 69, in _read_csv
raise ValueError(msg)
So how do I get around this? If I follow the error's advice, I would have to specify a dtype for every column, but with 100+ columns that is impractical.
And if I read without the separator, everything goes fine, but then there is ##### everywhere. So after computing it to a pandas DataFrame, is there a way to get rid of that?
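(For the fallback described above, the joined column can also be split after the fact with plain pandas; a sketch on sample data, assuming the first row holds the header:)

```python
from io import StringIO

import pandas as pd

# Stand-in for a file read with no matching separator: each line lands in one field.
raw = StringIO('col_a#####col_b\n1#####2\n3#####4\n')
one_col = pd.read_csv(raw, header=None, names=['line'])

# Split every row on the hashes, then promote the first row to the header.
parts = one_col['line'].str.split('#####', expand=True)
parts.columns = parts.iloc[0]
parts = parts.drop(index=0).reset_index(drop=True)
print(parts)
```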
Please help me with this.
Recommended answer
Read the entire file in with dtype=object, meaning all columns will be interpreted as type object. This should read in correctly, getting rid of the ##### in each row. From there you can turn it into a pandas frame using the compute() method. Once the data is in a pandas frame, you can use the pandas infer_objects method to update the types without having to hard-code them.
import dask.dataframe as dd
# Raw string again, so the backslash in the path is kept literal
df = dd.read_csv(r'D:\temp.csv', sep='#####', dtype='object').compute()
res = df.infer_objects()
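One caveat worth knowing: infer_objects only soft-converts object columns whose values are already numeric Python objects. Values that arrive from a csv as strings stay object, in which case pd.to_numeric is the usual follow-up. A small sketch of both cases on made-up data:

```python
import pandas as pd

# Object columns holding real numbers are inferred back to numeric dtypes.
df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]}, dtype='object')
print(df.infer_objects().dtypes)  # a -> int64, b -> float64

# String values stay object under infer_objects; to_numeric handles those.
s = pd.Series(['1', '2'], dtype='object')
print(s.infer_objects().dtype)  # still object
print(pd.to_numeric(s).dtype)   # int64
```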