Mixed types when reading csv files. Causes, fixes and consequences

Question

What exactly happens when Pandas issues this warning? Should I worry about it?

In [1]: read_csv(path_to_my_file)
/Users/josh/anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/parsers.py:1139: 
DtypeWarning: Columns (4,13,29,51,56,57,58,63,87,96) have mixed types. Specify dtype option on import or set low_memory=False.              

  data = self._reader.read(nrows)

I assume that this means that Pandas is unable to infer the type from values on those columns. But if that is the case, what type does Pandas end up using for those columns?

Also, can the type always be recovered after the fact? (after getting the warning), or are there cases where I may not be able to recover the original info correctly, and I should pre-specify the type?

Finally, how exactly does low_memory=False fix the problem?

Recommended answer

Revisiting mbatchkarov's link, low_memory is not deprecated. It is now documented:

low_memory : boolean, default True

Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser)
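The dtype route mentioned in the documentation can be sketched as follows. This is a minimal illustration, not taken from the original answer: the column names `id` and `code` and the sample data are made up for the example.

```python
import pandas as pd
from io import StringIO

# Hypothetical sample data: "code" holds both numeric-looking and text values
csv_text = "id,code\n1,007\n2,abc\n3,42\n"

# Declaring the types up front means pandas does not have to infer them,
# so every value in "code" is kept as a string
df = pd.read_csv(StringIO(csv_text), dtype={"id": "int64", "code": str})

print(df.dtypes)
print(df.loc[0, "code"])
```

Because `code` is declared as a string column, the leading zeros in `007` are preserved instead of being parsed away as part of an integer.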

I asked what "resulting in mixed type inference" means, and chris-b1 answered:

It is deterministic - types are consistently inferred based on what's in the data. That said, the internal chunksize is not a fixed number of rows, but instead bytes, so whether you get a mixed dtype warning or not can feel a bit random.

So, what type does Pandas end up using for those columns?

This is answered by the following self-contained example:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

type(df.loc[524287,'0'])
Out[50]: int

type(df.loc[524288,'0'])
Out[51]: str

The first chunk of the csv data contained only integers, so it was converted to int. The second chunk also contained a string, so all of its entries were kept as strings.
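To see which Python types ended up in such a column, one option is to count them per value. The sketch below uses a hand-built Series as a stand-in for a column that triggered the warning, and `type_counts` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

def type_counts(column):
    """Count how many values of each Python type a column holds."""
    # Map each value to the name of its type, then tally the names
    return column.map(lambda value: type(value).__name__).value_counts()

# Stand-in for a mixed-type column: integers from one chunk, strings from another
mixed_column = pd.Series([1, 2, "3", "a string"], dtype=object)

counts = type_counts(mixed_column)
print(counts)
```

On a real DataFrame, the same helper could be applied to each column listed in the warning, for example `type_counts(df['0'])`.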

Can the type always be recovered after the fact (after getting the warning)?

I guess re-exporting to csv and re-reading with low_memory=False should do the job.
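A minimal sketch of that round trip, assuming the mixed column is small enough to re-export through an in-memory buffer. The DataFrame here is a hand-built stand-in for one that came back with mixed types.

```python
import pandas as pd
from io import StringIO

# Stand-in for a column that was read with mixed types (ints and strings)
mixed = pd.DataFrame({"0": pd.Series([1, 2, "3", "a string"], dtype=object)})

# Re-export to csv in memory, then rewind the buffer before reading it back
buffer = StringIO()
mixed.to_csv(buffer, index=False)
buffer.seek(0)

# Reading the whole column at once yields a single, consistent type
recovered = pd.read_csv(buffer, low_memory=False)
print(recovered["0"].tolist())
```

Note that this only makes the column consistent. Anything the first read already discarded when it converted a chunk to int, such as leading zeros, is not in the re-exported text and cannot be brought back this way, which is a case where specifying `dtype` up front is the safer choice.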

How exactly does low_memory=False fix the problem?

It reads all of the file before deciding the type, therefore needing more memory.
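As an illustration, the same data as in the example above can be read again with `low_memory=False`. This is a sketch of the expected behaviour rather than output copied from the original answer.

```python
import pandas as pd
from io import StringIO

# Same data as before: the first line becomes the header "0",
# followed by 999999 numeric lines and one text line
csv_text = '\n'.join([str(x) for x in range(1000000)] + ['a string'])

# The whole column is examined before a type is chosen, so the single
# text value causes every entry to be kept as a string
df = pd.read_csv(StringIO(csv_text), low_memory=False)

print(type(df.loc[0, '0']))
print(type(df.loc[len(df) - 1, '0']))
```

The trade-off is the one stated above: consistent types in exchange for holding more of the file in memory while parsing.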
