Get inferred dataframe types iteratively using chunksize
Question
How can I use pd.read_csv() to iteratively chunk through a file and retain the dtype and other meta-information as if I read in the entire dataset at once?
I need to read in a dataset that is too large to fit into memory. I would like to import the file using pd.read_csv and then immediately append the chunk into an HDFStore. However, the data type inference knows nothing about subsequent chunks.
If the first chunk stored in the table contains only int and a subsequent chunk contains a float, an exception will be raised. So I need to first iterate through the dataframe using read_csv and retain the highest inferred type. In addition, for object types, I need to retain the maximum length as these will be stored as strings in the table.
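The mismatch described above can be reproduced on a tiny in-memory file. This is only a sketch: the CSV contents and the column names `a` and `b` are made up for illustration, and `io.StringIO` stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical sample data: column "b" holds only integers in the first
# two rows and a float further down.
CSV = "a,b\n1,2\n3,4\n5,6.5\n7,8\n"

# Each chunk is type-inferred on its own, without looking at later rows.
chunk_dtypes = [
    chunk.dtypes.to_dict()
    for chunk in pd.read_csv(io.StringIO(CSV), chunksize=2)
]

for i, dtypes in enumerate(chunk_dtypes):
    print(i, dtypes)
```

Column `b` comes back as an integer column in the first chunk and as a float column in the second, which is the situation that trips up an append into an existing table.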
Is there a pandonic way of retaining only this information without reading in the entire dataset?
Recommended answer
I didn't think it would be this intuitive, otherwise I wouldn't have posted the question. But once again, pandas makes things a breeze. I'm keeping the question up anyway, since this information might be useful to others working with large data:
import pandas as pd
In [1]: chunker = pd.read_csv('DATASET.csv', chunksize=500, header=0)
# Store the dtypes of each chunk into a list and convert it to a dataframe:
In [2]: dtypes = pd.DataFrame([chunk.dtypes for chunk in chunker])
In [3]: dtypes.values[:5]
Out[3]:
array([[int64, int64, int64, object, int64, int64, int64, int64],
[int64, int64, int64, int64, int64, int64, int64, int64],
[int64, int64, int64, int64, int64, int64, int64, int64],
[int64, int64, int64, int64, int64, int64, int64, int64],
[int64, int64, int64, int64, int64, int64, int64, int64]], dtype=object)
# Very cool that I can take the max of these data types and it will preserve the hierarchy:
In [4]: dtypes.max().values
Out[4]: array([int64, int64, int64, object, int64, int64, int64, int64], dtype=object)
# I can now store the above into a dictionary:
types = dtypes.max().to_dict()
# And pass it into pd.read_csv for the second run over the same file
# (the first chunker is exhausted by now, so a new one is needed):
chunker = pd.read_csv('DATASET.csv', dtype=types, chunksize=500)
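The question also asks for the maximum string length of object columns, which the session above does not collect. A minimal sketch of a first pass that gathers both follows. It is not the method used above: it swaps `dtypes.max()` for an explicit `np.promote_types` call on numeric columns, and the helper name `scan_csv`, the sample data, and the column names are all hypothetical. It assumes lengths only matter for columns that are already non-numeric when they are seen; a column that starts out numeric and turns into text later would have its earlier chunks left out of the length count.

```python
import io
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def scan_csv(source, chunksize):
    """First pass: widest dtype per column and max string length of text columns."""
    dtypes, max_len = {}, {}
    for chunk in pd.read_csv(source, chunksize=chunksize):
        for col, dt in chunk.dtypes.items():
            seen = dtypes.get(col, dt)
            if is_numeric_dtype(dt) and is_numeric_dtype(seen):
                # e.g. int64 seen so far + float64 now -> float64
                dtypes[col] = np.promote_types(seen, dt)
                continue
            # Anything non-numeric is read back as plain Python objects.
            dtypes[col] = np.dtype(object)
            lengths = chunk[col].dropna().astype(str).str.len()
            if not lengths.empty:
                max_len[col] = max(max_len.get(col, 0), int(lengths.max()))
    return dtypes, max_len

# Hypothetical sample data standing in for the real file.
CSV = "a,b,c\n1,2,x\n3,4,yy\n5,6.5,zzz\n7,8,w\n"

dtypes, max_len = scan_csv(io.StringIO(CSV), chunksize=2)

# Second pass: every chunk is now parsed with the same, widest dtypes.
second_pass = [
    chunk.dtypes.to_dict()
    for chunk in pd.read_csv(io.StringIO(CSV), dtype=dtypes, chunksize=2)
]
```

When appending the second-pass chunks to an HDFStore table, a dictionary like `max_len` is the kind of value that the `min_itemsize` argument of `HDFStore.append` accepts, so the string columns can be sized once up front.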