Set data type for specific column when using read_csv from pandas
Question
I have a large csv file (~10GB) with around 4000 columns. I know that most of the data I expect is int8, so I set:
pandas.read_csv('file.dat', sep=',', engine='c', header=None,
na_filter=False, dtype=np.int8, low_memory=False)
The thing is, the final column (4000th position) is int32. Is there a way I can tell read_csv to use int8 by default and int32 for the 4000th column?
Thanks
Recommended answer
If you are certain of the number of columns, you could build the dtype dictionary like this:
dtype = dict(zip(range(4000),['int8' for _ in range(3999)] + ['int32']))
This works because read_csv accepts a per-column dtype mapping, as this small example shows:
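from io import StringIO

import pandas as pd

data = '''\
1,2,3
4,5,6'''
fileobj = StringIO(data)  # pd.compat.StringIO was removed in newer pandas releases
# With header=None the columns are labelled 0, 1, 2, so the dict keys are integers.
df = pd.read_csv(fileobj, dtype={0: 'int8', 1: 'int8', 2: 'int32'}, header=None)
print(df.dtypes)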
Returns:
0     int8
1     int8
2    int32
dtype: object
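If the exact column count is not known up front, one possible approach is to read only the first row to count the columns and then build the mapping from that. The following is a minimal sketch under that assumption; the sample data and the variable names `n_cols` and `dtype` are illustrative only and not part of the original answer.

```python
from io import StringIO

import pandas as pd

# Small stand-in for the real file: three int8 columns and a final int32 column.
data = "1,2,3,100000\n4,5,6,200000\n"

# Read just the first row to learn how many columns there are.
n_cols = pd.read_csv(StringIO(data), header=None, nrows=1).shape[1]

# Default every column to int8, then override the last one with int32.
dtype = {i: "int8" for i in range(n_cols)}
dtype[n_cols - 1] = "int32"

df = pd.read_csv(StringIO(data), header=None, dtype=dtype)
print(df.dtypes)
```

For a real file, the `StringIO(data)` arguments would be replaced by the file path; reading a single row with `nrows=1` keeps the counting step cheap even for a very large file.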
From the docs:
dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32}. Use str or object to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
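As a small illustration of the "use str to preserve and not interpret" part of that description, the sketch below mixes a `str` column with an `int8` column in the same mapping. The sample data is made up for the example.

```python
from io import StringIO

import pandas as pd

# First column has leading zeros that would be lost if parsed as integers.
data = "007,1\n042,2\n"

# Column 0 is kept as text, column 1 is parsed as int8.
df = pd.read_csv(StringIO(data), header=None, dtype={0: str, 1: "int8"})

print(df[0].tolist())
print(df[1].dtype)
```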