python - Using pandas structures with large csv (iterate and chunksize)
Problem description
I have a large csv file, about 600 MB with 11 million rows, and I want to create statistical data like pivots, histograms, graphs, etc. Obviously, just trying to read it normally:
df = pd.read_csv('Check400_900.csv', sep='\t')
doesn't work, so I found iterate and chunksize in a similar post, so I used:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
All good, I can for example print df.get_chunk(5) and iterate over the whole file with just:
for chunk in df:
    print(chunk)
My problem is that I don't know how to use things like those below on the whole df, not just on one chunk:
plt.plot()
print(df.head())
print(df.describe())
print(df.dtypes)
customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()
I hope my question is not too confusing.
Answer
I think you need to concat the chunks into a df, because the output type of:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
isn't a DataFrame, but a pandas.io.parsers.TextFileReader - source.
tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
# <pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)
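Once the chunks are concatenated, the whole-frame operations from the question work as usual. A minimal sketch, using a small in-memory stand-in for Check1_900.csv (the csv_data contents here are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical stand-in for Check1_900.csv
csv_data = "UserID\tAmount\n1\t10\n2\t20\n1\t30\n"

# Read in chunks, then concatenate into one regular DataFrame
tp = pd.read_csv(io.StringIO(csv_data), sep='\t', iterator=True, chunksize=2)
df = pd.concat(tp, ignore_index=True)

# Whole-frame operations from the question now work
print(df.describe())
customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()
print(y3)
```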
I think it is necessary to add the parameter ignore_index to the concat function, to avoid duplicate indexes.
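If the concatenated frame itself is too big for memory, one alternative (a sketch, not part of the answer above) is to aggregate each chunk separately and combine the partial results, e.g. with Series.add and fill_value so missing groups count as zero:

```python
import io
import pandas as pd

# Hypothetical stand-in for Check1_900.csv
csv_data = "UserID\tAmount\n1\t10\n2\t20\n1\t30\n3\t40\n2\t50\n"

reader = pd.read_csv(io.StringIO(csv_data), sep='\t', chunksize=2)

# Per-chunk groupby sizes, merged across chunks without loading everything
counts = pd.Series(dtype='int64')
for chunk in reader:
    counts = counts.add(chunk.groupby('UserID').size(), fill_value=0)

counts = counts.astype(int)
print(counts)
```

This works for additive statistics like counts and sums; something like describe() still needs the full data or a streaming algorithm.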