python - Using pandas structures with a large csv (iterate and chunksize)
Problem description
I have a large csv file, about 600 MB with 11 million rows, and I want to create statistical data such as pivots, histograms, graphs, etc. Obviously, just trying to read it normally:
df = pd.read_csv('Check400_900.csv', sep=' ')
doesn't work, so I found iterator and chunksize in a similar post and used:
df = pd.read_csv('Check1_900.csv', sep=' ', iterator=True, chunksize=1000)
All good. I can, for example, call print(df.get_chunk(5)) and go through the whole file with just:
for chunk in df:
    print(chunk)
My problem is that I don't know how to use things like the ones below on the whole df rather than on just one chunk:
plt.plot()
print(df.head())
print(df.describe())
print(df.dtypes)
customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()
I hope my question is not too confusing.
Recommended answer
Solution if you need to create one big DataFrame, that is, if you need to process all the data at once (which is possible, but not recommended):
Use concat to combine all the chunks into df, because the output of the function:
df = pd.read_csv('Check1_900.csv', sep=' ', iterator=True, chunksize=1000)
is not a DataFrame, but a pandas.io.parsers.TextFileReader.
tp = pd.read_csv('Check1_900.csv', sep=' ', iterator=True, chunksize=1000)
print(tp)
#<pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)
I think it is a good idea to pass the parameter ignore_index=True to concat, because it avoids duplicate index values in the result.
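As a minimal sketch of the concat approach above, the snippet below reads a tiny in-memory CSV in chunks and combines them into one DataFrame. The sample data and the Value column are made up for illustration and stand in for the real file; only the UserID column name comes from the question.

```python
import io
import pandas as pd

# Hypothetical stand-in for the large space-separated CSV file.
csv_text = "UserID Value\n1 10\n2 20\n1 30\n3 40\n2 50\n"

# With chunksize set, read_csv returns an iterator of DataFrames
# (a TextFileReader), not a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), sep=" ", chunksize=2)

# concat accepts any iterable of DataFrames; ignore_index=True gives the
# combined DataFrame a fresh 0..n-1 index.
df = pd.concat(reader, ignore_index=True)

# The usual whole-DataFrame operations now work.
print(df.head())
print(df.groupby("UserID").size())
```

Note that this still loads everything into memory, so it only helps when the complete DataFrame fits in RAM.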
But if you want to work with large data, for example to aggregate it, it is much better to use dask, because it provides advanced parallelism.
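If installing another library is not an option, an aggregation such as the groupby size from the question can also be computed chunk by chunk in plain pandas, so the full file never has to be in memory at once. This is only a sketch of that alternative, not the dask approach itself, and it again uses made-up sample data in place of the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the large space-separated CSV file.
csv_text = "UserID Value\n1 10\n2 20\n1 30\n3 40\n2 50\n"

reader = pd.read_csv(io.StringIO(csv_text), sep=" ", chunksize=2)

# Count rows per UserID inside each chunk; each result is a small Series.
partial_counts = [chunk.groupby("UserID").size() for chunk in reader]

# The same UserID can appear in several chunks, so stack the partial
# results and sum the counts that share an index label.
user_counts = pd.concat(partial_counts).groupby(level=0).sum()

print(user_counts)
```

This works for aggregations that can be merged from partial results (counts, sums, minima, maxima); statistics such as the median need a different strategy.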