Using pandas to efficiently read in a large CSV file without crashing


Question

I am trying to read a .csv file called ratings.csv from http://grouplens.org/datasets/movielens/20m/. The file is 533.4 MB on my computer.

This is what I am writing in jupyter notebook:

import pandas as pd
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')

The problem from here is that the kernel breaks or dies, asks me to restart, and keeps repeating the same thing. There is no error at all. Can you please suggest an alternative way of solving this? It is as if my computer had no capability of running this.

This works, but it keeps rewriting:

chunksize = 20000
for ratings in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    ratings.append(ratings)
ratings.head()

Only the last chunk is written; the others are written off.

Answer

You should consider using the chunksize parameter in read_csv when reading in your dataframe, because it returns a TextFileReader object that you can then pass to pd.concat to concatenate your chunks.

chunksize = 100000
tfr = pd.read_csv('./movielens/ratings.csv', chunksize=chunksize, iterator=True)
df = pd.concat(tfr, ignore_index=True)
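
Note that pd.concat still builds the entire DataFrame in memory, so if the kernel keeps dying you may also need to shrink the data itself. The sketch below is not part of the original answer: it assumes the usual MovieLens ratings.csv columns (userId, movieId, rating, timestamp) and reads only the columns you need with compact dtypes.

import pandas as pd

# Assumed MovieLens column layout: userId, movieId, rating, timestamp
dtypes = {'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'}

ratings = pd.read_csv('./movielens/ratings.csv',
                      usecols=['userId', 'movieId', 'rating'],
                      dtype=dtypes)
ratings.info(memory_usage='deep')  # check how much memory the frame actually uses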


If you just want to process each chunk individually, use:

chunksize = 20000
for chunk in pd.read_csv('./movielens/ratings.csv', 
                         chunksize=chunksize, 
                         iterator=True):
    do_something_with_chunk(chunk)
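
For illustration only, here is a minimal sketch of what do_something_with_chunk could look like; the function body and the per-movie counting are hypothetical (not part of the original answer) and assume the standard MovieLens column names. The point is that only a small running summary stays in memory instead of the whole file.

import pandas as pd
from collections import Counter

# Running tally of ratings per movie; only this Counter is kept in memory.
rating_counts = Counter()

def do_something_with_chunk(chunk):
    # Hypothetical per-chunk work: count how many ratings each movieId received.
    rating_counts.update(chunk['movieId'].value_counts().to_dict())

chunksize = 20000
for chunk in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    do_something_with_chunk(chunk)

print(rating_counts.most_common(5))  # the five most-rated movies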
