Using pandas to efficiently read in a large CSV file without crashing


Problem Description

I am trying to read a .csv file called ratings.csv from http://grouplens.org/datasets/movielens/20m/; the file is 533.4 MB on my computer.

This is what I am writing in a Jupyter notebook:

import pandas as pd
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')

The problem from here is that the kernel breaks or dies and asks me to restart, and it keeps repeating the same thing. There is no error. Can you please suggest any alternative for solving this? It is as if my computer has no capability of running this.

This works, but it keeps overwriting the previous chunk:

chunksize = 20000
for ratings in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    ratings.append(ratings)
ratings.head()

Only the last chunk is written; the others are discarded.
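
(Side note, not part of the original post: in the loop above the name ratings is reused as the loop variable, so each iteration rebinds it to the new chunk, and the return value of DataFrame.append is discarded. That is why only the last chunk survives. A minimal corrected sketch of the same accumulation idea, keeping the file path from the question:)

import pandas as pd

chunksize = 20000
chunks = []  # collect every chunk instead of overwriting a single variable
for chunk in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    chunks.append(chunk)

ratings = pd.concat(chunks, ignore_index=True)  # one DataFrame with all rows
ratings.head()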

Recommended Answer

You should consider using the chunksize parameter in read_csv when reading in your DataFrame, because it returns a TextFileReader object that you can then pass to pd.concat to concatenate your chunks.

chunksize = 100000
tfr = pd.read_csv('./movielens/ratings.csv', chunksize=chunksize, iterator=True)
df = pd.concat(tfr, ignore_index=True)
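
Note that pd.concat still builds the complete DataFrame in memory, so this only helps when the final frame fits in RAM. As an additional memory-saving option (not part of the original answer; the column names and dtypes below are assumptions about the MovieLens ratings.csv layout), read_csv also accepts usecols and dtype so that each chunk is smaller to begin with:

import pandas as pd

# Assumed columns for MovieLens ratings.csv: userId, movieId, rating, timestamp
df = pd.concat(
    pd.read_csv('./movielens/ratings.csv',
                chunksize=100000,
                usecols=['userId', 'movieId', 'rating'],  # drop columns you do not need
                dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'}),
    ignore_index=True,
)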


If you just want to process each chunk individually, use:

chunksize = 20000
for chunk in pd.read_csv('./movielens/ratings.csv', 
                         chunksize=chunksize, 
                         iterator=True):
    do_something_with_chunk(chunk)
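
For example, here is a minimal sketch of what do_something_with_chunk could look like (the placeholder name is from the answer; the per-movie counting and the movieId column are assumptions for illustration). The running tally is updated from each chunk, so the full file never has to sit in memory at once:

import pandas as pd
from collections import Counter

chunksize = 20000
counts = Counter()  # running number of ratings per movieId

for chunk in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    # update the tally from this chunk only; the chunk is then released
    counts.update(chunk['movieId'].value_counts().to_dict())

print(len(counts), 'distinct movies rated')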
