Using pandas to efficiently read in a large CSV file without crashing
Question
I am trying to read a .csv file called ratings.csv from http://grouplens.org/datasets/movielens/20m/; the file is 533.4 MB on my computer.
This is what I am writing in a Jupyter notebook:
import pandas as pd
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')
The problem is that the kernel breaks or dies and asks me to restart, and it keeps repeating the same thing. There is no error at all. Can you please suggest any alternative for solving this? It is as if my computer has no capability of running it.
This works, but the result keeps getting overwritten:
chunksize = 20000
for ratings in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    ratings.append(ratings)  # the loop variable shadows itself, so nothing accumulates
ratings.head()
Only the last chunk is kept; the others are discarded.
Answer
You should consider using the chunksize parameter in read_csv when reading in your dataframe, because it returns a TextFileReader object you can then pass to pd.concat to concatenate your chunks.
chunksize = 100000
tfr = pd.read_csv('./movielens/ratings.csv', chunksize=chunksize, iterator=True)
df = pd.concat(tfr, ignore_index=True)
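As a self-contained sketch of the same pattern, using a small synthetic in-memory CSV in place of ratings.csv (whose path comes from the question), the chunked read plus pd.concat can be verified like this:

```python
import io
import pandas as pd

# A tiny synthetic stand-in for ratings.csv (the real file is ~533 MB).
csv_data = "userId,movieId,rating\n" + "\n".join(
    f"{i},{i * 10},{(i % 5) + 0.5}" for i in range(1, 11)
)

# With chunksize set, read_csv returns a TextFileReader, an iterator
# of DataFrames that pd.concat can stitch back together.
reader = pd.read_csv(io.StringIO(csv_data), chunksize=4)
df = pd.concat(reader, ignore_index=True)

print(len(df))  # all 10 rows survive the chunked read
```

Note that concatenating still materializes the full dataframe in memory; the chunked read only bounds the memory used while parsing.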
If you just want to process each chunk individually, use:
chunksize = 20000
for chunk in pd.read_csv('./movielens/ratings.csv',
                         chunksize=chunksize,
                         iterator=True):
    do_something_with_chunk(chunk)
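Here do_something_with_chunk is a placeholder from the answer. A hypothetical example of per-chunk processing (again using a small in-memory CSV as a stand-in for ratings.csv) that accumulates a running row count and rating sum, so only one chunk is ever held in memory:

```python
import io
import pandas as pd

csv_data = "userId,movieId,rating\n" + "\n".join(
    f"{i},{i},{float(i % 5)}" for i in range(1, 9)
)

total_rows = 0
rating_sum = 0.0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=3):
    # Each iteration sees one DataFrame of up to 3 rows;
    # previous chunks are free to be garbage-collected.
    total_rows += len(chunk)
    rating_sum += chunk["rating"].sum()

print(total_rows, rating_sum)  # 8 16.0
```

This streaming style is what avoids the kernel crash: aggregate (or write out) each chunk instead of holding them all at once.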