how do I read a large csv (20G)


Problem description

I am a new user of Python. My problem is this:

I have three csv files (each is about 15G, with three columns), and I want to read them into Python and get rid of the rows where dur = 0. My csv looks like this:

sn_fx   sn_tx   dur
5129789 3310325 2
5129789 5144184 1
5129789 5144184 1
5129789 5144184 1
5129789 5144184 1
5129789 6302346 4
5129789 6302346 0

I know I should read line by line, and I tried like this:

# Open the file and iterate over its lines one at a time
with open('cmct_0430x.csv') as f:
    for line in f:
        pass

But it does not seem to work.

Besides, I do not know how to transform these lines into a dataframe.

Could someone show me more details about this? I would appreciate it very much!

Recommended answer

You should use pandas, and read the csv in chunks (a fixed number of rows processed at a time) of a suitable size. Then use concat to combine all the chunks.

import pandas as pd

# Read the csv as an iterator of 1000-row chunks instead of all at once
tp = pd.read_csv('cmct_0430x.csv', iterator=True, chunksize=1000)
# Concatenate the chunks into a single dataframe
df = pd.concat(tp, ignore_index=True)

Pandas: read_csv

You are getting a memory error because you are processing the entire csv at once, and it is larger than your main memory. Try breaking it into chunks and then processing it.
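Since the question also asks how to get rid of the rows where dur = 0, here is a minimal sketch of one way to combine that with chunked reading: filter each chunk before concatenating, so the dropped rows are never accumulated in memory. This assumes the default comma separator and an example chunk size; if the file is actually whitespace-delimited, as the sample above might suggest, pass an appropriate sep to read_csv.

import pandas as pd

# Example chunk size; adjust to what fits in memory.
# If the file is not comma-separated, add e.g. sep='\s+'.
chunks = pd.read_csv('cmct_0430x.csv', chunksize=100000)

# Keep only the rows where dur != 0 in each chunk,
# then concatenate the filtered chunks into one dataframe
filtered = [chunk[chunk['dur'] != 0] for chunk in chunks]
df = pd.concat(filtered, ignore_index=True)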

