Most efficient way to parse a large .csv in Python?


Problem description

I tried to look at other answers, but I am still not sure of the right way to do this. I have a number of really large .csv files (they could be a gigabyte each), and I want to first get their column labels, because they are not all the same, and then, according to user preference, extract some of those columns using some criteria. Before I start the extraction part, I did a simple test to see what is the fastest way to parse these files. Here is my code:

import csv
import mmap
import time

def mmapUsage():
    start = time.time()
    with open("csvSample.csv", "r+b") as f:
        # memory-map the file; size 0 means map the whole file
        mapInput = mmap.mmap(f.fileno(), 0)
        # read content via standard file methods
        L = list()
        for s in iter(mapInput.readline, ""):
            L.append(s)
        print "List length: ", len(L)
        #print "Sample element: ", L[1]
        mapInput.close()
        end = time.time()
        print "Time for completion", end - start

def fileopenUsage():
    start = time.time()
    fileInput = open("csvSample.csv")
    M = list()
    for s in fileInput:
        M.append(s)
    print "List length: ", len(M)
    #print "Sample element: ", M[1]
    fileInput.close()
    end = time.time()
    print "Time for completion", end - start

def readAsCsv():
    X = list()
    start = time.time()
    # binary mode ('rb') is required by the csv module on Python 2
    spamReader = csv.reader(open('csvSample.csv', 'rb'))
    for row in spamReader:
        X.append(row)
    print "List length: ", len(X)
    #print "Sample element: ", X[1]
    end = time.time()
    print "Time for completion", end - start

My results:

=======================
Populating list from Mmap
List length:  1181220
Time for completion 0.592000007629

=======================
Populating list from Fileopen
List length:  1181220
Time for completion 0.833999872208

=======================
Populating list by csv library
List length:  1181220
Time for completion 5.06700015068

So it seems that the csv library most people use is really a lot slower than the others. Maybe it will later prove to be faster once I start extracting data from the csv file, but I cannot be sure of that yet. Any suggestions and tips before I start implementing? Thanks a lot!

Recommended answer

As pointed out several other times, the first two methods do no actual string parsing; they just read a line at a time without extracting fields. I imagine the majority of the speed difference seen with csv is due to that.
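To make the timing comparison fairer, you could add a naive split to the plain-file loop so it also does per-line field extraction. A minimal sketch in the Python 2 style of the question (the function name is made up, and str.split is not a real CSV parser, as the next paragraph explains):

import time

def fileopenSplitUsage():
    start = time.time()
    fileInput = open("csvSample.csv")
    M = list()
    for s in fileInput:
        # naive field extraction: fast, but ignores quoting entirely
        M.append(s.rstrip("\n").split(","))
    print "List length: ", len(M)
    fileInput.close()
    end = time.time()
    print "Time for completion", end - start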

The csv module is invaluable if you have any textual data that may include more of the 'standard' CSV syntax than just commas, especially if you're reading from an Excel format. If you've just got lines like "1,2,3,4", you're probably fine with a simple split, but if you have lines like "1,2,'Hello, my name\'s fred'", you're going to go crazy trying to parse that without errors. The csv module will also transparently handle things like newlines in the middle of a quoted string; a simple for..in loop without csv is going to have trouble with that.
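To illustrate, here is a small made-up example (the sample line is invented, and StringIO stands in for a real file):

import csv
import StringIO  # Python 2; on Python 3 this would be io.StringIO

sample = 'id,name,age\n1,"Hello, my name is Fred",42\n'

# A naive split tears the quoted field apart at the embedded comma:
print sample.splitlines()[1].split(",")
# -> ['1', '"Hello', ' my name is Fred"', '42']

# csv.reader respects the quoting and yields the three real fields:
for row in csv.reader(StringIO.StringIO(sample)):
    print row
# -> ['id', 'name', 'age']
# -> ['1', 'Hello, my name is Fred', '42']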

The csv module has always worked fine for me for reading unicode strings if I use it like so:

import codecs

f = csv.reader(codecs.open(filename, 'rU'))

It is plenty robust for importing multi-thousand-line files with unicode, quoted strings, newlines in the middle of quoted strings, lines with fields missing at the end, and so on, all with reasonable read times. I'd try using it first and only look for optimizations on top of it if you really need the extra speed.
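As a starting point for the extraction step described in the question, here is a minimal, hypothetical sketch (the helper name, the column names, and the predicate are all made up) that reads the header row first and then pulls out the user-selected columns:

import csv

def extractColumns(filename, wanted, predicate):
    reader = csv.reader(open(filename, 'rb'))
    header = next(reader)  # the first row holds the column labels
    # map the requested column names to their positions in this file;
    # header.index will raise ValueError if a requested column is missing
    indices = [header.index(col) for col in wanted]
    result = []
    for row in reader:
        if predicate(row):
            result.append([row[i] for i in indices])
    return result

# Example usage: keep 'name' and 'age' for rows whose first field is non-empty.
rows = extractColumns("csvSample.csv", ["name", "age"],
                      lambda row: row and row[0])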

