Looking for a more efficient way to reorganize a massive CSV in Python

Problem description

I've been working on a problem where I have data from a large output .txt file, and now have to parse and reorganize certain values in the form of a .csv.

I've already written a script that inputs all the data into a .csv in columns based on what kind of data it is (Flight ID, Latitude, Longitude, etc.), but it's not in the correct order. All values are meant to be grouped based on the same Flight ID, in order from earliest time stamp to the latest. Fortunately, my .csv has all values in the correct time order, but not grouped together appropriately according to Flight IDs.

To clear my description up, it looks like this right now,

("Time x" is just to illustrate):

20110117559515, , , , , , , , ,2446,6720,370,42  (Time 0)                               
20110117559572, , , , , , , , ,2390,6274,410,54  (Time 0)                               
20110117559574, , , , , , , , ,2391,6284,390,54  (Time 0)                               
20110117559587, , , , , , , , ,2385,6273,390,54  (Time 0)                               
20110117559588, , , , , , , , ,2816,6847,250,32  (Time 0) 
... 

and it's supposed to be ordered like this:

20110117559515, , , , , , , , ,2446,6720,370,42  (Time 0)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42  (Time 1)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42  (Time 2)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42  (Time 3)
20110117559515, , , , , , , , ,24xx,67xx,3xx,42  (Time N)
20110117559572, , , , , , , , ,2390,6274,410,54  (Time 0)
20110117559572, , , , , , , , ,23xx,62xx,4xx,54  (Time 1)
... and so on

There are some 1.3 million rows in the .csv I output to make things easier. I'm 99% confident the logic in the next script I wrote to fix the ordering is correct, but my fear is that it's extremely inefficient. I ended up adding a progress bar just to see if it's making any progress, and unfortunately this is what I see:

Here's my code handling the crunching (skip down to problem area if you like):

## a class I wrote to handle the huge .csv's ##
from BIGASSCSVParser import BIGASSCSVParser               
import collections                                                              


x = open('newtrajectory.csv')  #file to be reordered                                                  
linetlist = []                                                                  
tidict = {}               

''' To save braincells I stored all the required values
    of each line into a dictionary of tuples.
    Index: Tuple '''

for line in x:                                                                  
    y = line.replace(',',' ')                                                   
    y = y.split()                                                               
    tup = (y[0],y[1],y[2],y[3],y[4])                                            
    linetlist.append(tup)                                                       
for k,v in enumerate(linetlist):                                                
    tidict[k] = v                                                               
x.close()                                                                       


trj = BIGASSCSVParser('newtrajectory.csv')                                      
uniquelFIDs = []                                                                
z = trj.column(0)   # List of out of order Flight ID's                                                     
for i in z:         # like in the example above                                                           
    if i in uniquelFIDs:                                                        
        continue                                                                
    else:                                                                       
        uniquelFIDs.append(i)  # Create list of unique FID's to refer to later                                               

queue = []                                                                              
p = collections.OrderedDict()                                                   
for k,v in enumerate(trj.column(0)):                                            
    p[k] = v  

All good so far, but it's in this next segment my computer either chokes, or my code just sucks:

for k in uniquelFIDs:                                                           
    list = [i for i, x in p.items() if x == k]                                  
    queue.extend(list)                                                          

The idea was that for every unique value, in order, iterate over the 1.3 million values and return, in order, each occurrence's index, and append those values to a list. After that I was just going to read off that large list of indexes and write the contents of that row's data into another .csv file. Ta da! Probably hugely inefficient.

What's wrong here? Is there a more efficient way to do this problem? Is my code flawed, or am I just being cruel to my laptop?

Update:

I've found that with the amount of data I'm crunching, it'll take 9-10 hours. I had half of it correctly spat out in 4.5. An overnight crunch I can get away with for now, but will probably look to use a database or another language next time. I would have if I knew what I was getting into ahead of time, lol.

After adjusting sleep settings for my SSD, it only took 3 hours to crunch.

Solution

If the CSV file fits into your RAM (e.g. it's less than 2 GB), then you can just read the whole thing and do a sort on it:

import csv

data = list(csv.reader(fn))          # fn: an open file object for the unsorted CSV
data.sort(key=lambda line: line[0])  # stable sort on the Flight ID (first column)
csv.writer(outfn).writerows(data)    # outfn: an open, writable output file object

That shouldn't take nearly as long if you don't thrash. Note that .sort is a stable sort, so it will preserve the time order of your file when the keys are equal.
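
For instance, here's a minimal illustration with made-up rows showing that entries sharing a key keep their original relative (time) order after the sort:

rows = [['B', 'Time 0'], ['A', 'Time 0'], ['B', 'Time 1'], ['A', 'Time 1']]
rows.sort(key=lambda r: r[0])   # stable: equal keys keep their relative order
print(rows)  # [['A', 'Time 0'], ['A', 'Time 1'], ['B', 'Time 0'], ['B', 'Time 1']]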

If it won't fit into RAM, you will probably want to do something a bit clever. For example, you can store the file offsets of each line, along with the necessary information from the line (timestamp and flight ID), then sort on those, and write the output file using the line offset information.
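
Here's a rough sketch of that offset-based idea, under a few assumptions not spelled out in the answer: the Flight ID is the first comma-separated field, the file is already in time order (so the line number can stand in for the timestamp), and the file names are placeholders:

# Build an index of (flight_id, line_number, byte_offset) without keeping full rows in RAM.
index = []
with open('newtrajectory.csv', 'rb') as f:
    offset = f.tell()
    line = f.readline()
    lineno = 0
    while line:
        flight_id = line.split(b',', 1)[0]
        index.append((flight_id, lineno, offset))
        offset = f.tell()
        lineno += 1
        line = f.readline()

# Sort by Flight ID; the line number keeps the original (chronological) order within each ID.
index.sort()

# Write the rows out in the new order by seeking back to each saved offset.
with open('newtrajectory.csv', 'rb') as f, open('sorted.csv', 'wb') as out:
    for _, _, offset in index:
        f.seek(offset)
        out.write(f.readline())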
