Need to merge 2 large csv files row by row in python

Problem Description

I am processing 2 large data set files for a project. I managed to clean the files row by row. However, when I try to apply the same logic to merge the 2 files based on a common column, it fails. The problem is that the second loop runs completely and then the top loop runs (I don't know why this happens). I tried using numpy:

import numpy as np

buys = np.genfromtxt('buys_dtsep.dat', delimiter=",", dtype='str')
clicks = np.genfromtxt('clicks_dtsep.dat', delimiter=",", dtype='str')
f = open('combined.dat', 'w')
for s in clicks:
    for s2 in buys:
        pass  # process data

but loading a file with 33 million entries into an array is not feasible due to memory restrictions and the time it would take to load the data into an array and then process it. I'm trying to process the files line by line to avoid running out of memory.

import csv

buys = open('buys_dtsep.dat')
clicks = open('clicks_dtsep.dat')
f = open('combined.dat', 'w')

csv_buys = csv.reader(buys)
csv_clicks = csv.reader(clicks)

for s in csv_clicks:
    print 'file 1 row x'  # to check when it loops
    for s2 in csv_buys:
        print s2[0]  # check looped data
        # do merge op

The expected print output should be:

file 1 row 0
file 2 row 0
 ...
file 2 row x
file 1 row 1
and so on

The output I get is:

file 2 row 0
file 2 row 1
...
file 2 row x
file 1 row 0
...
file 1 row z
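
This happens because csv.reader wraps a one-shot file iterator: the first pass of the outer loop consumes csv_buys completely, so every later iteration of the outer loop finds the inner reader already exhausted and skips straight past it. Below is a minimal sketch (reusing the file names above) of one way to confirm this, rewinding the inner file with seek(0) and recreating the reader on each outer iteration; note this still rescans the buys file once per click row, so it remains far too slow for 33 million entries:

import csv

buys = open('buys_dtsep.dat')
clicks = open('clicks_dtsep.dat')

for s in csv.reader(clicks):
    buys.seek(0)                 # rewind so the inner pass starts over
    for s2 in csv.reader(buys):  # fresh reader over the rewound file
        pass                     # merge op would go here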

Update: sample data

Sample of the buys file

420374,2014-04-06,18:44:58.314,214537888,12462,1
420374,2014-04-06,18:44:58.325,214537850,10471,1
281626,2014-04-06,09:40:13.032,214535653,1883,1
420368,2014-04-04,06:13:28.848,214530572,6073,1
420368,2014-04-04,06:13:28.858,214835025,2617,1
140806,2014-04-07,09:22:28.132,214668193,523,1
140806,2014-04-07,09:22:28.176,214587399,1046,1

Sample of the clicks file

420374,2014-04-06,18:44:58,214537888,0
420374,2014-04-06,18:41:50,214537888,0
420374,2014-04-06,18:42:33,214537850,0
420374,2014-04-06,18:42:38,214537850,0
420374,2014-04-06,18:43:02,214537888,0
420374,2014-04-06,18:43:10,214537888,0
420369,2014-04-07,19:39:43,214839373,0
420369,2014-04-07,19:39:56,214684513,0


Recommended Answer

The following approach will hopefully help. It is designed to be faster and reduce your memory requirements:

import csv
from heapq import merge
from itertools import groupby, ifilter

def get_click_entries(key):
    with open('clicks.csv', 'rb') as f_clicks:
        for entry in ifilter(lambda x: int(x[0]) == key, csv.reader(f_clicks)):
            entry.insert(4, '')  # add empty missing column
            yield entry

# First create a set holding all column 0 click entries

with open('clicks.csv', 'rb') as f_clicks:
    csv_clicks = csv.reader(f_clicks)
    click_keys = {int(cols[0]) for cols in csv_clicks}

with open('buys.csv', 'rb') as f_buys, \
     open('merged.csv', 'wb') as f_merged:

    csv_buys = csv.reader(f_buys)
    csv_merged = csv.writer(f_merged)

    for k, g in groupby(csv_buys, key=lambda x: int(x[0])):
        if k in click_keys:
            buys = sorted(g, key=lambda x: (x[1], x[2]))
            clicks = sorted(get_click_entries(k), key=lambda x: (x[1], x[2]))
            csv_merged.writerows(merge(buys, clicks))  # merge the two lists based on the timestamp
            click_keys.remove(k)
        else:
            csv_merged.writerows(g)  # no matching clicks; write the buy rows unchanged

    # Write any remaining click entries

    for k in click_keys:
        csv_merged.writerows(get_click_entries(k))
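
One caveat worth making explicit (an observation about itertools.groupby, not something stated in the answer): groupby only groups consecutive rows that share a key, so this approach assumes buys.csv already has its rows grouped by column 0, as in the sample data. A tiny sketch with hypothetical rows shows what happens otherwise:

from itertools import groupby

rows = [['420374', 'a'], ['420374', 'b'], ['281626', 'c'], ['420374', 'd']]

for k, g in groupby(rows, key=lambda x: int(x[0])):
    print k, [r[1] for r in g]

# 420374 ['a', 'b']
# 281626 ['c']
# 420374 ['d']   <- a non-consecutive key starts a new group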

For your two sample files, this would produce the following output:

140806,2014-04-07,09:22:28.132,214668193,523,1
140806,2014-04-07,09:22:28.176,214587399,1046,1
281626,2014-04-06,09:40:13.032,214535653,1883,1
420368,2014-04-04,06:13:28.848,214530572,6073,1
420368,2014-04-04,06:13:28.858,214835025,2617,1
420374,2014-04-06,18:41:50,214537888,,0
420374,2014-04-06,18:42:33,214537850,,0
420374,2014-04-06,18:42:38,214537850,,0
420374,2014-04-06,18:43:02,214537888,,0
420374,2014-04-06,18:43:10,214537888,,0
420374,2014-04-06,18:44:58,214537888,,0
420374,2014-04-06,18:44:58.314,214537888,12462,1
420374,2014-04-06,18:44:58.325,214537850,10471,1
420369,2014-04-07,19:39:43,214839373,,0
420369,2014-04-07,19:39:56,214684513,,0

It works by first creating a set of all of the column 0 entries; this means the whole clicks file does not have to be reread when an entry is known not to be present. It then reads a group of matching column 0 entries from buys and a corresponding list of column 0 entries from clicks. These are sorted by timestamp and merged together in order. Each key is then removed from the set so its click entries are not reread.
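
As a small illustration of the central trick, here is heapq.merge applied to two hypothetical, already-sorted row lists for a single key (the rows mimic the sample data but are not taken from the real files). Because Python lists compare element by element, rows sharing a key and date are ordered by the time column, which is exactly the interleaving seen in the merged output above:

from heapq import merge

clicks = [['420374', '2014-04-06', '18:41:50', '214537888', '', '0'],
          ['420374', '2014-04-06', '18:44:58', '214537888', '', '0']]
buys = [['420374', '2014-04-06', '18:44:58.314', '214537888', '12462', '1']]

for row in merge(clicks, buys):
    print row

# Both click rows print first, then the buy: '18:44:58' < '18:44:58.314'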
