Python Killed: 9 when running a code using dictionaries created from 2 csv files

Question

I am running code that has always worked for me. This time I ran it on two .csv files: "data" (24 MB) and "data1" (475 MB). "data" has 3 columns of about 680,000 elements each, whereas "data1" has 3 columns of 33,000,000 elements each. When I run the code, I get just "Killed: 9" after some 5 minutes of processing. If this is a memory problem, how can I solve it? Any suggestion is welcome!

Here is the code:

import csv
import numpy as np

from collections import OrderedDict # to save keys order

from numpy import genfromtxt
my_data = genfromtxt('data.csv', dtype='S', 
                 delimiter=',', skip_header=1) 
my_data1 = genfromtxt('data1.csv', dtype='S', 
                 delimiter=',', skip_header=1) 

d= OrderedDict((rows[2],rows[1]) for rows in my_data)
d1= dict((rows[0],rows[1]) for rows in my_data1) 

dset = set(d) # returns keys
d1set = set(d1)

d_match = dset.intersection(d1) # returns matched keys

import sys  
sys.stdout = open("rs_pos_ref_alt.csv", "w") 

for row in my_data:
    if row[2] in d_match: 
        print [row[1], row[2]]

The header of "data" is:

    dbSNP RS ID Physical Position
0   rs4147951   66943738
1   rs2022235   14326088
2   rs6425720   31709555
3   rs12997193  106584554
4   rs9933410   82323721
5   rs7142489   35532970

The header of "data1" is:

    V2  V4  V5
10468   TC  T
10491   CC  C
10518   TG  T
10532   AG  A
10582   TG  T

Answer

Most likely the kernel killed the process because your script consumed too much memory. You need to take a different approach and try to minimize the amount of data held in memory at once.
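A back-of-envelope estimate makes the problem concrete. genfromtxt parses the whole file into one in-memory array, and the dict built from it then adds substantial per-entry overhead on top (each key/value pair costs on the order of a hundred bytes in CPython). The sketch below assumes roughly 10 bytes per field, based on the sample rows shown in the question:

```python
# Back-of-envelope footprint of the genfromtxt array for data1.csv
# (a sketch: 33,000,000 rows x 3 byte-string fields; the ~10 bytes
# per field is an assumption based on the sample rows in the question).
rows, cols, field_bytes = 33_000_000, 3, 10
array_bytes = rows * cols * field_bytes
print(f"array alone: ~{array_bytes / 1e9:.2f} GB")  # → ~0.99 GB
```

And that is before the `dict((rows[0], rows[1]) ...)` comprehension duplicates the data into 33 million Python objects, which is where the memory use really explodes.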

You may also find this question useful: Very large matrices using Python and NumPy
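If you prefer tabular tooling, another general way to keep the working set small (a sketch, not part of the original answer; pandas, the in-memory stand-in file, and the chunk size are assumptions here) is to read only the columns you need, in bounded-size chunks:

```python
import io

import pandas as pd

# Tiny in-memory stand-in for data1.csv (hypothetical sample rows).
csv_text = "V2,V4,V5\n10468,TC,T\n10491,CC,C\n10518,TG,T\n"

wanted = {"10468", "10518"}  # stand-in for the keys taken from data.csv
matches = []

# usecols drops the unneeded column; chunksize bounds what is in RAM.
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["V2", "V4"],
                         dtype=str, chunksize=2):
    matches.extend(chunk.loc[chunk["V2"].isin(wanted), "V2"].tolist())

print(matches)  # → ['10468', '10518']
```

With the real files you would pass the filename instead of the `StringIO` buffer and a chunk size in the tens of thousands.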

In the following code snippet I avoid loading the huge data1.csv into memory by processing it line by line. Give it a try.

import csv
from collections import OrderedDict  # preserves the key order of data.csv

# Small file: load it fully, keyed by position (column 2).
with open('data.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip header
    d = OrderedDict((row[2], {"val": row[1], "flag": False}) for row in reader)

# Huge file: stream it row by row, never holding it all in memory.
with open('data1.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip header
    for row in reader:
        if row[0] in d:
            d[row[0]]["flag"] = True

# Write the matched (RS ID, position) pairs as proper CSV rows.
with open("rs_pos_ref_alt.csv", "w", newline='') as out:
    writer = csv.writer(out)
    for k, v in d.items():
        if v["flag"]:
            writer.writerow([v["val"], k])
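To see the flag-dict technique in miniature (hypothetical sample rows; the column layout mirrors the question's files, with the join key in column 2 of data.csv and column 0 of data1.csv, and in-memory buffers standing in for the files):

```python
import csv
import io
from collections import OrderedDict

# Hypothetical miniature versions of the two files.
data_csv = "idx,rsid,pos\n0,rs1,k1\n1,rs2,k2\n2,rs3,k3\n"
data1_csv = "V2,V4,V5\nk2,TC,T\nk9,CC,C\nk3,TG,T\n"

reader = csv.reader(io.StringIO(data_csv))
next(reader)  # skip header
d = OrderedDict((row[2], {"val": row[1], "flag": False}) for row in reader)

reader = csv.reader(io.StringIO(data1_csv))
next(reader)  # skip header
for row in reader:
    if row[0] in d:  # mark positions that also occur in data1
        d[row[0]]["flag"] = True

matched = [(v["val"], k) for k, v in d.items() if v["flag"]]
print(matched)  # → [('rs2', 'k2'), ('rs3', 'k3')]
```

Only the small file's keys live in memory; the big file is consumed one row at a time, which is what keeps the peak footprint low.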
