Python "Killed: 9" when running code that uses dictionaries created from 2 CSV files
Problem description
I am running code that has always worked for me. This time I ran it on 2 .csv files: "data" (24 MB) and "data1" (475 MB). "data" has 3 columns of about 680,000 elements each, whereas "data1" has 3 columns of 33,000,000 elements each. When I run the code, I get just "Killed: 9" after about 5 minutes of processing. If this is a memory problem, how can I solve it? Any suggestion is welcome!
Here is the code:
import csv
import sys
import numpy as np
from collections import OrderedDict  # to save keys order
from numpy import genfromtxt

# Both files are loaded into memory in full as arrays of byte strings.
my_data = genfromtxt('data.csv', dtype='S',
                     delimiter=',', skip_header=1)
my_data1 = genfromtxt('data1.csv', dtype='S',
                      delimiter=',', skip_header=1)

d = OrderedDict((rows[2], rows[1]) for rows in my_data)
d1 = dict((rows[0], rows[1]) for rows in my_data1)

dset = set(d)  # returns keys
d1set = set(d1)
d_match = dset.intersection(d1)  # returns matched keys

sys.stdout = open("rs_pos_ref_alt.csv", "w")
for row in my_data:
    if row[2] in d_match:
        print [row[1], row[2]]
The header of "data" is:
dbSNP RS ID Physical Position
0 rs4147951 66943738
1 rs2022235 14326088
2 rs6425720 31709555
3 rs12997193 106584554
4 rs9933410 82323721
5 rs7142489 35532970
The header of "data1" is:
V2 V4 V5
10468 TC T
10491 CC C
10518 TG T
10532 AG A
10582 TG T
Recommended answer
Most likely the kernel kills the process because your script consumes too much memory. You need to take a different approach and try to minimize the amount of data held in memory.
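To get a rough feel for why a dictionary built from 33,000,000 rows can exhaust memory, one could measure what a small dictionary of string pairs costs and extrapolate. The following is a minimal Python 3 sketch, not part of the original answer; the function name `dict_memory_bytes` and the synthetic keys and values are illustrative assumptions, and real figures depend on the Python version and on the actual strings in the files.

```python
import tracemalloc


def dict_memory_bytes(n_rows):
    """Return (bytes allocated, entry count) for a dict of n_rows string pairs.

    The keys and values are synthetic stand-ins for CSV fields.
    """
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    d = {str(i): "TC" + str(i) for i in range(n_rows)}
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before, len(d)


if __name__ == "__main__":
    used, count = dict_memory_bytes(100000)
    per_row = used / count
    # Linear extrapolation to the size of data1.csv; a rough estimate only.
    print("about %.0f bytes per entry" % per_row)
    print("about %.1f GB for 33,000,000 entries" % (per_row * 33000000 / 1e9))
```

The estimate covers only the dictionary itself. In the question's code, the array returned by `genfromtxt` and the extra sets built from the keys are held in memory at the same time.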
You may also find this question useful: Very large matrices using Python and NumPy
In the following code snippet I tried to avoid loading the huge data1.csv
into memory by processing it line by line. Give it a try.
import csv
import sys
from collections import OrderedDict  # to save keys order

# Only the small file is kept in memory, keyed by the third column.
with open('data.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip header
    d = OrderedDict((rows[2], {"val": rows[1], "flag": False}) for rows in reader)

# The big file is read one row at a time; matching keys are only flagged.
with open('data1.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # skip header
    for rows in reader:
        if rows[0] in d:
            d[rows[0]]["flag"] = True

sys.stdout = open("rs_pos_ref_alt.csv", "w")
for k, v in d.iteritems():
    if v["flag"]:
        print [v["val"], k]
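The snippet above is written for Python 2 (`print` statement, `iteritems`, files opened in `'rb'` mode for `csv`). For readers on Python 3, the same line-by-line idea might be sketched as follows. This is an adaptation under assumptions, not the original answer's code: the function name `matching_rows` is hypothetical, the column positions are taken from the snippets above, and the output is written with `csv.writer` rather than by redirecting `sys.stdout`, so the file contains plain comma-separated rows instead of printed lists.

```python
import csv


def matching_rows(small_path, big_path, out_path):
    """Write [value, key] for rows of small_path whose key occurs in big_path.

    Returns the number of matched keys.
    """
    # Keep only the small file in memory: third column -> second column.
    # Plain dicts preserve insertion order in Python 3.7+.
    with open(small_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        small = {row[2]: row[1] for row in reader}

    # Stream the big file row by row, remembering only the keys that match.
    matched = set()
    with open(big_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        for row in reader:
            if row[0] in small:
                matched.add(row[0])

    # Emit the matches in the order they appeared in the small file.
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for key, value in small.items():
            if key in matched:
                writer.writerow([value, key])
    return len(matched)
```

A call such as `matching_rows("data.csv", "data1.csv", "rs_pos_ref_alt.csv")` would mirror the file names used in the question. At no point is more than one row of the large file held in memory, plus the set of keys that matched.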