How to chunk a csv (dict)reader object in Python 3.2?
Question
I'm trying to use Pool from the multiprocessing module to speed up reading large CSV files. For this, I adapted an example (from py2k), but it seems the csv.DictReader object has no length. Does that mean I can only iterate over it? Is there still a way to chunk it?
These questions seemed relevant, but did not really answer my question: Number of lines in csv.DictReader, How to chunk a list in Python 3?
The code tries to do this:
import csv
import time
import networkx as nx
from multiprocessing import Pool

source = open('/scratch/data.txt','r')

def csv2nodes(r):
    strptime = time.strptime
    mktime = time.mktime
    l = []
    ppl = set()
    for row in r:
        cell = int(row['cell'])
        id = int(row['seq_ei'])
        st = mktime(strptime(row['dat_deb_occupation'],'%d/%m/%Y'))
        ed = mktime(strptime(row['dat_fin_occupation'],'%d/%m/%Y'))
        # collect list
        l.append([(id,cell,{1:st,2:ed})])
        # collect separate sets
        ppl.add(id)
    return (l,ppl)

def csv2graph(source):
    r = csv.DictReader(source,delimiter=',')
    MG = nx.MultiGraph()
    l = []
    ppl = set()
    # Remember that I use integers for edge attributes, to save space! Dict above.
    # start: 1
    # end: 2
    p = Pool(processes=4)
    node_divisor = len(p._pool)*4
    # this is where it breaks: csv.DictReader has no len()
    node_chunks = list(chunks(r,int(len(r)/int(node_divisor))))
    num_chunks = len(node_chunks)
    pedgelists = p.map(csv2nodes,
                       zip(node_chunks))
    ll = []
    for l in pedgelists:
        ll.append(l[0])
        ppl.update(l[1])
    MG.add_edges_from(ll)
    return (MG,ppl)
Answer
As the csv.DictReader documentation says (and likewise for the csv.reader object it wraps), the class returns an iterator. The code should have thrown a TypeError when you called len().
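To see this concretely, here is a minimal sketch using an in-memory stand-in for the file (the sample data is made up; the column names mirror the question's):

```python
import csv
import io

# A tiny in-memory CSV standing in for the large file (hypothetical data).
source = io.StringIO("cell,seq_ei\n7,42\n3,99\n")
r = csv.DictReader(source)

try:
    len(r)          # DictReader is an iterator, not a sequence
except TypeError:
    rows = list(r)  # materialize it to get a length (reads everything into memory)

print(len(rows))    # 2
```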
You can still chunk the data, but you'll have to read it entirely into memory first. If you're concerned about memory, you can switch from csv.DictReader to csv.reader and skip the overhead of the dictionaries csv.DictReader creates. To improve readability in csv2nodes(), you can assign constants to address each field's index:
CELL = 0
SEQ_EI = 1
DAT_DEB_OCCUPATION = 4
DAT_FIN_OCCUPATION = 5
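As a sketch of that approach (the chunks() helper is an assumption, since the question's version isn't shown, and the sample data is made up):

```python
import csv
import io

CELL = 0
SEQ_EI = 1

def chunks(lst, size):
    # Yield successive size-sized slices of a list.
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

# Stand-in for the real file; plain csv.reader rows are lists, not dicts.
source = io.StringIO("cell,seq_ei\n7,42\n3,99\n5,11\n")
r = csv.reader(source)
next(r)                    # skip the header row
rows = list(r)             # read everything into memory so it has a length

node_chunks = list(chunks(rows, 2))
# Fields are now addressed by index constants instead of dict keys.
first_cell = int(node_chunks[0][0][CELL])   # 7
```

Each chunk is a plain list of rows, so it can be passed to Pool.map() workers directly.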
I also recommend using a different variable name than id, since that's a built-in function name.