Disturbing odd behavior/bug in Python itertools groupby?
Problem description
I am using itertools.groupby to parse a short tab-delimited text file. The file has several columns, and all I want to do is group all the entries that have a particular value x in a particular column. The code below does this for a column called name2, looking for the value stored in the variable x. I tried to do this using csv.DictReader and itertools.groupby. In the table, 8 rows match this criterion, so 8 entries should be returned. Instead, groupby returns two sets of entries, one with a single entry and another with 7, which seems like the wrong behavior. Below, I do the matching manually on the same data and get the right result:
import itertools, operator, csv

# f (the input file path) and fieldnames are defined elsewhere; the question does not show them.
col_name = "name2"
x = "ENSMUSG00000002459"

print("looking for entries with value %s in column %s" % (x, col_name))

print("groupby gets it wrong: ")
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
for name, entries in itertools.groupby(data, key=operator.itemgetter(col_name)):
    if name == x:
        # This branch runs once per consecutive run of matching rows.
        wrong_result = [e for e in entries]
        print("wrong result has %d entries" % len(wrong_result))

print("manually grouping entries is correct: ")
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
correct_result = []
for row in data:
    if row[col_name] == x:
        correct_result.append(row)
print("correct result has %d entries" % len(correct_result))
The output I get is:
looking for entries with value ENSMUSG00000002459 in column name2
groupby gets it wrong:
wrong result has 7 entries
wrong result has 1 entries
manually grouping entries is correct:
correct result has 8 entries
What is going on here? If groupby is really grouping, it seems like I should get only one set of entries per x, but instead it returns two. I cannot figure this out.
EDIT: Ah, got it: the data should be sorted.
Recommended answer
groupby only groups consecutive items that share a key, so you will want to change your code to force the data into key order first:
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
# Sort by the grouping key so that rows with equal keys are adjacent.
sorted_data = sorted(data, key=operator.itemgetter(col_name))
for name, entries in itertools.groupby(sorted_data, key=operator.itemgetter(col_name)):
    pass # whatever
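To see why key order matters, here is a minimal self-contained sketch. groupby starts a new group every time the key changes, so equal keys that are not adjacent end up in separate groups. The rows and the names geneA and geneB are made-up sample data, not the asker's file.

```python
import itertools
import operator

# Hypothetical sample rows standing in for the tab-delimited file.
rows = [
    {"name2": "geneA", "value": 1},
    {"name2": "geneB", "value": 2},
    {"name2": "geneA", "value": 3},
    {"name2": "geneA", "value": 4},
]
get_key = operator.itemgetter("name2")

# Unsorted input: "geneA" occurs in two separate runs, so it yields two groups.
unsorted_groups = [
    (key, len(list(group)))
    for key, group in itertools.groupby(rows, key=get_key)
]

# Sorted input: equal keys are adjacent, so each key yields exactly one group.
sorted_groups = [
    (key, len(list(group)))
    for key, group in itertools.groupby(sorted(rows, key=get_key), key=get_key)
]

print(unsorted_groups)  # [('geneA', 1), ('geneB', 1), ('geneA', 2)]
print(sorted_groups)    # [('geneA', 3), ('geneB', 1)]
```

This mirrors the 1 + 7 split in the question: the 8 matching rows presumably sit in two separate runs in the file.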
The main use of groupby, though, is when the datasets are large and the data is already in key order. When you would have to sort anyway, using a defaultdict is more efficient:
from collections import defaultdict

name_entries = defaultdict(list)
for row in data:
    name_entries[row[col_name]].append(row)
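As a runnable illustration of the defaultdict approach, the sketch below groups a small in-memory list in a single pass, with no sorting step. The rows and the names geneA and geneB are hypothetical sample data used only for this example.

```python
from collections import defaultdict

# Hypothetical sample rows; the key column is deliberately not in sorted order.
rows = [
    {"name2": "geneA", "value": 1},
    {"name2": "geneB", "value": 2},
    {"name2": "geneA", "value": 3},
    {"name2": "geneA", "value": 4},
]

# One pass over the data: each row is appended to the list for its key.
name_entries = defaultdict(list)
for row in rows:
    name_entries[row["name2"]].append(row)

counts = {name: len(entries) for name, entries in name_entries.items()}
print(counts)
```

Every row for a key lands in the same list regardless of where it appears in the input, and rows within each list keep their original relative order.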