Disturbing odd behavior/bug in Python itertools groupby?

Question

I am using itertools.groupby to parse a short tab-delimited text file. The file has several columns, and all I want to do is group the entries that have a particular value x in a particular column. The code below does this for a column called name2, looking for the value stored in the variable x. I tried to do this using csv.DictReader and itertools.groupby. The table has 8 rows that match this criterion, so 8 entries should be returned. Instead, groupby returns two sets of entries, one with a single entry and another with 7, which seems like the wrong behavior. Below I also do the matching manually on the same data and get the right result:

import csv
import itertools
import operator

# f (the input file path) and fieldnames are not shown here.
col_name = "name2"
x = "ENSMUSG00000002459"
print("looking for entries with value %s in column %s" % (x, col_name))
print("groupby gets it wrong: ")
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
for name, entries in itertools.groupby(data, key=operator.itemgetter(col_name)):
    if name == x:
        wrong_result = list(entries)
        print("wrong result has %d entries" % len(wrong_result))
print("manually grouping entries is correct: ")
data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
correct_result = []
for row in data:
    if row[col_name] == x:
        correct_result.append(row)
print("correct result has %d entries" % len(correct_result))

The output I get is:

looking for entries with value ENSMUSG00000002459 in column name2
groupby gets it wrong: 
wrong result has 7 entries
wrong result has 1 entries
manually grouping entries is correct: 
correct result has 8 entries

What is going on here? If groupby is really grouping, it seems like I should get only one set of entries per x, but instead it returns two. I cannot figure this out. EDIT: Ah, got it: the data should be sorted first.

Recommended answer

You're going to want to change your code to force the data to be in key order, because groupby only groups consecutive rows that share the same key:

data = csv.DictReader(open(f), delimiter="\t", fieldnames=fieldnames)
sorted_data = sorted(data, key=operator.itemgetter(col_name))
# Group the sorted list; the reader itself is exhausted once sorted() has consumed it.
for name, entries in itertools.groupby(sorted_data, key=operator.itemgetter(col_name)):
    pass # whatever
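
To see why the order matters, here is a minimal sketch that uses a small in-memory list in place of the original file. The rows, the `gene_a`/`gene_b` values, and the `group_sizes` helper are made up for illustration and written for Python 3. groupby starts a new group every time the key changes, so rows with the same key that are not adjacent land in separate groups:

```python
import itertools
import operator

# Hypothetical rows standing in for what csv.DictReader would yield.
rows = [
    {"name2": "gene_a", "value": 1},
    {"name2": "gene_b", "value": 2},
    {"name2": "gene_a", "value": 3},
    {"name2": "gene_a", "value": 4},
]
key = operator.itemgetter("name2")


def group_sizes(iterable):
    """Return (key, group length) pairs in the order groupby yields them."""
    return [
        (name, len(list(entries)))
        for name, entries in itertools.groupby(iterable, key=key)
    ]


unsorted_sizes = group_sizes(rows)
sorted_sizes = group_sizes(sorted(rows, key=key))

print(unsorted_sizes)  # gene_a shows up twice: its rows are not adjacent
print(sorted_sizes)    # one group per key once the rows are sorted
```

This mirrors the 7 + 1 split in the question: the matching rows were presumably interrupted by a row with a different key.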

The main use of groupby, though, is when the datasets are large and the data is already in key order. When you would have to sort anyway, using a defaultdict is more efficient:

from collections import defaultdict
name_entries = defaultdict(list)
for row in data:
    name_entries[row[col_name]].append(row)
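
As a self-contained sketch of the same idea, the following runs the defaultdict approach on a small hypothetical list of rows (the `gene_a`/`gene_b` values are invented for illustration, Python 3). Because each row is appended to the list for its key, the input order does not matter and no sorting is needed:

```python
from collections import defaultdict

# Hypothetical rows standing in for what csv.DictReader would yield.
rows = [
    {"name2": "gene_a", "value": 1},
    {"name2": "gene_b", "value": 2},
    {"name2": "gene_a", "value": 3},
    {"name2": "gene_a", "value": 4},
]
col_name = "name2"

# Single pass over the data: each row goes into the list for its key.
name_entries = defaultdict(list)
for row in rows:
    name_entries[row[col_name]].append(row)

counts = {name: len(entries) for name, entries in name_entries.items()}
print(counts)
```

The trade-off is memory: the dictionary holds every row at once, whereas groupby over data that is already ordered can process one group at a time.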
