Pandas: Get duplicated indexes

Question

Given a dataframe, I want to get the duplicated indexes whose column values are not all duplicates of each other, and see which values differ.

Specifically, I have this dataframe:

import pandas as pd
# Download the data first (this is a shell command, not Python):
#   wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed
alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed', header=None, index_col=3)

In [74]: alt_exon_repeatmasker.index.is_unique
Out[74]: False

Some of the indexes have duplicate values in the 9th column (the type of DNA repetitive element at this location), and I want to know what the different types of repetitive elements are for individual locations (each index = a genome location).

I'm guessing this will require some kind of groupby, and hopefully some groupby ninja can help me out.

To simplify even further, suppose we only have the index and the repeat type:

genome_location1    MIR3
genome_location1    AluJb
genome_location2    Tigger1
genome_location3    AT_rich

The output I'd like to see is all duplicated indexes and their repeat types, like so:

genome_location1    MIR3
genome_location1    AluJb

Edit: added a toy example.

Recommended answer

df.groupby(level=0).filter(lambda x: len(x) > 1)['type']

We added a filter method for this kind of operation. You can also use masking and transform to get equivalent results, but filter is faster and a little more readable too.
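As a minimal sketch of the one-liner above, here it is applied to the toy data from the question. The column name "type" comes from the answer's snippet; the way the frame is built here is an assumption made for illustration, and it presumes pandas 0.13 or later.

```python
import pandas as pd

# Toy data mirroring the question: the index holds genome locations and
# the single column "type" holds the repeat type. Building the frame this
# way is an assumption; the real data comes from the .bed file.
df = pd.DataFrame(
    {"type": ["MIR3", "AluJb", "Tigger1", "AT_rich"]},
    index=["genome_location1", "genome_location1",
           "genome_location2", "genome_location3"],
)

# Group rows by index value and keep only the groups with more than one row.
duplicated = df.groupby(level=0).filter(lambda x: len(x) > 1)["type"]
print(duplicated)
```

Only the two genome_location1 rows survive, since that is the only index value occurring more than once.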

Important note:

The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. That issue (and a related issue with transform on Series) was fixed for version 0.13, which should be released any day now.

Clearly, nonunique indexes are the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try it on a Series with a nonunique index, it too will fail on versions before 0.13.
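The answer does not spell out the masking-and-transform variant, so the following is only one possible sketch of it, not necessarily what the author had in mind. It assumes the same toy frame and a pandas version in which transform handles nonunique indexes (0.13 or later, per the note above).

```python
import pandas as pd

# Same assumed toy frame as in the question's simplified example.
df = pd.DataFrame(
    {"type": ["MIR3", "AluJb", "Tigger1", "AT_rich"]},
    index=["genome_location1", "genome_location1",
           "genome_location2", "genome_location3"],
)

# Broadcast each group's row count back onto every row of that group.
# Note: "count" ignores missing values in the "type" column.
group_sizes = df.groupby(level=0)["type"].transform("count")

# Convert to a plain NumPy boolean mask so that no index alignment
# is attempted against the nonunique index.
mask = group_sizes.to_numpy() > 1
duplicated = df.loc[mask, "type"]
print(duplicated)
```

This selects the same rows as the filter version; the extra step of building a mask is what makes it a little less readable.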

There is no good reason why filter and transform should not be applicable to nonunique indexes; it was just poorly implemented at first.
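For readers on more recent pandas, a different technique from the one in this answer may also work: Index.duplicated with keep=False marks every occurrence of a repeated index value. The sketch below assumes the same toy frame and a pandas version that supports the keep parameter; it then lists the distinct repeat types per location, which is what the question ultimately asks for.

```python
import pandas as pd

# Same assumed toy frame as in the question's simplified example.
df = pd.DataFrame(
    {"type": ["MIR3", "AluJb", "Tigger1", "AT_rich"]},
    index=["genome_location1", "genome_location1",
           "genome_location2", "genome_location3"],
)

# keep=False flags all occurrences of a repeated index value,
# not only the second and later ones.
mask = df.index.duplicated(keep=False)
duplicated = df.loc[mask, "type"]

# Distinct repeat types for each duplicated genome location.
types_per_location = duplicated.groupby(level=0).unique()
print(types_per_location)
```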
