Pandas - identify unique triplets from a df
Question
I have a dataframe which represents unique items. Each item is uniquely identified by a set of varA, varB, and varC values (so each item has 0 to n values for each of varA, varB, and varC). My df has multiple rows per unique item, with various combinations of varA, varB, and varC.
The df looks like this (ID is unique within its column, but it does not identify the unique item).
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'varA': ['a', 'd', 'a', 'm', 'Z'],
                   'varB': ['b', 'e', 'k', 'e', np.nan],  # np.nan marks a missing value
                   'varC': ['c', 'f', 'l', np.nan, 't']})
So in the df here, you can see that:
- 1 and 3 are the same item, with: {varA:[a], varB:[b,k], varC:[c,l]}.
- 2 and 4 are also the same item: {varA:[d,m], varB:[e], varC:[f]}
I would like to identify every unique item, give each one a unique id, and store its information.
The code I have written is terribly inefficient:
- Step1: I walk through each row of the dataframe and keep a list of the values seen for each variable
- When all three values are new, it is a new item and I give it an id.
- When any of the values is already known, I store the new ones in their respective lists and move on to the next row
- Step2: this leaves me with two dataframes:
- one with a unique id,
- the other without a unique id, but whose information can be found among the rows that do have one, through either varA, varB, or varC. So, quite uglily, I merge successively on each variable to find the unique id.
This works well with 20,000 input rows using varA and varB. On 100,000 rows it runs very slowly and dies before the end (between Step1 and Step2), and I need it to work on 1,000,000 rows.
Is there a pandas-idiomatic way of doing this?
Recommended answer
You can use duplicated.
If you want to keep the first occurrence of each duplicated value:
myfilter = (~df.varA.duplicated(keep='first')
            & ~df.varB.duplicated(keep='first')
            & ~df.varC.duplicated(keep='first'))
If you don't want to keep any occurrence of a duplicated value:
myfilter = (~df.varA.duplicated(keep=False)
            & ~df.varB.duplicated(keep=False)
            & ~df.varC.duplicated(keep=False))
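To see what the two filters select, here is a minimal sketch run on the sample df from the question. The helper name no_duplicates_mask is made up for this illustration; it only wraps the same duplicated calls in a loop. Note that, as far as I know, duplicated treats missing values as equal to each other, so a column with several NaN rows would flag them as duplicates (the sample has at most one NaN per column).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'varA': ['a', 'd', 'a', 'm', 'Z'],
                   'varB': ['b', 'e', 'k', 'e', np.nan],
                   'varC': ['c', 'f', 'l', np.nan, 't']})

cols = ['varA', 'varB', 'varC']


def no_duplicates_mask(frame, keep):
    """Hypothetical helper: True where no column flags the row as a duplicate."""
    mask = pd.Series(True, index=frame.index)
    for col in cols:
        mask &= ~frame[col].duplicated(keep=keep)
    return mask


keep_first = no_duplicates_mask(df, 'first')  # first occurrence counts as non-duplicate
keep_none = no_duplicates_mask(df, False)     # every occurrence counts as duplicate

ids_first = df.loc[keep_first, 'ID'].tolist()
ids_none = df.loc[keep_none, 'ID'].tolist()
print(ids_first)  # [1, 2, 5]
print(ids_none)   # [5]
```

With keep='first', rows 3 and 4 are dropped because they repeat a value from rows 1 and 2; with keep=False, rows 1 to 4 are all dropped and only row 5 remains.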
Then you can, for example, give these rows an incremental uniqueID:
# label-based assignment; rows outside the filter stay NaN
df.loc[myfilter, 'uniqueID'] = np.arange(myfilter.sum(), dtype='int')
df

# output with the keep='first' filter:
   ID varA varB varC  uniqueID
0   1    a    b    c       0.0
1   2    d    e    f       1.0
2   3    a    k    l       NaN
3   4    m    e  NaN       NaN
4   5    Z  NaN    t       2.0
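The filter above labels only the first row of each item; rows 3 and 4 still have no uniqueID. One possible way to label every row, which is not part of the answer above, is to treat rows that share a non-missing value in the same column as connected and to group them with a union-find structure. The sketch below is written under that assumption; label_items, find, and union are names invented for this example, and it has only been reasoned through on the sample df, not benchmarked on 1,000,000 rows.

```python
import numpy as np
import pandas as pd


def label_items(frame, cols):
    """Hypothetical helper: give the same id to rows sharing any value in cols."""
    parent = list(range(len(frame)))  # union-find forest over row positions

    def find(i):
        # Follow parents up to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # smallest position becomes root

    for col in cols:
        first_seen = {}  # value -> first row position holding it in this column
        for pos, value in enumerate(frame[col].tolist()):
            if pd.isna(value):
                continue  # missing values never link two rows
            if value in first_seen:
                union(pos, first_seen[value])
            else:
                first_seen[value] = pos

    roots = np.array([find(i) for i in range(len(frame))])
    codes, _ = pd.factorize(roots)  # consecutive ids in order of appearance
    return pd.Series(codes, index=frame.index, name='uniqueID')


df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'varA': ['a', 'd', 'a', 'm', 'Z'],
                   'varB': ['b', 'e', 'k', 'e', np.nan],
                   'varC': ['c', 'f', 'l', np.nan, 't']})
cols = ['varA', 'varB', 'varC']

df['uniqueID'] = label_items(df, cols)

# Collect the distinct non-missing values of each variable per item.
items = {int(uid): {col: sorted(group[col].dropna().unique()) for col in cols}
         for uid, group in df.groupby('uniqueID')}

print(df['uniqueID'].tolist())  # [0, 1, 0, 1, 2]
print(items[0])  # {'varA': ['a'], 'varB': ['b', 'k'], 'varC': ['c', 'l']}
```

On the sample data this reproduces the grouping described in the question: rows 1 and 3 form one item, rows 2 and 4 another, and row 5 stands alone. Each column is scanned once, so the cost should grow roughly linearly with the number of rows.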