Pandas - identify unique triplets from a df

Views: 79 Published: 2020/5/18 23:11:18 Tags: python-2.7 pandas numpy


Question

I have a dataframe which represents unique items. Each item is uniquely identified by a set of varA, varB, and varC values (so each item has 0 to n values for each of varA, varB, and varC). My df has multiple rows per unique item, with various combinations of varA, varB, and varC.

The df looks like this (ID is unique within its column, but it does not identify the unique item).

import numpy as np
import pandas as pd

# NaN marks a missing value for that variable
df = pd.DataFrame({'ID':[1,2,3,4,5],
                   'varA':['a', 'd', 'a', 'm','Z'],
                   'varB':['b', 'e', 'k', 'e',np.nan],
                   'varC':['c', 'f', 'l', np.nan ,'t']})

So in the df here, you can see that:

1 and 3 are the same item, with: {varA: [a], varB: [b, k], varC: [c, l]}.
2 and 4 are also the same item: {varA: [d, m], varB: [e], varC: [f]}

I would like to identify every unique item, give each one a unique id, and store its information.

The code I have written is terribly inefficient:

- Step1: I walk through each row of the dataframe and make a list of each variable
  - When all three variables are new, it's a new item and I give it an id.
  - When any of the variables is already known, I store the new ones in their respective lists and keep walking to the next row
- Step2: I end up with two sets of rows:
  - one with a unique id,
  - the other without a unique id, but whose information can be found in the rows that have a unique id, through varA, varB, or varC. So, quite uglily, I merge successively on each variable to find the unique id.
This works well with 20,000 input rows using varA and varB. On 100,000 rows it runs very slowly and dies before finishing (between Step1 and Step2), and I need it to work on 1,000,000 rows.

Is there a pandas-idiomatic way of doing this?

Recommended answer
  
You can use duplicated(). If you want to keep the first occurrence of each duplicate:
```
myfilter = ~df.varA.duplicated(keep='first') & \
           ~df.varB.duplicated(keep='first') & \
           ~df.varC.duplicated(keep='first')
```
If you don't want to keep any of the duplicated rows:
```
myfilter = ~df.varA.duplicated(keep=False) & \
           ~df.varB.duplicated(keep=False) & \
           ~df.varC.duplicated(keep=False)
```
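To make the difference between the two filters concrete, here is a small sketch that evaluates both on the sample df from the question. The names `keep_first` and `keep_none` are introduced here for illustration only and are not from the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'varA': ['a', 'd', 'a', 'm', 'Z'],
                   'varB': ['b', 'e', 'k', 'e', np.nan],
                   'varC': ['c', 'f', 'l', np.nan, 't']})

# keep='first': a row passes if none of its values was seen in an earlier row
keep_first = (~df.varA.duplicated(keep='first')
              & ~df.varB.duplicated(keep='first')
              & ~df.varC.duplicated(keep='first'))

# keep=False: a row passes only if none of its values occurs in any other row
keep_none = (~df.varA.duplicated(keep=False)
             & ~df.varB.duplicated(keep=False)
             & ~df.varC.duplicated(keep=False))

print(keep_first.tolist())  # [True, True, False, False, True]
print(keep_none.tolist())   # [False, False, False, False, True]
```

With keep='first', rows 1, 2 and 5 pass; with keep=False only row 5 passes, because rows 1 to 4 each share a value with another row.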
Then you can, for example, give these rows an incremental uniqueID:
```
# .loc replaces .ix, which has been removed from recent pandas versions
df.loc[myfilter, 'uniqueID'] = np.arange(myfilter.sum(), dtype='int')
df


   ID varA varB varC  uniqueID
0   1    a    b    c       0.0
1   2    d    e    f       1.0
2   3    a    k    l       NaN
3   4    m    e  NaN       NaN
4   5    Z  NaN    t       2.0
```
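Note that in this output rows 3 and 4 get no uniqueID, although the question treats them as belonging to the same items as rows 1 and 2. One possible way to assign every row to its item is to treat rows that share any varA, varB, or varC value as connected and label the connected groups. The following is a minimal sketch using a union-find (disjoint-set) structure. It is not the answerer's method, and `find`, `itemID`, and `items` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'varA': ['a', 'd', 'a', 'm', 'Z'],
                   'varB': ['b', 'e', 'k', 'e', np.nan],
                   'varC': ['c', 'f', 'l', np.nan, 't']})
cols = ['varA', 'varB', 'varC']

def find(parent, i):
    """Return the root of row position i, compressing the path on the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

# Each row starts as its own group
parent = list(range(len(df)))
for col in cols:
    first_seen = {}  # value -> position of the first row holding it
    for pos, value in enumerate(df[col]):
        if pd.isna(value):
            continue  # missing values never link two rows
        if value in first_seen:
            ra, rb = find(parent, pos), find(parent, first_seen[value])
            parent[max(ra, rb)] = min(ra, rb)  # merge the two groups
        else:
            first_seen[value] = pos

# Turn group roots into consecutive item ids
roots = np.asarray([find(parent, i) for i in range(len(df))])
df['itemID'] = pd.factorize(roots)[0]

# Collect the values seen for each item
items = {item: {c: sorted(g[c].dropna().unique()) for c in cols}
         for item, g in df.groupby('itemID')}
print(df['itemID'].tolist())  # [0, 1, 0, 1, 2]
```

On the sample data, rows 1 and 3 share itemID 0 and rows 2 and 4 share itemID 1, matching the grouping described in the question. The loop visits each cell once, so its cost should grow roughly linearly with the number of rows, but it has not been benchmarked at 1,000,000 rows.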