Number of unique values per column by group
Question
Consider the following DataFrame:
A B E
0 bar one 1
1 bar three 1
2 flux six 1
3 flux three 2
4 foo five 2
5 foo one 1
6 foo two 1
7 foo two 2
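For anyone who wants to reproduce the problem, the frame above can be rebuilt as follows. This is a minimal sketch: the values are copied from the table, and the variable name df matches the one used in the snippets in this thread.

```python
import pandas as pd

# Rebuild the example frame; values are copied from the table above.
df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "flux", "foo", "foo", "foo", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two", "two"],
    "E": [1, 1, 1, 2, 2, 1, 1, 2],
})
print(df)
```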
I would like to find, for each value of A, the number of unique values in the other columns.
I thought the following would do it:
df.groupby('A').apply(lambda x: x.nunique())
But I get an error:
AttributeError: 'DataFrame' object has no attribute 'nunique'
I also tried:
df.groupby('A').nunique()
But that gives an error too:
AttributeError: 'DataFrameGroupBy' object has no attribute 'nunique'
Finally I tried:
df.groupby('A').apply(lambda x: x.apply(lambda y: y.nunique()))
which returns:
A B E
A
bar 1 2 1
flux 1 2 2
foo 1 3 2
This seems to be correct. Strangely though, it also returns the column A in the result. Why?
Recommended answer
In the pandas version you are using, the DataFrame object doesn't have nunique; only Series does. You have to pick out which column you want to apply nunique() to. You can do this with a simple dot operator:
df.groupby('A').apply(lambda x: x.B.nunique())
which will print:
A
bar 2
flux 2
foo 3
And doing
df.groupby('A').apply(lambda x: x.E.nunique())
will print:
A
bar 1
flux 2
foo 2
Alternatively, you can do this with one function call using:
df.groupby('A').aggregate({'B': lambda x: x.nunique(), 'E': lambda x: x.nunique()})
which will print:
B E
A
bar 2 1
flux 2 2
foo 3 2
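As a side note, the errors in the question suggest an older pandas. More recent releases do provide nunique on grouped DataFrames, so the per-column lambdas may not be needed. The following is a sketch that assumes such a release; the explicit selection of columns B and E is my choice, made so that the grouping column is left out whatever the version.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "flux", "foo", "foo", "foo", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two", "two"],
    "E": [1, 1, 1, 2, 2, 1, 1, 2],
})

# Assumes a pandas release in which DataFrameGroupBy.nunique exists.
# Selecting B and E explicitly keeps the grouping column A out of the result.
result = df.groupby("A")[["B", "E"]].nunique()
print(result)
```

On such a version this should give the same counts as the aggregate call above.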
To answer your question about why your nested lambda prints the A column as well: when you do a groupby/apply operation, you are iterating through three DataFrame objects, one per group. Each of them is a sub-DataFrame of the original. Applying an operation to a sub-DataFrame applies it to each of its Series, and each sub-DataFrame you are applying nunique() to has three Series: A, B and E.
The first Series evaluated on each sub-DataFrame is the A Series, and since you have done a groupby on A, there is only one unique value in the A Series of each sub-DataFrame. This explains why you ultimately get an A result column with all 1's.
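The explanation above can be checked by walking through the groups by hand. This sketch iterates over the groupby explicitly instead of calling groupby/apply, because newer pandas releases may treat the grouping column differently inside apply. The names per_group and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["bar", "bar", "flux", "flux", "foo", "foo", "foo", "foo"],
    "B": ["one", "three", "six", "three", "five", "one", "two", "two"],
    "E": [1, 1, 1, 2, 2, 1, 1, 2],
})

# Iterating over a groupby yields (key, sub-DataFrame) pairs, and each
# sub-DataFrame still carries the grouping column A.
per_group = {}
for key, sub in df.groupby("A"):
    # Apply nunique to every column (Series) of the sub-DataFrame, A included.
    per_group[key] = sub.apply(lambda col: col.nunique())

# One row per group, one column per original column.
result = pd.DataFrame(per_group).T
print(result)
```

Within each group every row has the same value of A, so the count for A is always 1.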