When is it appropriate to use df.value_counts() vs df.groupby('...').count()?
Question
I've heard that in Pandas there are often multiple ways to do the same thing, but I was wondering –
If I'm trying to group data by a value within a specific column and count the number of items with that value, when does it make sense to use df.groupby('colA').count() and when does it make sense to use df['colA'].value_counts()?
Answer
There is a difference. value_counts returns:

The resulting object will be in descending order so that the first element is the most frequently-occurring element.
but count does not; it sorts the output by the index (created from the column passed to groupby('col')).
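A minimal sketch of the two sort orders on a throwaway Series (the name s is illustrative, not from the question):

```python
import pandas as pd

s = pd.Series(['b', 'a', 'b', 'c', 'b', 'a'])

# value_counts: ordered by frequency, descending (b first with 3 hits)
print(s.value_counts())

# groupby(...).count() on the same data: ordered by the group labels (a, b, c)
print(s.groupby(s).count())
```

Both report the same counts; only the ordering of the result differs.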
df.groupby('colA').count() aggregates all columns of df with the function count, so it counts values excluding NaNs.
So if you need the count of only one column:
df.groupby('colA')['colA'].count()
Sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'colB':list('abcdefg'),
                   'colC':[1,3,5,7,np.nan,np.nan,4],
                   'colD':[np.nan,3,6,9,2,4,np.nan],
                   'colA':['c','c','b','a',np.nan,'b','b']})
print (df)
colA colB colC colD
0 c a 1.0 NaN
1 c b 3.0 3.0
2 b c 5.0 6.0
3 a d 7.0 9.0
4 NaN e NaN 2.0
5 b f NaN 4.0
6 b g 4.0 NaN
print (df['colA'].value_counts())
b 3
c 2
a 1
Name: colA, dtype: int64
print (df.groupby('colA').count())
colB colC colD
colA
a 1 1 1
b 3 2 2
c 2 2 1
print (df.groupby('colA')['colA'].count())
colA
a 1
b 3
c 2
Name: colA, dtype: int64
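As the outputs above suggest, once you sort value_counts by its index the two approaches agree on colA; a quick sketch reusing only that column (note that value_counts, like groupby, skips NaN by default via dropna=True):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': ['c', 'c', 'b', 'a', np.nan, 'b', 'b']})

# Reorder value_counts by value instead of frequency to compare with groupby
vc = df['colA'].value_counts().sort_index()
gc = df.groupby('colA')['colA'].count()

print(vc.to_dict())
print(gc.to_dict())
```

Both print the same mapping of values to non-NaN counts, so for a single column the choice mostly comes down to which ordering you want.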