Pandas: Counting unique values in a DataFrame


Question

We have a DataFrame that looks like this:

> df.loc[:2, :10]
    0   1   2   3   4   5   6   7   8   9   10
0   NaN NaN NaN NaN  6   5  NaN NaN  4  NaN  5
1   NaN NaN NaN NaN  8  NaN NaN  7  NaN NaN  5
2   NaN NaN NaN NaN NaN  1  NaN NaN NaN NaN NaN

We simply want the counts of all unique values in the DataFrame. A simple solution is:

df.stack().value_counts() 
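As a rough sketch of this approach on a small, made-up frame (the name `small` and its values are assumptions for illustration, not the question's data): stacking flattens the frame into one long Series, and `value_counts` tallies each distinct value. The explicit `.dropna()` is included because, as far as I know, whether `stack` discards NaN on its own depends on the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the much larger frame in the question.
small = pd.DataFrame([[np.nan, 6, 5],
                      [8, np.nan, 5],
                      [np.nan, 1, np.nan]])

# Flatten to one long Series, drop missing cells, then tally each distinct value.
counts = small.stack().dropna().value_counts()
print(counts)
```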

However:

1. It looks like stack returns a copy, not a view, which is memory-prohibitive in this case. Is this correct?
2. I want to group the DataFrame by rows, and then get a separate histogram for each group. If we ignore the memory issues with stack and use it for now, how does one do the grouping correctly?

import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

len(d.stack())  # 14
d.stack().groupby(np.arange(4))
AssertionError: Grouper and axis must be same length

The stacked DataFrame has a MultiIndex whose length is less than n_rows * n_columns, because the NaNs are removed.

0  1    1
   3    2
   4    3
1  0    1
   1    1
   2    1
   3    1
   4    3
    ....
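The length mismatch can be checked directly. The sketch below rebuilds the `d` from the question and compares the number of cells with the number of stacked entries; the variable names `n_cells` and `n_present` are mine, and `.dropna()` is an assumption added so the result does not depend on how a given pandas version handles NaN in `stack`.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

stacked = d.stack().dropna()

n_cells = d.shape[0] * d.shape[1]        # 4 rows * 5 columns
n_present = int(d.notna().sum().sum())   # cells that are not NaN

print(n_cells, n_present, len(stacked))
# The index has two levels: the original row label and the column label.
print(stacked.index.nlevels)
```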

This means we can't easily tell how to build our grouping. It would be much better to operate on just the first level, but then I'm stuck on how to apply the grouping I actually want.

d.stack().groupby(level=0).groupby(list('aabb'))
KeyError: 'a'
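One possible way to apply a row grouping to the stacked Series is to translate each entry's row label (level 0 of the MultiIndex) into its group label and group on that, rather than chaining two `groupby` calls. This is only a sketch: `row_group` is a hypothetical mapping from row label to group, and `.dropna()` is added so the behaviour does not hinge on the pandas version.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

# Hypothetical mapping: rows 0 and 1 belong to group "a", rows 2 and 3 to "b".
row_group = pd.Series(list("aabb"), index=d.index)

stacked = d.stack().dropna()

# Look up the group of every stacked entry through its original row label.
labels = stacked.index.get_level_values(0).map(row_group).to_numpy()

# One histogram per group, returned as a Series indexed by (group, value).
grouped_counts = stacked.groupby(labels).value_counts()
print(grouped_counts)
```

Because the labels are derived from the stacked index itself, the grouper always has the same length as the stacked Series, however many NaNs were dropped.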

A solution that doesn't use stacking:

f = lambda x: pd.value_counts(x.values.ravel())
d.groupby(list('aabb')).apply(f)
a  1    4
   3    2
   2    1
b  2    4
   3    2
   1    1
dtype: int64
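A variant of the same idea, sketched with an explicit loop over the groups instead of `apply`. It wraps the flattened values in a Series and calls the `value_counts` method, since the top-level `pd.value_counts` function is, to my knowledge, deprecated in recent pandas releases. The helper name `flat_counts` is hypothetical.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

def flat_counts(block):
    """Count the non-missing values of a sub-frame, ignoring its shape."""
    return pd.Series(block.to_numpy().ravel()).value_counts()

# One histogram per row group; the grouper has one label per row of d.
row_labels = np.array(list("aabb"))
histograms = {label: flat_counts(rows) for label, rows in d.groupby(row_labels)}

for label, hist in histograms.items():
    print(label)
    print(hist)
```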

It looks clunky, though. If there's a better option, I'm happy to hear it.

Dan's comment revealed I had a typo, though correcting it still doesn't get us to the finish line.

Recommended answer

I think you are doing a row/column-wise operation, so you can use apply:

In [11]: d.apply(pd.Series.value_counts, axis=1).fillna(0)
Out[11]: 
   1  2  3
0  1  1  1
1  4  0  1
2  1  1  1
3  0  4  1
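For reference, here is the same call as a self-contained sketch run against the `d` defined in the question (which has NaN in the first column of every row, so its counts differ from the output shown above). The trailing `.astype(int)` is my addition, since filling the gaps with 0 otherwise leaves floating-point counts.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

# Count values within each row; values absent from a row become NaN, then 0.
per_row = d.apply(pd.Series.value_counts, axis=1).fillna(0).astype(int)
print(per_row)
```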

Note: At the time of writing, a value_counts DataFrame method was in the works for 0.14, which was expected to make this more efficient and more concise.

It's worth noting that the pandas value_counts function also works on a NumPy array, so you can pass it the values of the DataFrame (flattened to a 1-d array with np.ravel, which returns a view where possible):

In [21]: pd.value_counts(d.values.ravel())
Out[21]: 
2    6
1    6
3    4
dtype: int64
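Whether the flattened array really is a view depends on how the frame stores its data, so it may be worth checking rather than assuming. The sketch below uses `np.shares_memory` for that; the helper `flat_shares_memory` and the frame `homogeneous` are hypothetical names introduced for illustration, and the printed results can vary with the pandas version and the column dtypes, so no particular outcome is claimed.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

# A frame built from a single float array, for comparison.
homogeneous = pd.DataFrame(np.array([[np.nan, 1.0, 2.0],
                                     [3.0, np.nan, 1.0]]))

def flat_shares_memory(frame):
    """Report whether the flattened values overlap the frame's own array."""
    return bool(np.shares_memory(frame.to_numpy(), frame.to_numpy().ravel()))

print("question frame:", flat_shares_memory(d))
print("single-dtype frame:", flat_shares_memory(homogeneous))

# The counts themselves, via the Series method rather than pd.value_counts.
flat_counts = pd.Series(d.to_numpy().ravel()).value_counts()
print(flat_counts)
```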

Also, you were pretty close to getting this right, but you'd need to stack and unstack:

In [22]: d.stack().groupby(level=0).apply(pd.Series.value_counts).unstack().fillna(0)
Out[22]: 
   1  2  3
0  1  1  1
1  4  0  1
2  1  1  1
3  0  4  1
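A self-contained sketch of the stack/unstack route on the question's `d`, using the `value_counts` method of the grouped Series in place of `apply`, and `unstack(fill_value=0)` in place of `fillna(0)`. These substitutions and the explicit `.dropna()` are my assumptions about what is convenient in current pandas, not part of the original answer.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

stacked = d.stack().dropna()

# Group by the original row label, count values, then pivot values into columns.
table = stacked.groupby(level=0).value_counts().unstack(fill_value=0)
print(table)
```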

This error seems somewhat self-explanatory: the grouper has 4 entries but the stacked Series has 16 (14 for the corrected d in the question):

len(d.stack())  # 16
d.stack().groupby(np.arange(4))
AssertionError: Grouper and axis must be same length

Perhaps you want to pass:

In [23]: np.repeat(np.arange(4), 4)
Out[23]: array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])
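The array above has 16 entries, so it only lines up with the stacked Series when every row contributes four values. For the `d` in the question, where NaNs shorten the stacked result to 14, a grouper of matching length could be built by repeating each row position by that row's count of non-missing cells. This is a sketch under the assumption that stacking keeps the values in row-major order; the names `per_row_present` and `grouper` are mine.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([[np.nan, 1, np.nan, 2, 3],
                  [np.nan, 1, 1, 1, 3],
                  [np.nan, 1, np.nan, 2, 3],
                  [np.nan, 2, 2, 2, 3]])

stacked = d.stack().dropna()

# Number of non-missing cells in each row: these are the repeat counts.
per_row_present = d.notna().sum(axis=1).to_numpy()

# Row position repeated once per surviving value in that row.
grouper = np.repeat(np.arange(len(d)), per_row_present)

print(grouper)
print(stacked.groupby(grouper).size())
```

In practice `stacked.index.get_level_values(0)` carries the same information and avoids relying on the ordering assumption.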
