Pandas: aggregate when column contains numpy arrays

Question

I'm using a pandas DataFrame in which one column contains numpy arrays. When trying to sum that column via aggregation I get an error stating 'Must produce aggregated value'.

For example:

import pandas as pd
import numpy as np

DF = pd.DataFrame([[1,np.array([10,20,30])],
               [1,np.array([40,50,60])], 
               [2,np.array([20,30,40])],], columns=['category','arraydata'])

This works the way I would expect it to:

DF.groupby('category').agg(sum)

Output:

             arraydata
category 1   [50 70 90]
         2   [20 30 40]

However, since my real data frame has multiple numeric columns, arraydata is not chosen as the default column to aggregate on, and I have to select it manually. Here is one approach I tried:

g=DF.groupby('category')
g.agg({'arraydata':sum})

Here is another:

g=DF.groupby('category')
g['arraydata'].agg(sum)

Both give the same output:

Exception: must produce aggregated value

However if I have a column that uses numeric rather than array data, it works fine. I can work around this, but it's confusing and I'm wondering if this is a bug, or if I'm doing something wrong. I feel like the use of arrays here might be a bit of an edge case and indeed wasn't sure if they were supported. Ideas?

Thanks

Recommended answer

One, perhaps clunkier, way to do it would be to iterate over the GroupBy object (it generates (grouping_value, df_subgroup) tuples). For example, to achieve what you want here, you could do:

grouped = DF.groupby("category")
aggregate = list((k, v["arraydata"].sum()) for k, v in grouped)
new_df = pd.DataFrame(aggregate, columns=["category", "arraydata"]).set_index("category")

This is very similar to what pandas does under the hood anyway (group, then aggregate, then merge the results back together), so you aren't really losing out on much.
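The loop above can also be wrapped in a small helper. This is only a sketch: the name `sum_arrays_by_group` is made up for illustration, and it assumes every array in a group has the same length so that the arrays can be stacked with `np.stack` and summed along the first axis.

```python
import numpy as np
import pandas as pd


def sum_arrays_by_group(df, key, col):
    """Sum the equal-length arrays stored in `col` within each group of `key`.

    Hypothetical helper, not part of pandas.
    """
    rows = []
    for group_value, subframe in df.groupby(key):
        # Stack the group's arrays into a 2-D array, then sum down the rows.
        stacked = np.stack(subframe[col].tolist())
        rows.append((group_value, stacked.sum(axis=0)))
    return pd.DataFrame(rows, columns=[key, col]).set_index(key)


DF = pd.DataFrame(
    [[1, np.array([10, 20, 30])],
     [1, np.array([40, 50, 60])],
     [2, np.array([20, 30, 40])]],
    columns=["category", "arraydata"],
)

summed = sum_arrays_by_group(DF, "category", "arraydata")
```

Because the summing happens in plain NumPy, outside of `agg`, the ndarray check discussed next never comes into play.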

The problem here is that pandas explicitly checks that the output is not an ndarray, because it wants to intelligently reshape your array, as you can see in this snippet from _aggregate_named, where the error occurs.

def _aggregate_named(self, func, *args, **kwargs):
    result = {}

    for name, group in self:
        group.name = name
        output = func(group, *args, **kwargs)
        if isinstance(output, np.ndarray):
            raise Exception('Must produce aggregated value')
        result[name] = self._try_cast(output, group)

    return result

My guess is that this happens because groupby is explicitly set up to try to intelligently put back together a DataFrame with the same indexes and everything aligned nicely. Since it's rare to have nested arrays in a DataFrame like that, it checks for ndarrays to make sure that you are actually using an aggregate function. In my gut, this feels like a job for Panel, but I'm not sure how to transform it perfectly. As an aside, you can sidestep this problem by converting your output to a list, like this:

DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})

Pandas doesn't complain, because the function now returns a Python list rather than an ndarray (though this is really just cheating around the type check). If you want to convert back to an array, just apply np.array to it.

result = DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})
result["arraydata"] = result["arraydata"].apply(np.array)

How you want to resolve this issue really depends on why you have columns of ndarray and whether you want to aggregate anything else at the same time. That said, you can always iterate over GroupBy like I've shown above.
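If other columns need aggregating at the same time, one possible arrangement is to aggregate the numeric columns the usual way and join the array sums on afterwards. This is a sketch under assumptions: the `weight` column is hypothetical and only there for illustration, and the arrays within each group are assumed to have equal length.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": [1, 1, 2],
    "weight": [0.5, 1.5, 2.0],  # hypothetical numeric column
    "arraydata": [
        np.array([10, 20, 30]),
        np.array([40, 50, 60]),
        np.array([20, 30, 40]),
    ],
})

grouped = df.groupby("category")

# Ordinary numeric aggregation for the non-array column.
numeric = grouped[["weight"]].sum()

# Array column handled by iterating over the groups explicitly.
array_sums = pd.Series(
    {k: np.stack(v["arraydata"].tolist()).sum(axis=0) for k, v in grouped},
    name="arraydata",
)
array_sums.index.name = "category"

# Both results are indexed by category, so they can be joined directly.
combined = numeric.join(array_sums)
```

Keeping the two aggregations separate means the numeric columns still go through the regular groupby path, and only the array column needs the manual loop.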
