GroupBy columns on column header prefix

Question

I have a DataFrame whose column names each start with one of a fixed list of prefixes. I want to sum the values in the DataFrame, grouped by the columns that share the same prefix.

import pandas as pd

df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]],
                  columns=['abc', 'abd', 'wxy', 'wxz'])
prefixes = ['ab','wx']
df
    abc abd wxy wxz
0   1   2   3   4
1   1   2   3   4
2   1   2   3   4
3   1   2   3   4

The only way I could figure out was to loop through the prefix list, select the columns whose names start with that prefix, and sum the results.

results = []
for p in prefixes:
    # Sum every value in the columns whose names start with this prefix.
    results.append([p, df.loc[:, df.columns.str.startswith(p)].values.sum()])
results = pd.DataFrame(results)
results.set_index(keys=[0], drop=True).T

    ab  wx
1   12  28

I hoped there was a more elegant way to do it, perhaps with groupby(), but I couldn't figure it out.

Recommended answer

First, determine which prefix each column starts with. Then use that mapping as the grouping key for a column-wise groupby.

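# Map each column to the first prefix it starts with
# (startswith, so a prefix appearing mid-name does not match).
grouper = [next(p for p in prefixes if c.startswith(p)) for c in df.columns]
u = df.groupby(grouper, axis=1).sum()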

   ab  wx
0   3   7
1   3   7
2   3   7
3   3   7
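
As far as I know, the axis=1 argument to DataFrame.groupby is deprecated in pandas 2.1 and later, so the call above may emit a warning or fail on newer versions. A minimal sketch of an equivalent that avoids axis=1, assuming the same df and prefixes as in the question, is to transpose, group the rows, and transpose back:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]] * 4, columns=['abc', 'abd', 'wxy', 'wxz'])
prefixes = ['ab', 'wx']

# Map each column name to the first prefix it starts with.
grouper = [next(p for p in prefixes if c.startswith(p)) for c in df.columns]

# Group the transposed frame by row label instead of passing axis=1,
# then transpose back so the prefixes become columns again.
u = df.T.groupby(grouper).sum().T
print(u)
```

The transpose copies the data, so this is a convenience for small frames rather than a tuned solution.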

Almost there now:

u.sum().to_frame().T

   ab  wx
0  12  28
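
Since only the per-prefix totals are needed, one could also sum each column first and then group the resulting Series by prefix, which sidesteps column-wise grouping entirely. This is a sketch under the same assumptions as the question (every column starts with one of the prefixes). The names totals and result are illustrative, not from the original answer:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]] * 4, columns=['abc', 'abd', 'wxy', 'wxz'])
prefixes = ['ab', 'wx']

# Map each column name to the first prefix it starts with.
grouper = [next(p for p in prefixes if c.startswith(p)) for c in df.columns]

# Column totals as a Series indexed by column name.
totals = df.sum()

# Group the totals by prefix, then reshape into a single-row frame.
result = totals.groupby(grouper).sum().to_frame().T
print(result)
```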


Another option is to vectorize the prefix matching with np.char.startswith and argmax:

import numpy as np

# For each column, the position of the first prefix it starts with.
idx = np.char.startswith(
    df.columns.values[:, None].astype(str), prefixes).argmax(1)

(pd.Series(df.groupby(idx, axis=1).sum().sum().values, index=prefixes)
   .to_frame()
   .transpose())

   ab  wx
0  12  28
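
One caveat with the argmax approach: argmax returns 0 for a row with no True values, so a column that matches none of the prefixes would silently be counted under the first prefix. The sketch below shows one way to guard against that. The extra column 'other' and the names matches, matched, and labels are hypothetical, added only for illustration:

```python
import numpy as np
import pandas as pd

# Same data as the question plus one hypothetical column with no matching prefix.
df = pd.DataFrame([[1, 2, 3, 4, 5]] * 4,
                  columns=['abc', 'abd', 'wxy', 'wxz', 'other'])
prefixes = ['ab', 'wx']

# Boolean matrix: one row per column of df, one column per prefix.
matches = np.char.startswith(
    df.columns.values[:, None].astype(str), prefixes)

# argmax yields 0 for an all-False row, so track which columns matched at all.
matched = matches.any(axis=1)
labels = np.asarray(prefixes)[matches.argmax(axis=1)]

# Sum each column, keep only matched columns, and group the totals by prefix.
totals = df.sum()
result = (totals[matched].groupby(labels[matched]).sum()
          .reindex(prefixes, fill_value=0)
          .to_frame().T)
print(result)
```

The reindex step keeps the output columns in the order of prefixes and fills in 0 for any prefix that matched no column.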
