Sum values of columns starting with the same string in a pandas dataframe

Question

I have a dataframe with about 100 columns that looks like this:

   Id  Economics-1  English-107  English-2  History-3  Economics-zz  Economics-2  \
0  56            1            1          0          1             0            0
1  11            0            0          0          0             1            0
2   6            0            0          1          0             0            1
3  43            0            0          0          1             0            1
4  14            0            1          0          0             1            0

   Histo  Economics-51  Literature-re  Literatureu4
0      1             0              1             0
1      0             0              0             1
2      0             0              0             0
3      0             1              1             0
4      1             0              0             0

My goal is to keep only the global categories (Economics, English, History, Literature) and to store, for each one, the sum of the values of its component columns. For instance, "English" would be the sum of "English-107" and "English-2":

   Id  Economics  English  History  Literature
0  56          1        1        2           1
1  11          1        0        0           1
2   6          0        1        1           0
3  43          2        0        1           1
4  14          0        1        1           0

For this purpose, I have tried two methods. First method:

df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]

Second method:

df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History'] = df[filter_col].sum(axes=1)
    print df['History', df[filter_col]]

However, both give the error:

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

My question is: how can I debug this error, or is there another way to solve my problem? Note that I have a rather large dataframe, with about 100 columns and 400,000 rows, so I'm looking for an optimized solution, like using loc in pandas.

Recommended answer

I'd suggest a different approach: transpose the dataframe, group the rows (your original columns) by their prefix, sum each group, and transpose back.

Consider the following:

import pandas as pd

# Toy frame: two columns with prefix 'a', two with prefix 'b'
df = pd.DataFrame({
    'a_a': [1, 2, 3, 4],
    'a_b': [2, 3, 4, 5],
    'b_a': [1, 2, 3, 4],
    'b_b': [2, 3, 4, 5],
})

Now

[s.split('_')[0] for s in df.T.index.values]

gives the prefix of each column. So

>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
    a   b
0   3   3
1   5   5
2   7   7
3   9   9

does what you want.
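Since the question mentions a frame with many rows, here is a minimal sketch of a variant that skips the two transposes. It is not the method described above but a swapped-in alternative: it builds one boolean column mask per prefix and sums the selected columns row-wise. The names `prefixes` and `result` are illustrative, and no claim is made here about which version is faster on real data.

```python
import pandas as pd

# Same toy frame as in the answer
df = pd.DataFrame({
    'a_a': [1, 2, 3, 4],
    'a_b': [2, 3, 4, 5],
    'b_a': [1, 2, 3, 4],
    'b_b': [2, 3, 4, 5],
})

# Prefix of every column, as a pandas Index aligned with df.columns
prefixes = df.columns.str.split('_').str[0]

# For each distinct prefix, select its columns with a boolean mask
# and sum them across each row (axis=1)
result = pd.DataFrame({
    p: df.loc[:, prefixes == p].sum(axis=1)
    for p in prefixes.unique()
})
print(result)
```

On this toy frame the result has one column per prefix, `a` and `b`, with the same row index as `df`.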

In your case, make sure to split on the '-' character instead of '_'.
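As a rough sketch of how this could look on column names like those in the question, the snippet below uses a small made-up subset of the columns (the values are illustrative, not the asker's full data). It assumes `Id` should be kept as an identifier rather than summed, so it is moved into the index before grouping; `values`, `prefixes` and `totals` are hypothetical names.

```python
import pandas as pd

# Small illustrative frame with '-'-separated column names
df = pd.DataFrame({
    'Id': [56, 11, 6],
    'Economics-1': [1, 0, 0],
    'English-107': [1, 0, 0],
    'English-2': [0, 0, 1],
    'Economics-2': [0, 0, 1],
})

# Keep Id out of the sums by moving it into the index
values = df.set_index('Id')

# Prefix of each remaining column, e.g. 'English-107' -> 'English'
prefixes = [name.split('-')[0] for name in values.columns]

# Transpose, group rows by prefix, sum, transpose back, restore Id
totals = values.T.groupby(prefixes).sum().T.reset_index()
print(totals)
```

Note that a column name without a '-' is returned unchanged by `split('-')[0]`, so such a column would end up in a group of its own.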
