Sum values of columns starting with the same string in a pandas DataFrame
Question
I have a dataframe with about 100 columns that looks like this:
Id Economics-1 English-107 English-2 History-3 Economics-zz Economics-2 \
0 56 1 1 0 1 0 0
1 11 0 0 0 0 1 0
2 6 0 0 1 0 0 1
3 43 0 0 0 1 0 1
4 14 0 1 0 0 1 0
Histo Economics-51 Literature-re Literatureu4
0 1 0 1 0
1 0 0 0 1
2 0 0 0 0
3 0 1 1 0
4 1 0 0 0
My goal is to leave only global categories -- English, History, Literature -- and write the sum of the value of their components, respectively, in this dataframe. For instance, "English" would be the sum of "English-107" and "English-2":
Id Economics English History Literature
0 56 1 1 2 1
1 11 1 0 0 1
2 6 0 1 1 0
3 43 2 0 1 1
4 14 0 1 1 0
For this purpose, I have tried two methods. First method:
df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]
Second method:
df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
df['History'] = df[filter_col].sum(axes=1)
print df['History', df[filter_col]]
However, both give the error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
My question is: how can I debug this error, or is there another solution to my problem? Note that I have a rather large dataframe, with about 100 columns and 400000 rows, so I'm looking for an optimized solution, such as using loc in pandas.
Recommended answer
I'd suggest a different approach: transpose the dataframe, group the rows (your original columns) by their prefix, sum, and transpose again.
Consider the following:
import pandas as pd

# Toy frame: column names share the prefixes 'a' and 'b'
df = pd.DataFrame({
    'a_a': [1, 2, 3, 4],
    'a_b': [2, 3, 4, 5],
    'b_a': [1, 2, 3, 4],
    'b_b': [2, 3, 4, 5],
})
Now
[s.split('_')[0] for s in df.T.index.values]
gives the prefixes of the columns. So
>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
a b
0 3 3
1 5 5
2 7 7
3 9 9
does what you want.
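If only one prefix is needed, a simpler sketch (not part of the original answer) is to select the matching columns and sum across them. It reuses the toy frame from above and assumes the prefix `a_`. Note that the keyword is `axis`, not `axes` as written in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'a_a': [1, 2, 3, 4],
    'a_b': [2, 3, 4, 5],
    'b_a': [1, 2, 3, 4],
    'b_b': [2, 3, 4, 5],
})

# Keep only the columns whose names start with 'a_', then sum each row.
a_total = df.filter(regex='^a_').sum(axis=1)
print(a_total.tolist())
```

This avoids the transposes but has to be repeated once per prefix, so the groupby version is likely more convenient when there are many categories.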
In your case, make sure to split using the '-' character.
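Applied to column names like those in the question, the same transpose/groupby idea might look like the sketch below. The sample frame is a made-up subset of the question's columns. Moving `Id` into the index so that it is not summed is an assumption, not something the original answer specifies. Columns without a '-' (such as `Histo` in the question) would keep their full name as the prefix and need separate handling.

```python
import pandas as pd

# Made-up subset of the question's data, for illustration only.
df = pd.DataFrame({
    'Id': [56, 11, 6],
    'Economics-1': [1, 0, 0],
    'English-107': [1, 0, 0],
    'English-2': [0, 0, 1],
    'History-3': [1, 0, 0],
    'Economics-zz': [0, 1, 0],
})

# Assumption: Id is an identifier, so keep it out of the sums.
values = df.set_index('Id')

# Prefix of each column name: the text before the first '-'.
prefixes = values.columns.str.split('-').str[0]

# Transpose, group the former columns by prefix, sum, transpose back.
result = values.T.groupby(prefixes).sum().T.reset_index()
print(result)
```

The result has one column per prefix (sorted alphabetically by the groupby) plus `Id`.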