Apply max to varying-dimension subsets of a pandas DataFrame


Question

I have a DataFrame with an index-like column that contains repeated values. For each index value, I'm trying to get the maximum found in a different column and assign it to a third column, so that any given row shows the maximum value found across all rows that share its index. The data set is very large, so I'd like the operation to be vectorized if possible. For now, I can't get it to work at all:

import pandas as pd

multiindexDF = pd.DataFrame([[1,2,3,3,4,4,4,4],[5,6,7,10,15,11,25,89]]).transpose()
multiindexDF.columns = ['theIndex','theValue']
multiindexDF['maxValuePerIndex'] = 0
uniqueIndices = multiindexDF['theIndex'].unique()
for i in uniqueIndices:
    matchingIndices = multiindexDF['theIndex'] == i
    maxValue = multiindexDF[matchingIndices == i]['theValue'].max()
    multiindexDF.loc[matchingIndices]['maxValuePerIndex'] = maxValue

This fails with a message telling me I should use .loc, even though I'm already using it. I'm not sure what the error means, and I'm not sure how to fix this so that I can vectorize the operation instead of looping through everything.

This is the result I'm looking for:

targetDF = pd.DataFrame([[1,2,3,3,4,4,4,4],[5,6,10,7,15,11,25,89],[5,6,10,10,89,89,89,89]]).transpose()
targetDF

Recommended answer

This looks like a good case for groupby with transform: it computes the maximum value per index group and broadcasts the result back onto the original index (rather than the grouped index):

multiindexDF['maxValuePerIndex'] = multiindexDF.groupby("theIndex")["theValue"].transform("max")
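As a self-contained sketch of that one-liner, the snippet below rebuilds the sample frame from the question and applies the transform. The dictionary-based constructor is just a convenience chosen here; the column names are the ones used in the question.

```python
import pandas as pd

# Sample data from the question, built column-wise for readability.
multiindexDF = pd.DataFrame({
    "theIndex": [1, 2, 3, 3, 4, 4, 4, 4],
    "theValue": [5, 6, 7, 10, 15, 11, 25, 89],
})

# transform("max") returns a Series aligned with the original rows,
# so every row receives the maximum of its own group.
multiindexDF["maxValuePerIndex"] = (
    multiindexDF.groupby("theIndex")["theValue"].transform("max")
)

print(multiindexDF["maxValuePerIndex"].tolist())
# [5, 6, 10, 10, 89, 89, 89, 89]
```

Because the result has the same length and index as the input, it can be assigned directly as a new column with no loop and no merge.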

The reason you're getting the SettingWithCopyWarning is that your .loc call takes a slice of a slice and sets the value there. Note the two pairs of square brackets in:

multiindexDF.loc[matchingIndices]['maxValuePerIndex'] = maxValue

So it tries to assign the value to the intermediate slice rather than to the original DataFrame: you're doing a .loc and then another [] after it in a chain.

So, using your original approach:

for i in uniqueIndices:
    matchingIndices = multiindexDF['theIndex'] == i
    maxValue = multiindexDF.loc[matchingIndices, 'theValue'].max()
    multiindexDF.loc[matchingIndices, 'maxValuePerIndex'] = maxValue

(Notice I've also changed the first lookup to use .loc, where you were incorrectly using the boolean index by comparing the mask to i a second time.)
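To see the difference between the two assignment styles in isolation, here is a minimal sketch on a smaller, made-up frame (the name `df` and the shortened data are assumptions for illustration). The chained form writes to a temporary copy produced by the boolean mask, while the single .loc call with both row and column selectors writes to the original. Warnings are silenced only so the sketch runs quietly; the exact warning raised depends on the pandas version.

```python
import warnings
import pandas as pd

df = pd.DataFrame({"theIndex": [1, 2, 3, 3], "theValue": [5, 6, 7, 10]})
df["maxValuePerIndex"] = 0
mask = df["theIndex"] == 3

# Chained indexing: .loc[mask] builds a temporary copy, then [...] = assigns into it.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.loc[mask]["maxValuePerIndex"] = 10
after_chained = df["maxValuePerIndex"].tolist()

# Single .loc call: rows and column are selected together, so df itself is updated.
df.loc[mask, "maxValuePerIndex"] = 10
after_single = df["maxValuePerIndex"].tolist()

print(after_chained)  # [0, 0, 0, 0]
print(after_single)   # [0, 0, 10, 10]
```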
