Selecting rows in a MultiIndexed dataframe


Problem description

I'd like to extract the bins for 'S' individually, where each column (X and Y) is > 0.5, or where multiple bins combined are > 0.5 * the number of rows.

In the example:

For 'AR1', only bin 4 should be selected, because 'X' and 'Y' are > 0.5 (indicated in blue).

For 'PO1', bins 1, 2, 3 and 4 should be selected, because 'X' and 'Y' are > (4 * 0.5) (indicated in yellow).

I've tried this earlier with a for loop, but that didn't work correctly: Selecting multiple (neighboring) rows conditionally

import numpy as np
import pandas as pd

np.random.seed(0)

N = 20
S = ['AR1', 'PO1']

df = pd.DataFrame(
    {'X': np.random.uniform(-1, 1, N),
     'Y': np.random.uniform(-1, 1, N),
     'S': np.random.choice(S, N),
    })

# create quantile bins of 'X' per group in column 'S'
df['bins_X'] = df.groupby('S')['X'].apply(pd.qcut, q=5, labels=np.arange(5))

def func(df):
    # sum 'X' and 'Y' per 'S' and its bins
    df1 = df.groupby(['S', 'bins_X']).sum()
    new_cols = list(zip(df1.columns.get_level_values(0)))
    df1.columns = pd.MultiIndex.from_tuples(new_cols)
    return df1

print(func(df))

EDIT

What it should look like is the df as shown in the question, but with the rows that do not qualify filtered out. What I check for is this: the values in X and Y > 0.5 for any row (bin), separately or combined. Combinations of rows are only consecutive: 2, 3, 4 or 5 rows combined.

I.e., the combinations of rows starting at 0 are then: 0+1, 0+1+2, 0+1+2+3 and 0+1+2+3+4. For 1: 1+2, 1+2+3 and 1+2+3+4, etc.

Multiple rows would have to sum to the number of rows x 0.5; for rows 0 to 4, for example, X and Y would each have to sum to > 2.5.
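
To make the rule concrete, here is a minimal brute-force sketch (my own illustration, not from the original post) that lists every consecutive run of bins in one group whose X and Y sums both exceed 0.5 times the number of rows in the run. `qualifying_windows` is a hypothetical helper; `g` stands for the per-bin sums of a single 'S' group, e.g. func(df).loc['AR1'].

def qualifying_windows(g, threshold=0.5):
    # g: per-bin sums of X and Y for one 'S' group (illustration only)
    runs = []
    n = len(g)
    for start in range(n):
        for stop in range(start + 1, n + 1):          # consecutive bins start .. stop-1
            window = g.iloc[start:stop]
            k = stop - start                          # number of rows in the run
            if (window.sum() > k * threshold).all():  # both columns must exceed k * 0.5
                runs.append(list(window.index))
    return runs

runs = qualifying_windows(func(df).loc['PO1'])        # consecutive bin runs that satisfy the rule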

@JohnE and @piRSquared, both your solutions work; which one, however, would work better when there are other columns in the dataframe that should not be evaluated?

In addition, what if I wanted to add an additional condition to your solutions?

@piRSquared, when subsetting some columns I only get those returned, whereas I would need all of them, not only the subset.

Could you help? Thanks.

Answer

This is a vectorized approach with only one loop at the top level (groupby.apply).

def find_window(df):
    v = df.values
    # prepend a row of zeros so that s[c] - s[r] is the sum of rows r .. c-1
    s = np.vstack([np.zeros((1, v.shape[1])), v.cumsum(0)])

    threshold = .5

    # all (start, end) position pairs with start < end
    r, c = np.triu_indices(s.shape[0], 1)
    d = (c - r)[:, None]                  # window lengths
    e = s[c] - s[r]                       # window sums
    mask = (e / d > threshold).all(1)     # windows whose mean exceeds the threshold in every column
    rng = np.arange(mask.shape[0])

    if mask.any():
        # among the qualifying windows, take the longest one
        idx = rng[mask][d[mask].argmax()]

        i0, i1 = r[idx], c[idx]
        return pd.DataFrame(
            v[i0:i1],
            df.loc[df.name].index[i0:i1],
            df.columns
        )

# columns that I care about
cols = ['X', 'Y']

df1 = func(df)                            # the binned sums from the question
df1.groupby(level=0)[cols].apply(find_window)
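
As a quick sanity check (my addition, reusing df1 = func(df) and cols from above): every window that find_window returns has a mean above the 0.5 threshold in both columns, so the per-group means of the result should all exceed 0.5.

out = df1.groupby(level=0)[cols].apply(find_window)
print(out)
print(out.groupby(level=0).mean())   # each value should be > 0.5 for every group that returned a window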


Explanation

strategy

  • numpy.triu_indices: I need to evaluate each possible window for a rolling mean greater than some threshold. I'm going to capture each possible window by starting from position 0 to 0, then 0 to 1, then ... then 1 to 1, 1 to 2 ... and so on. But I must always start at a position before I finish. I can access these combinations with numpy.triu_indices (a small standalone demo follows this list).
  • cumsum: It would be a little tricky (but doable) to get the expanded arrays specified by each combination of indices that I get from np.triu_indices. A better way is to calculate the cumsum and take the difference from one index to the next.
  • I have to prepend zeros to my cumsum so that I can take the difference for the first row.
  • But sums are not means. I need to divide by the number of rows to get the means. Conveniently, the difference between the end and start positions is exactly the number of rows, and thus the appropriate number to divide the sums by in order to calculate the means.
  • Now that I have the means, e / d, I check which are > threshold and identify which combinations of start and end positions have means greater than the threshold for both columns.
  • I then identify the combination with the greatest number of rows among those that have means greater than the threshold.
  • I unwind the positions and reconstruct a dataframe.
  • groupby and apply... QED
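
A small, self-contained demonstration (my own sketch, not part of the original answer) of the two core ideas above: numpy.triu_indices to enumerate every (start, end) pair, and the zero-padded cumsum difference to get window sums and means.

import numpy as np

v = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])

# zero-padded cumulative sum: s[k] holds the column-wise sum of the first k rows
s = np.vstack([np.zeros((1, v.shape[1])), v.cumsum(0)])

# all (start, end) pairs with start < end over the padded positions
r, c = np.triu_indices(s.shape[0], 1)
print(list(zip(r, c)))        # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

sums = s[c] - s[r]            # sum of rows r .. c-1 for every pair
lengths = (c - r)[:, None]    # number of rows in each window
means = sums / lengths        # per-column window means
print(means)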

time test

with more data

np.random.seed(0)

N = 300
S = ['AR1', 'PO1', 'AR2', 'PO2', 'AR3', 'PO3']

df = pd.DataFrame(
    {'X':np.random.uniform(-1,1,N),
     'Y':np.random.uniform(-1,1,N),
     'S':np.random.choice(S,N),
    })

df['bins_X'] = df.groupby('S')['X'].apply(pd.qcut, q=20, labels=np.arange(20))    # create bins per column 'S'

def func(df):                                                                   # create function to group per 'S' and their bins
    df1 = df.groupby(['S','bins_X']).sum()
    new_cols= list(zip(df1.columns.get_level_values(0)))
    df1.columns = pd.MultiIndex.from_tuples(new_cols)
    return df1

df1 = func(df)

The time difference is even more pronounced.
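
A hedged sketch of how the timing on this larger frame could be reproduced (the timeit usage is my addition, and only the vectorized solution above is timed here, not the alternative it was compared against):

from timeit import timeit

# time the groupby.apply + find_window approach on the 300-row / 6-group frame
t = timeit(lambda: df1.groupby(level=0)[cols].apply(find_window), number=10)
print('find_window via groupby.apply: %.4f s per run' % (t / 10))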
