Split dataframe output based on values


Question

This post covered Modification of a function to return a dataframe with specified values, and I would like to further modify the output. The current function and its vectorized version get all combinations of columns subtracted from each other and return the relevant data accordingly.

示例和测试数据:

import pandas as pd
import numpy as np
from itertools import combinations

df2 = pd.DataFrame(
       {'AAA' : [80,5,6], 
        'BBB' : [85,20,30],
        'CCC' : [100,50,25],
        'DDD' : [98,50,25],
        'EEE' : [103,50,25],
        'FFF' : [105,50,25],
        'GGG' : [109,50,25]});

df2

   AAA  BBB  CCC  DDD  EEE  FFF  GGG
0   80   85  100   98  103  105  109
1    5   20   50   50   50   50   50
2    6   30   25   25   25   25   25

v = df2.values
df3 = df2.mask((np.abs(v[:, :, None] - v[:, None]) <= 5).sum(-1) <= 1)

df3

    AAA   BBB  CCC  DDD  EEE  FFF  GGG
0  80.0  85.0  100   98  103  105  109
1   NaN   NaN   50   50   50   50   50
2   NaN  30.0   25   25   25   25   25

All values within thresh (5 here) are returned on a per-row basis with np.abs <= 5.
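
To make the broadcasting more concrete, here is what that expression computes for a single row (a minimal illustration using row 0 of df2 and the thresh of 5; the intermediate variable names are mine):

row = df2.values[0]                          # [ 80  85 100  98 103 105 109]
pairwise = np.abs(row[:, None] - row[None])  # 7x7 matrix of absolute differences
neighbours = (pairwise <= 5).sum(-1)         # count of values within 5 of each entry (itself included)
keep = neighbours > 1                        # an entry survives the mask if it has at least one neighbour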

What needs to change?

On the first row of df3 there are two clusters of values within thresh: (80, 85) and (100, 98, 103, 105, 109). They are all valid, but they form two separate groups because the groups are not within thresh of each other. I would like to be able to separate these values based on another thresh value.

I have attempted to demonstrate what I am looking to do with the following (flawed) code, and I am only including it to show that I am attempting to progress this myself.

thresh = 5  # the threshold used above

df3.mask(df3.apply(lambda x : x >= df3.T.max() \
                   - (thresh * 3))).dropna(thresh=2).dropna(axis=1)


    AAA   BBB
0  80.0  85.0

df3.mask(~df3.apply(lambda x : x >= df3.T.max() - (thresh * 3))).dropna(axis=1)


    CCC DDD EEE FFF GGG
0   100 98  103 105 109
1   50  50  50  50  50
2   25  25  25  25  25

So my output is nice (and close to the desired output), but the way I got it is not so nice...

Desired output:

I have used multiple rows to demonstrate, but when I use this code only one row will need to be output and split. So the desired output is to return the separate columns, as in this example for row 0.

    CCC DDD EEE FFF GGG
0   100 98  103 105 109

    AAA   BBB
0  80.0  85.0

Answer

I felt this was deserving of a separate answer.

I wrote a clustering function that operates on one-dimensional arrays. I know how to vectorize it further to two dimensions, but I haven't gotten to it yet. As it is, I use np.apply_along_axis.

This function is described in this answer to this question. I encourage you to follow the links and see the work that went into getting this seemingly simple function.

What it does is find the clusters within an array, defined by margins to the left and right of every point. It sorts, then clusters, then un-sorts.

The delta_cluster function

def delta_cluster(a, dleft, dright):
    # sort the values, remembering how to undo the sort later
    s = a.argsort()
    y = s.argsort()
    a = a[s]
    rng = np.arange(len(a))

    # a sorted value opens a cluster if no earlier value lies within dleft of it
    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng

    # ...and if the previous value's right margin (dright) does not reach it
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng

    # cumulative count of cluster starts labels every value; un-sort with y
    return (starts & ends).cumsum()[y]
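
As a quick check, applying it to the first row of df2 with margins of 10 (the margins used below) separates 80 and 85 from the remaining values:

row0 = df2.values[0]          # [ 80  85 100  98 103 105 109]
delta_cluster(row0, 10, 10)   # -> array([1, 1, 2, 2, 2, 2, 2])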


Solving the problem at hand

Use the cluster function on each row of df2 with np.apply_along_axis and construct a DataFrame named clusters that mirrors the index and columns of df2. Then stack to get a Series, which will make it easier to manipulate later.

clusters = pd.DataFrame(
    np.apply_along_axis(delta_cluster, 1, df2.values, 10, 10),
    df2.index, df2.columns).stack()
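
To see the labels it produces (a quick sanity check; these follow from the sample data and the margins of 10 above), unstack the Series back into the shape of df2:

print(clusters.unstack())
#    AAA  BBB  CCC  DDD  EEE  FFF  GGG
# 0    1    1    2    2    2    2    2
# 1    1    2    3    3    3    3    3
# 2    1    2    2    2    2    2    2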


This describes the next block of code:

  • I need to keep the row information of df2 when I do a groupby.
  • Use transform to get the size of clusters for each row.
  • stack the values of df2 and append the cluster values as part of the index. This enables the separation you are looking for.
  • mask val where size is equal to 1. These are singleton clusters.

# row label of each value in the stacked clusters Series
lvl0 = clusters.index.get_level_values(0)
# size of each (row, cluster label) group, aligned with the stacked values
size = clusters.groupby([lvl0, clusters]).transform('size')

# stack df2 and append the cluster labels as an extra level of the index
val = df2.stack().to_frame('value').set_index(clusters, append=True).value

# mask the singleton clusters, drop them, and pivot the columns back out
val.mask(size.values == 1).dropna().unstack(1)

      AAA   BBB    CCC   DDD    EEE    FFF    GGG
0 1  80.0  85.0    NaN   NaN    NaN    NaN    NaN
  2   NaN   NaN  100.0  98.0  103.0  105.0  109.0
1 3   NaN   NaN   50.0  50.0   50.0   50.0   50.0
2 2   NaN  30.0   25.0  25.0   25.0   25.0   25.0

This matches your results, except that I split the first row out into two rows.

     AAA   BBB    CCC   DDD    EEE    FFF    GGG
0   80.0  85.0    100    98    103    105    109
1    NaN   NaN     50    50     50     50     50
2    NaN  30.0     25    25     25     25     25
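
If you specifically want the two separate frames for row 0, as in the desired output, one option (a sketch building on the result above, not part of the original answer) is to select row 0 and drop the all-NaN columns within each cluster:

result = val.mask(size.values == 1).dropna().unstack(1)

# one frame per cluster label in row 0, keeping only that cluster's columns
for _, part in result.loc[[0]].groupby(level=1):
    print(part.dropna(axis=1))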

