Split dataframe output based on values
Question
This post covered Modification of a function to return a dataframe with specified values, and I would like to further modify the output. The current function and its vectorized version take all pairwise combinations of columns, subtract them, and return the relevant data accordingly.
Sample and test data:
import pandas as pd
import numpy as np
from itertools import combinations

df2 = pd.DataFrame(
    {'AAA': [80, 5, 6],
     'BBB': [85, 20, 30],
     'CCC': [100, 50, 25],
     'DDD': [98, 50, 25],
     'EEE': [103, 50, 25],
     'FFF': [105, 50, 25],
     'GGG': [109, 50, 25]})
df2
AAA BBB CCC DDD EEE FFF GGG
0 80 85 100 98 103 105 109
1 5 20 50 50 50 50 50
2 6 30 25 25 25 25 25
v = df2.values
df3 = df2.mask((np.abs(v[:, :, None] - v[:, None]) <= 5).sum(-1) <= 1)
df3
AAA BBB CCC DDD EEE FFF GGG
0 80.0 85.0 100 98 103 105 109
1 NaN NaN 50 50 50 50 50
2 NaN 30.0 25 25 25 25 25
All values within thresh (5 here) are returned on a per-row basis via np.abs(...) <= 5.
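The broadcasting behind that mask can be sketched for a single row (my own illustration, not part of the original post): a value survives only if at least one *other* value in the row lies within thresh of it.

```python
import numpy as np

row = np.array([5, 20, 50, 50, 50, 50, 50])  # row 1 of df2
thresh = 5

# pairwise absolute differences: diff[i, j] = |row[i] - row[j]|
diff = np.abs(row[:, None] - row[None, :])

# count how many row members lie within thresh of each value
# (the count always includes the value itself)
within = (diff <= thresh).sum(-1)

# keep a value only if some OTHER value is also within thresh
keep = within > 1
print(keep)  # [False False  True  True  True  True  True]
```

This is why 5 and 20 become NaN in row 1 of df3 while the 50s survive.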
What needs to change?
On the first row of df3 there are two clusters of values within thresh: (80, 85) and (100, 98, 103, 105, 109). They are all valid, but they form two separate groups that are not within thresh of each other. I would like to be able to separate these values based on another thresh value.
I have attempted to demonstrate what I am looking for with the following (flawed) code, included only to show that I am attempting to progress this myself:
thresh = 5  # the threshold used above
df3.mask(df3.apply(lambda x: x >= df3.T.max() - (thresh * 3))) \
   .dropna(thresh=2).dropna(axis=1)
AAA BBB
0 80.0 85.0
df3.mask(~df3.apply(lambda x: x >= df3.T.max() - (thresh * 3))).dropna(axis=1)
CCC DDD EEE FFF GGG
0 100 98 103 105 109
1 50 50 50 50 50
2 25 25 25 25 25
So the output is nice (and close to the desired output), but the way I got there is not so nice...
--- Desired output: ---
I have used multiple rows to demonstrate, but when I use this code only one row will need to be output and split. So the desired output is to return the separate columns, as in this example for row 0:
CCC DDD EEE FFF GGG
0 100 98 103 105 109
and
AAA BBB
0 80.0 85.0
Answer
I felt this was deserving of a separate answer.
I wrote a clustering function that operates on one-dimensional arrays. I know how to vectorize it further to 2 dimensions, but I haven't gotten to it yet. As it is, I use np.apply_along_axis.
This function is described in this answer to this question. I encourage you to follow the links and see the work that went into getting this seemingly simple function.
What it does is find the clusters within an array, defined by margins to the left and right of every point. It sorts, then clusters, then un-sorts.
The delta clustering function
def delta_cluster(a, dleft, dright):
    s = a.argsort()          # permutation that sorts a
    y = s.argsort()          # inverse permutation, to restore the original order
    a = a[s]
    rng = np.arange(len(a))
    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng      # no earlier value lies within dleft
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng       # the previous value cannot reach this one
    return (starts & ends).cumsum()[y]   # running count of boundaries = labels
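As a quick sanity check (my own example, not from the original answer), running delta_cluster on the first row of df2 with a margin of 10 on each side labels the (80, 85) pair as one cluster and the remaining values as another. The function definition is repeated here so the snippet runs on its own:

```python
import numpy as np

def delta_cluster(a, dleft, dright):
    s = a.argsort()
    y = s.argsort()
    a = a[s]
    rng = np.arange(len(a))
    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng
    return (starts & ends).cumsum()[y]

row = np.array([80, 85, 100, 98, 103, 105, 109])
labels = delta_cluster(row, 10, 10)
print(labels)  # [1 1 2 2 2 2 2]: (80, 85) vs (100, 98, 103, 105, 109)
```

Note the labels come back in the original (unsorted) order, which is what lets them line up with the DataFrame columns below.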
Solving the problem at hand
Use the cluster function for each row in df2 with np.apply_along_axis and construct a DataFrame named clusters that mirrors the index and columns of df2. Then stack to get a Series, which will make it easier to manipulate later.
clusters = pd.DataFrame(
    np.apply_along_axis(delta_cluster, 1, df2.values, 10, 10),
    df2.index, df2.columns).stack()
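To see what this intermediate Series encodes, unstacking it back into a frame shows one cluster label per cell (my own sketch, repeating df2 and delta_cluster so it runs standalone):

```python
import numpy as np
import pandas as pd

def delta_cluster(a, dleft, dright):
    s = a.argsort()
    y = s.argsort()
    a = a[s]
    rng = np.arange(len(a))
    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng
    return (starts & ends).cumsum()[y]

df2 = pd.DataFrame(
    {'AAA': [80, 5, 6], 'BBB': [85, 20, 30], 'CCC': [100, 50, 25],
     'DDD': [98, 50, 25], 'EEE': [103, 50, 25], 'FFF': [105, 50, 25],
     'GGG': [109, 50, 25]})

clusters = pd.DataFrame(
    np.apply_along_axis(delta_cluster, 1, df2.values, 10, 10),
    df2.index, df2.columns).stack()

print(clusters.unstack())
#    AAA  BBB  CCC  DDD  EEE  FFF  GGG
# 0    1    1    2    2    2    2    2
# 1    1    2    3    3    3    3    3
# 2    1    2    2    2    2    2    2
```

Row 0 splits into labels 1 and 2; in rows 1 and 2 the isolated values (5, 20, and 6) get labels of their own.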
This describes the next block of code.
- I need to keep the row information of df2 when I do a groupby.
- Use transform to get the size of the cluster for each row.
- stack the values of df2 and append the cluster values as part of the index. This enables the separation you are looking for.
- mask val where size is equal to 1; these are singleton clusters.
lvl0 = clusters.index.get_level_values(0)
size = clusters.groupby([lvl0, clusters]).transform('size')
val = df2.stack().to_frame('value').set_index(clusters, append=True).value
val.mask(size.values == 1).dropna().unstack(1)
AAA BBB CCC DDD EEE FFF GGG
0 1 80.0 85.0 NaN NaN NaN NaN NaN
2 NaN NaN 100.0 98.0 103.0 105.0 109.0
1 3 NaN NaN 50.0 50.0 50.0 50.0 50.0
2 2 NaN 30.0 25.0 25.0 25.0 25.0 25.0
This matches your results except I split out the first row into two rows.
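If you then want each cluster of row 0 as a separate frame, as in the desired output, one way (my own sketch building on the answer's pipeline, with all the pieces repeated so it runs standalone) is to group the unstacked result by the cluster level and drop the all-NaN columns:

```python
import numpy as np
import pandas as pd

def delta_cluster(a, dleft, dright):
    s = a.argsort()
    y = s.argsort()
    a = a[s]
    rng = np.arange(len(a))
    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng
    return (starts & ends).cumsum()[y]

df2 = pd.DataFrame(
    {'AAA': [80, 5, 6], 'BBB': [85, 20, 30], 'CCC': [100, 50, 25],
     'DDD': [98, 50, 25], 'EEE': [103, 50, 25], 'FFF': [105, 50, 25],
     'GGG': [109, 50, 25]})

clusters = pd.DataFrame(
    np.apply_along_axis(delta_cluster, 1, df2.values, 10, 10),
    df2.index, df2.columns).stack()

lvl0 = clusters.index.get_level_values(0)
size = clusters.groupby([lvl0, clusters]).transform('size')
val = df2.stack().to_frame('value').set_index(clusters, append=True).value
out = val.mask(size.values == 1).dropna().unstack(1)

# split row 0 into one frame per cluster label
frames = [grp.dropna(axis=1) for _, grp in out.loc[[0]].groupby(level=1)]
for f in frames:
    print(f)
```

Here frames[0] holds the (AAA, BBB) group and frames[1] the (CCC..GGG) group, which matches the two frames in the desired output.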
AAA BBB CCC DDD EEE FFF GGG
0 80.0 85.0 100 98 103 105 109
1 NaN NaN 50 50 50 50 50
2 NaN 30.0 25 25 25 25 25