Using two grouping designations to create one 'combined' grouping variable

Problem description

Given a data.frame:

df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))

#> df
#   grp1 grp2
#1     1    1
#2     1    2
#3     1    3
#4     2    3
#5     2    4
#6     2    5
#7     3    6
#8     3    7
#9     3    8
#10    4    6
#11    4    9
#12    4   10

Both columns are grouping variables: all rows with grp1 equal to 1 are known to belong together, likewise all rows with grp1 equal to 2, and so on. The same holds for grp2: all 1's are known to be the same, all 2's the same, etc.

Thus, if we look at the 3rd and 4th rows: based on grp1, we know that the first three rows can be grouped together and the next three rows can be grouped together. Since rows 3 and 4 share the same grp2 value, all six rows can in fact be grouped together.

By the same logic, the last six rows can also be grouped together (since rows 7 and 10 share the same grp2 value).
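The chaining described above can be made visible with a short sketch. It is only an illustration of the reasoning, not part of any answer; the names n_grp1 and bridges are made up for this example.

```r
# Same data as in the question.
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))

# For every grp2 value, count how many distinct grp1 values it occurs with.
n_grp1 <- tapply(df$grp1, df$grp2, function(x) length(unique(x)))

# grp2 values that occur in more than one grp1 group act as bridges.
bridges <- names(n_grp1)[n_grp1 > 1]
bridges

# The grp1 groups joined by each bridging grp2 value.
joined <- lapply(split(df$grp1, df$grp2)[bridges], unique)
joined
```

Here the grp2 values 3 and 6 are the bridges: 3 joins grp1 groups 1 and 2, and 6 joins grp1 groups 3 and 4.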

Aside from writing a fairly involved set of for() loops, is there a more straightforward approach to this? I haven't been able to think of one yet.

The final output that I'm hoping to obtain would look something like:

# > df
#    grp1 grp2 combinedGrp
# 1     1    1           1
# 2     1    2           1
# 3     1    3           1
# 4     2    3           1
# 5     2    4           1
# 6     2    5           1
# 7     3    6           2
# 8     3    7           2
# 9     3    8           2
# 10    4    6           2
# 11    4    9           2
# 12    4   10           2

Thank you for any direction on this topic!

Solution

One way to do this is via a matrix that defines links between rows based on group membership.

This approach is related to @Frank's graph answer but uses an adjacency matrix rather than edges to define the graph. An advantage of this approach is that the same code handles more than two grouping columns without changes (so long as you write the function that determines links flexibly). A disadvantage is that you need to make all pairwise comparisons between rows to construct the matrix, so it could be slow for data with many rows. As is, @Frank's answer would work better for very long data, or if you only ever have two columns.

The steps are

  1. compare rows based on groups and define these rows as linked (i.e., create a graph)
  2. determine connected components of the graph defined by the links in 1.

You could do step 2 in a few ways. Below I show a brute-force way where you 2a) collapse links using matrix multiplication until the link structure is stable, and 2b) convert the link structure to group labels using hclust and cutree. You could also use igraph::clusters on a graph created from the matrix.

1. Construct an adjacency matrix (a matrix of pairwise links) between rows: if two rows are in the same group in any column, the matrix entry is 1; otherwise it is 0. First, make a helper function that determines whether two rows are linked:

linked_rows <- function(data){
  ## helper function
  ## returns a _function_ to compare two rows of data
  ##  based on group membership.

  ## Use Vectorize so it works even on vectors of indices
  Vectorize(function(i, j) {
    ## numeric: 1= i and j have overlapping group membership
    common <- vapply(names(data), function(name)
                     data[i, name] == data[j, name],
                     FUN.VALUE=FALSE)
    as.numeric(any(common))
  })
}

which I use in outer to construct a matrix,

rows <- 1:nrow(df)
A <- outer(rows, rows, linked_rows(df)) 
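The same 0/1 matrix can probably be built faster by comparing whole columns at once rather than one pair of rows at a time. The following is a minimal sketch under that assumption; link_matrix is a hypothetical helper, not part of the original answer, and it assumes the grouping columns contain no NA values.

```r
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10))

# Hypothetical helper: build the link matrix column by column.
link_matrix <- function(data) {
  # For each column, outer(x, x, "==") is TRUE where two rows share a value.
  same <- lapply(data, function(x) outer(x, x, "=="))
  # Two rows are linked if they match in at least one column.
  linked <- Reduce(`|`, same)
  # Convert logical to numeric so that %*% can be used in the later steps.
  linked + 0
}

A_fast <- link_matrix(df)
dim(A_fast)
```

This still allocates an n-by-n matrix, so it only removes the per-pair function calls, not the quadratic memory cost.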

2a. collapse 2-degree links to 1-degree links. That is, if rows are linked by an intermediate node but not directly linked, lump them in the same group by defining a link between them.

One iteration involves: i) matrix multiply to get the square of A, and ii) set any non-zero entry in the squared matrix to 1 (as if it were a first degree, pairwise link)

## define as a function to use below
lump_links <- function(A) {
  A <- A %*% A
  A[A > 0] <- 1
  A
}
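To see why squaring collapses two-step links, consider a small worked example with three rows in a chain. The matrix A_chain below is made up for illustration.

```r
# Three rows in a chain: 1-2 linked, 2-3 linked, 1-3 not directly linked.
A_chain <- matrix(c(1, 1, 0,
                    1, 1, 1,
                    0, 1, 1), nrow = 3, byrow = TRUE)

# Entry [i, j] of the square counts the two-step paths from i to j,
# where staying in place (the 1s on the diagonal) counts as a step.
A_sq <- A_chain %*% A_chain
A_sq

# Clamp every positive count back to 1, as lump_links() does.
A_sq[A_sq > 0] <- 1
A_sq
```

Entry [1, 3] was 0 in A_chain and becomes positive after squaring, because row 2 links rows 1 and 3. Since the diagonal is 1, each squaring doubles the path length that is covered, so the number of iterations grows roughly with the logarithm of the longest chain.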

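Repeat this until the links are stable:

## oldA starts as the scalar 0, which is recycled in the comparison,
## so the loop body runs at least once
oldA <- 0
while (any(oldA != A)) {
  oldA <- A
  A <- lump_links(A)
}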

2b. Use the stable link structure in A to define groups (connected components of the graph). You could do this a variety of ways.

One way is to first define a distance object, then use hclust and cutree. We want linked rows (A[i,j] == 1) to be at distance 0, so the steps are: a) define linked as distance 0 in a dist object, b) construct a tree from the dist object, c) cut the tree at zero height (i.e., zero distance):

df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
df
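As one of the other possible ways, the labels can also be read directly off a stable link matrix without building a tree. This is a sketch, not the answer's method: once the links are stable, every row in a group is linked to exactly the same set of rows, so the index of the first linked row identifies the group. The matrix A_stable and the names first_link, grp and grp_tree are made up for illustration.

```r
# A small, already stable link matrix:
# rows 1, 2, 4 form one group and rows 3, 5 form another.
A_stable <- matrix(0, 5, 5)
A_stable[c(1, 2, 4), c(1, 2, 4)] <- 1
A_stable[c(3, 5), c(3, 5)] <- 1

# For each row, the column index of its first 1 (the first linked row).
first_link <- max.col(A_stable, ties.method = "first")

# Renumber the groups as 1, 2, ... in order of first appearance.
grp <- match(first_link, unique(first_link))
grp

# The hclust/cutree route on the same matrix, for comparison.
grp_tree <- cutree(hclust(as.dist(1 - A_stable)), h = 0)
grp_tree
```

Both routes should produce the same partition of the rows; the sketch only avoids the dist and hclust objects.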

In practice you can encode steps 1 and 2 in a single function that uses the helpers lump_links and linked_rows:

lump <- function(df) {
  rows <- 1:nrow(df)
  A <- outer(rows, rows, linked_rows(df))

  oldA <- 0
  while (any(oldA != A)) {
    oldA <- A
    A <- lump_links(A)
  }
  df$combinedGrp <- cutree(hclust(as.dist(1 - A)), h = 0)
  df
}

This works for the original df and also for the structure in @rawr's answer

df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8,9),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10,11,3,12,3,6,12))
lump(df)

   grp1 grp2 combinedGrp
1     1    1           1
2     1    2           1
3     1    3           1
4     2    3           1
5     2    4           1
6     2    5           1
7     3    6           2
8     3    7           2
9     3    8           2
10    4    6           2
11    4    9           2
12    4   10           2
13    5   11           1
14    5    3           1
15    6   12           3
16    7    3           1
17    8    6           2
18    9   12           3

PS

Here's a version using igraph, which makes the connection with @Frank's answer clearer:

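lump2 <- function(df) {
  rows <- 1:nrow(df)
  A <- outer(rows, rows, linked_rows(df))
  ## in recent igraph releases these two functions are named
  ## components() and graph_from_adjacency_matrix()
  cluster_A <- igraph::clusters(igraph::graph.adjacency(A))
  df$combinedGrp <- cluster_A$membership
  df
}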
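For the long-data case mentioned earlier, where the n-by-n matrix becomes the bottleneck, one option is to skip the matrix entirely and propagate labels through the grouping columns. The following is a minimal sketch of that idea, not taken from either answer; lump_labels is a hypothetical helper and it assumes the grouping columns contain no NA values.

```r
df <- data.frame(grp1 = c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,6,7,8,9),
                 grp2 = c(1,2,3,3,4,5,6,7,8,6,9,10,11,3,12,3,6,12))

# Hypothetical helper: every row starts with its own index as label, and
# the smallest label is spread within each group until nothing changes.
lump_labels <- function(data) {
  label <- seq_len(nrow(data))
  repeat {
    old <- label
    for (col in names(data)) {
      # Within each group of this column, all rows take the smallest label.
      label <- ave(label, data[[col]], FUN = min)
    }
    if (identical(label, old)) break
  }
  # Renumber the labels as 1, 2, ... in order of first appearance.
  data$combinedGrp <- match(label, unique(label))
  data
}

lump_labels(df)$combinedGrp
```

On this 18-row example the sketch reproduces the combinedGrp column shown above. Memory use stays linear in the number of rows, at the cost of repeated passes over the columns.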
