算法自动配对意义分组在研发标签 [英] Algorithm for automating pairwise significance grouping labels in R

查看:391
本文介绍了算法自动配对意义分组在研发标签的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这个问题挣扎了一段时间后,我希望能在这里得到一些建议。我想知道是否有人知道用于基于意义的配对分组标签的自动方法。现在的问题是独立的显着性检验(例如图基用于参或曼 - 惠特尼非参数)的 - 定这些成对比较,一些箱线图型数字往往重新present这些分组与子脚本:

我已经这样做了例如通过手,它可以是相当繁琐。我认为,标签的算法中的序列,应根据各组的级别数 - 例如那些含有单个水平是显著不同于其他所有级别组应该首先命名,然后将含有2层,然后3等基团,所有的同时检查新分组增加一个新的需要的分组,不违反和差异。

在下面的例子中,棘手的部分是获得算法认识到级1应与图3和5,但3和5不应该被分组进行分组(即共享一个标签)。

例子code:

  set.seed(1)
N'LT;  -  7
N2<  -  100
亩<  -  cumsum(runif(N,分= -3,上限= 3))
西格玛<​​  -  runif(N,分= 1,最大值= 3)

DAT<  - 向量(模式=名单,N)
为(ⅰseq中(DAT)){
    的DAT [〔Ⅰ〕25;  -  RNORM(n2时,平均=亩[I]中,SD =西格玛[I])
}

DF&所述;  -  data.frame(组= as.factor(代表(SEQ(n)时,每= n2)的)中,y =不公开(DAT))

BP<  - 箱线图(Y〜组,DF,缺口= TRUE)
KR<  -  kruskal.test(Y〜组,DF)
KR
MW<  -  pairwise.wilcox.test(DF $ Y,DF $ G)
MW
MW $ p.value> 0.05#TRUE意味着该水平并不显著不同在p = 0.05的水平上

#1 2 3 4 5 6
#2 FALSE NA NA NA NA NA
#3是否NA NA NA NA
#4假假假NA NA NA
#5 TRUE假假假NA NA
#6假假假是否NA
#7假假假假假假

文本(X = 1:N,Y = BP $统计[4],标签= C(AB,C,A,D,B,D,E)山口= 1,CEX = 1.5,POS = 3,字体= 2)
 

解决方案

首先,让我重申这个问题在图论的语言。定义如下的曲线图。每个样品引起了重新presents它一个顶点。两个顶点之间,有一个边缘,当且仅当一些试验表明,样品重新指这些顶点psented $ P $不能统计学区分。在图论中,的集团的是一组顶点这样的,在该组的每两个顶点之间,有一个边缘。我们正在寻找派系的集合,使得图中的每条边所属的派系(至少?到底是什么?)之一。我们希望用尽可能少的派系越好。 (这个问题被称为集团的边缘覆盖,不派盖。)然后我们给每个集团自己的文字和以该字母标记的成员。每个样品的所有其他区分都有自己的信也。

例如,对应样品输入图中可以得出这样的。

  3 --- 1 --- 5 4-6
 

我的算法如下。构建图形和使用勒布朗 - Kerbosch算法找到所有最大派系。对于上面的曲线图,这些是{1,3},{1,5}和{4,6}。集合{1},例如,为一个集团,但它不是最大,因为它是这种集团型的一个子集{1,3}。集合{1,3,5}不是小集团,因为有3和5之间没有边缘在图中

  1
 / \
3 --- 5 4-6,
 

最大小集团。将{1,3,5}和{4,6}。

现在递归搜索一个小集团边盖。输入到我们的递归函数是一组剩余的被覆盖的边缘和最大派系的列表。找到最边缘的剩余集,其中,例如,边缘(1,2)< (1,5)&其中; (2,3)&其中; (2,5)&其中; (3,4)。对于每一个极大团包含此边缘,构造由该集团并且其中集团边缘被从组其余边缘的除去递归调用的输出的候选解决方案。输出的最佳人选。

除非有极少数的边缘,这可能是太慢了。第一个性能改进是memoize的:保持与输入的映射到递归函数的输出,使我们能避免做工作的两倍。如果不行,则R应该有一个接口的整数规划解算器,我们可以用整数规划确定拉帮结派的最佳集合。 (我会解释这更多,如果其他的办法是不够的。)

After struggling with this problem for a while, I am hoping to get some advice here. I am wondering if anyone is aware of an automated method for determining pairwise grouping labels based on significance. The question is independent of the significance test (e.g. Tukey for parametric or Mann-Whitney for non-parametric) - given these pairwise comparisons, some boxplot-type figures often represent these groupings with a sub-script:

I have done this example by hand, which can be quite tedious. I think that the sequence of labeling in the algorithm should be based on the number of levels in each group - e.g. those groups containing single levels that are significantly different from all other levels should be named first, then groups containing 2 levels, then 3, etc., all the while checking that new groupings add a new needed grouping and do not violate and differences.

In the example below, the tricky part is getting the algorithm to recognize that level 1 should be grouped with 3 and 5, but 3 and 5 should not be grouped (i.e. share a label).

Example code:

set.seed(1)
n <- 7
n2 <- 100
mu <- cumsum(runif(n, min=-3, max=3))
sigma <- runif(n, min=1, max=3)

dat <- vector(mode="list", n)
for(i in seq(dat)){
    dat[[i]] <- rnorm(n2, mean=mu[i], sd=sigma[i])
}

df <- data.frame(group=as.factor(rep(seq(n), each=n2)), y=unlist(dat))

bp <- boxplot(y ~ group, df, notch=TRUE)
kr <- kruskal.test(y ~ group, df)
kr
mw <- pairwise.wilcox.test(df$y, df$g)
mw
mw$p.value > 0.05 # TRUE means that the levels are not significantly different at the p=0.05 level

#      1     2     3     4     5     6
#2 FALSE    NA    NA    NA    NA    NA
#3  TRUE FALSE    NA    NA    NA    NA
#4 FALSE FALSE FALSE    NA    NA    NA
#5  TRUE FALSE FALSE FALSE    NA    NA
#6 FALSE FALSE FALSE  TRUE FALSE    NA
#7 FALSE FALSE FALSE FALSE FALSE FALSE

text(x=1:n, y=bp$stats[4,], labels=c("AB", "C", "A", "D", "B", "D", "E"), col=1, cex=1.5, pos=3, font=2)

解决方案

First let me restate the problem in the language of graph theory. Define a graph as follows. Each sample gives rise to a vertex that represents it. Between two vertices, there is an edge if and only if some test indicates that the samples represented by those vertices could not be distinguished statistically. In graph theory, a clique is a set of vertices such that, between every two vertices in the set, there is an edge. We're looking for a collection of cliques such that every edge in the graph belongs to (at least? exactly?) one of the cliques. We'd like to use as few cliques as possible. (This problem is called clique edge cover, not clique cover.) We then assign each clique its own letter and label its members with that letter. Each sample distinguishable from all others gets its own letter as well.

For example, the graph corresponding to your sample input could be drawn like this.

3---1---5       4--6

My proposed algorithm is the following. Construct the graph and use the Bron--Kerbosch algorithm to find all maximal cliques. For the graph above, these are {1, 3}, {1, 5}, and {4, 6}. The set {1}, for example, is a clique, but it is not maximal because it is a subset of the clique {1, 3}. The set {1, 3, 5} is not a clique because there is no edge between 3 and 5. In the graph

  1
 / \
3---5       4--6,

the maximal cliques would be {1, 3, 5} and {4, 6}.

Now search recursively for a small clique edge cover. The input to our recursive function is a set of edges remaining to be covered and the list of maximal cliques. Find the least edge in the remaining set, where, e.g., edge (1,2) < (1,5) < (2,3) < (2,5) < (3,4). For each maximal clique that contains this edge, construct a candidate solution comprised of that clique and the output of a recursive call where the clique edges are removed from set of edges remaining. Output the best candidate.

Unless there are very few edges, this may be too slow. The first performance improvement is memoize: maintain a map from inputs to outputs of the recursive function so that we can avoid doing the work twice. If that doesn't work, then R should have an interface to an integer program solver, and we can use integer programming to determine the best collection of cliques. (I'll explain this more if the other approach is insufficient.)

这篇关于算法自动配对意义分组在研发标签的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆