Looping and clustering

Problem description

First of all, I have to admit that this is too hard for me to do on my own. I have to analyze some data, and this step is crucial for me. I would be very grateful if someone could provide a solution (I will even start a bounty to give extra reputation points for the correct answer). It has to be done as soon as possible, which is why I gave up and decided to ask for help.

Data which I want to analyze:

> dput(tbl_clustering)
structure(list(P1 = structure(c(14L, 14L, 6L, 6L, 6L, 19L, 15L, 
13L, 13L, 13L, 13L, 10L, 10L, 6L, 6L, 10L, 27L, 27L, 27L, 27L, 
27L, 22L, 22L, 22L, 21L, 21L, 21L, 27L, 27L, 27L, 27L, 21L, 21L, 
21L, 28L, 28L, 25L, 25L, 25L, 29L, 29L, 17L, 17L, 17L, 5L, 5L, 
5L, 5L, 20L, 20L, 23L, 23L, 23L, 23L, 7L, 26L, 26L, 24L, 24L, 
24L, 24L, 3L, 3L, 3L, 9L, 8L, 2L, 11L, 11L, 11L, 11L, 11L, 12L, 
12L, 4L, 4L, 4L, 1L, 1L, 1L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 
18L, 18L, 18L, 18L, 16L, 16L, 16L, 16L, 16L, 16L, 16L), .Label = c("AT1G09130", 
"AT1G09620", "AT1G10760", "AT1G14610", "AT1G43170", "AT1G58080", 
"AT2G27680", "AT2G27710", "AT3G03710", "AT3G05590", "AT3G11510", 
"AT3G56130", "AT3G58730", "AT3G61540", "AT4G03520", "AT4G22930", 
"AT4G33030", "AT5G01600", "AT5G04710", "AT5G17990", "AT5G19220", 
"AT5G43940", "AT5G63310", "ATCG00020", "ATCG00380", "ATCG00720", 
"ATCG00770", "ATCG00810", "ATCG00900"), class = "factor"), P2 = structure(c(55L, 
54L, 29L, 4L, 70L, 72L, 18L, 9L, 58L, 68L, 19L, 6L, 1L, 16L, 
34L, 32L, 77L, 12L, 61L, 41L, 71L, 73L, 50L, 11L, 69L, 22L, 60L, 
42L, 47L, 45L, 59L, 30L, 24L, 23L, 77L, 45L, 12L, 47L, 59L, 82L, 
75L, 40L, 26L, 83L, 81L, 47L, 36L, 45L, 2L, 65L, 11L, 38L, 13L, 
31L, 53L, 78L, 7L, 80L, 79L, 7L, 76L, 17L, 10L, 3L, 68L, 51L, 
48L, 62L, 58L, 64L, 68L, 74L, 63L, 14L, 57L, 33L, 56L, 39L, 52L, 
35L, 43L, 25L, 27L, 21L, 15L, 5L, 49L, 37L, 66L, 20L, 44L, 69L, 
22L, 67L, 57L, 8L, 46L, 28L), .Label = c("AT1G01090", "AT1G02150", 
"AT1G03870", "AT1G09795", "AT1G13060", "AT1G14320", "AT1G15820", 
"AT1G17745", "AT1G20630", "AT1G29880", "AT1G29990", "AT1G43170", 
"AT1G52340", "AT1G52670", "AT1G56450", "AT1G59900", "AT1G69830", 
"AT1G75330", "AT1G78570", "AT2G05840", "AT2G28000", "AT2G34590", 
"AT2G35040", "AT2G37020", "AT2G40300", "AT2G42910", "AT2G44050", 
"AT2G44350", "AT2G45440", "AT3G01500", "AT3G03980", "AT3G04840", 
"AT3G07770", "AT3G13235", "AT3G14415", "AT3G18740", "AT3G22110", 
"AT3G22480", "AT3G22960", "AT3G51840", "AT3G54210", "AT3G54400", 
"AT3G56090", "AT3G60820", "AT4G00100", "AT4G00570", "AT4G02770", 
"AT4G11010", "AT4G14800", "AT4G18480", "AT4G20760", "AT4G26530", 
"AT4G28750", "AT4G30910", "AT4G30920", "AT4G33760", "AT4G34200", 
"AT5G02500", "AT5G02960", "AT5G10920", "AT5G12250", "AT5G13120", 
"AT5G16390", "AT5G18380", "AT5G35360", "AT5G35590", "AT5G35630", 
"AT5G35790", "AT5G48300", "AT5G52100", "AT5G56030", "AT5G60160", 
"AT5G64300", "AT5G67360", "ATCG00160", "ATCG00270", "ATCG00380", 
"ATCG00540", "ATCG00580", "ATCG00680", "ATCG00750", "ATCG00820", 
"ATCG01110"), class = "factor"), No_Interactions = c(8L, 5L, 
5L, 9L, 7L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 5L, 8L, 6L, 
5L, 5L, 5L, 5L, 5L, 5L, 10L, 6L, 6L, 5L, 5L, 5L, 5L, 8L, 5L, 
5L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 
6L, 5L, 5L, 6L, 5L, 5L, 6L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 
5L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 7L, 
8L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 7L, 5L, 5L, 
6L)), .Names = c("P1", "P2", "No_Interactions"), class = "data.frame", row.names = c(NA, 
-98L))

To better explain what I want to achieve, here are some rows:

        P1        P2 No_Interactions
1  AT3G61540 AT4G30920               8
2  AT3G61540 AT4G30910               5
3  AT1G58080 AT2G45440               5
4  AT1G58080 AT1G09795               9
5  AT1G58080 AT5G52100               7
6  AT5G04710 AT5G60160               6
7  AT4G03520 AT1G75330               5
8  AT3G58730 AT1G20630               5
9  AT3G58730 AT5G02500               5
10 AT3G58730 AT5G35790               5

First, a new column Cluster has to be created. After that, only the two columns P1 and P2 matter. The first row holds the names AT3G61540 and AT4G30920, and that is the starting point (I believe a loop will be necessary), so we put 1 in the Cluster column for that row. Then we take the first name, AT3G61540, and scan both P1 and P2: wherever this name appears again, paired with a different name than in the first row, that row also gets Cluster 1. Next we take the second name from the first row, AT4G30920, and scan the whole data in the same way. The next step is to analyze the next row in exactly the same way. Here the second row has the same P1 name, so it does not need to be scanned again, but the second name, AT4G30910, is different and should be scanned. This row should belong to cluster 1 as well. Cluster 2 starts at the third row, because it contains a completely new pair of names.

I am aware that this is not an easy task and that it probably has to be done in a couple of steps. For that reason I am offering 100 rep points to whoever finds the best solution (the bounty will be awarded in a few days).

EDIT: The output I would like to get:

       P1        P2 No_Interactions      Cluster
1  AT3G61540 AT4G30920               8      1
2  AT3G61540 AT4G30910               5      1
3  AT1G58080 AT2G45440               5      2
4  AT1G58080 AT1G09795               9      2
5  AT1G58080 AT5G52100               7      2
6  AT5G04710 AT5G60160               6      3
7  AT5G52100 AT1G75330               5      2 ### Cluster 2 because AT5G52100 was found in the row number 5 as a partner of AT1G58080
8  AT3G58730 AT1G20630               5      5
9  AT3G58730 AT5G02500               5      5
10 AT3G58730 AT3G61540               5      1 ## Cluster 1 because AT3G61540 was found in first row.
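
One way to read the rule described above is that rows sharing a name, directly or through a chain of other rows, end up in the same cluster. The following is only a minimal base R sketch under that assumption; `assign_clusters` and `example_rows` are hypothetical names, and the ten rows are copied from the desired output above.

```r
# Minimal sketch, assuming "cluster" means: rows linked by a shared name,
# directly or through a chain of rows, get the same label.
# assign_clusters is a hypothetical helper, not code from the question.
assign_clusters <- function(p1, p2) {
  p1 <- as.character(p1)
  p2 <- as.character(p2)
  cluster <- integer(length(p1))      # 0 means "not yet assigned"
  next_id <- 0L
  for (i in seq_along(p1)) {
    if (cluster[i] != 0L) next        # row already reached from an earlier row
    next_id <- next_id + 1L
    names_seen <- c(p1[i], p2[i])
    repeat {
      # rows containing any name collected so far
      hit <- p1 %in% names_seen | p2 %in% names_seen
      cluster[hit] <- next_id
      new_names <- setdiff(c(p1[hit], p2[hit]), names_seen)
      if (length(new_names) == 0L) break   # no new names: cluster is complete
      names_seen <- c(names_seen, new_names)
    }
  }
  cluster
}

# The ten rows shown in the desired output above
example_rows <- data.frame(
  P1 = c("AT3G61540", "AT3G61540", "AT1G58080", "AT1G58080", "AT1G58080",
         "AT5G04710", "AT5G52100", "AT3G58730", "AT3G58730", "AT3G58730"),
  P2 = c("AT4G30920", "AT4G30910", "AT2G45440", "AT1G09795", "AT5G52100",
         "AT5G60160", "AT1G75330", "AT1G20630", "AT5G02500", "AT3G61540")
)
example_rows$Cluster <- assign_clusters(example_rows$P1, example_rows$P2)
print(example_rows)
```

Under this reading, rows 8 and 9 are pulled into cluster 1 because they share AT3G58730 with row 10, so the sketch gives them 1 rather than the 5 shown in the table above. Whether that chaining is intended is an assumption about the task.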

Solution

I corrected my initial answer and now propose a functional programming approach that uses Map and recursion to find your clusters:

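library(magrittr)

# TRUE if the two name vectors share at least one element
similar = function(u, v) length(intersect(u, v)) > 0

clusterify = function(df)
{
    clusters = df$cluster

    # Stop once every row has a cluster (0 means "not yet assigned")
    if(!any(clusters == 0)) return(df)

    # First unassigned row, and the name pair of every row
    idx = match(0, clusters)
    lst = Map(c, as.character(df[,1]), as.character(df[,2]))
    el  = c(as.character(df[idx, 1]), as.character(df[idx, 2]))

    # K is 1 for rows sharing a name with row idx, 0 otherwise
    K = lst %>%
        sapply(similar, v=el) %>%
        add(0)

    # Rows linked to row idx that already belong to a cluster
    mask = clusters != 0 & K == 1

    if(any(mask))
    {
        # Reuse the smallest cluster number among the linked rows
        cl = min(clusters[mask])
        df[K==1,]$cluster = cl
    }
    else
    {
        # No linked row is assigned yet: open a new cluster
        df[K==1,]$cluster = max(clusters) + 1
    }

    clusterify(df)
}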

You can call it with clusterify(transform(df, cluster=0)), where df is your data frame (tbl_clustering in the question).

For example, the clustering works correctly on your data. Here is cluster 9 (you can check the other clusters the same way):

subset(clusterify(transform(df, cluster=0)), cluster==9)
#          P1        P2 No_Interactions cluster
#25 AT5G19220 AT5G48300              10       9
#26 AT5G19220 AT2G34590               6       9
#27 AT5G19220 AT5G10920               6       9
#32 AT5G19220 AT3G01500               8       9
#33 AT5G19220 AT2G37020               5       9
#34 AT5G19220 AT2G35040               5       9
#92 AT4G22930 AT5G48300               5       9
#93 AT4G22930 AT2G34590               5       9
#94 AT4G22930 AT5G35630               5       9
#95 AT4G22930 AT4G34200               7       9
#96 AT4G22930 AT1G17745               5       9
#97 AT4G22930 AT4G00570               5       9
#98 AT4G22930 AT2G44350               6       9
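
Checking every cluster by eye gets tedious, so a small consistency check may help. The sketch below is an assumption about what "correct" means here (no name should appear in two different clusters); `names_in_multiple_clusters` and the `toy` data frame are hypothetical and only illustrate the idea.

```r
# Hypothetical helper: return the names that occur in more than one cluster.
# Expects a data frame with columns P1, P2 and cluster.
names_in_multiple_clusters <- function(df) {
  # one row per distinct (name, cluster) combination
  long <- unique(data.frame(
    name    = c(as.character(df$P1), as.character(df$P2)),
    cluster = rep(df$cluster, 2),
    stringsAsFactors = FALSE
  ))
  counts <- table(long$name)          # number of clusters each name appears in
  names(counts)[as.vector(counts > 1)]
}

# Toy data: "C" appears in row 2 (cluster 1) and row 3 (cluster 2)
toy <- data.frame(
  P1 = c("A", "A", "C", "D"),
  P2 = c("B", "C", "E", "F"),
  cluster = c(1, 1, 2, 3)
)
print(names_in_multiple_clusters(toy))
```

An empty result means no name straddles two clusters. Any names returned point to rows that probably should have been merged into one cluster.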

I will add some explanations later on how the algorithm proceeds to chain rows into a cluster.
