R: Compute the row-wise similarity for a dataframe and sort data out based on that similarity

Problem description

Hi guys, sorry, but I'm tying myself in knots over a row-wise similarity comparison. I have a table of clustering results from a few hundred runs, and it looks like this:

The first column is the sample ID. For each run I asked k-means to give me 8 clusters, with 10% of the samples dropped per run as a stability test.

Because each run is independent, cluster 1 in run_0 is not the same as cluster 1 in run_1; the numbers are assigned arbitrarily.

I want to compute the similarity between sample rows, that is, I want to know which samples stay together most of the time across these hundreds of runs.

I have seen this post, and it is almost what I want:

How to calculate the similarity for all the rows in a table in R?

However, I did not quite understand the function.

row_cf <- function(x, y, df){
  sum(df[x,] == df[y,])/ncol(df)
}

Could you explain this function a little more? I don't understand why sum(df[x,] == df[y,])/ncol(df) represents the similarity between rows x and y. Does it count how many columns of row x are equal to the same columns of row y, and then give me the proportion of matching columns?

If that is the case, then assigning all NAs a fixed value like 9 will increase the similarity, right?

Calculate row similarity percentage pair wise and add it as a new colum

I have also seen this post, and its output is roughly what I want. My end goal is an output with the patient sample ID in the first column, the ID of the most similar sample in the second column, and the similarity score in the third column.

Dummy data, if you need it:

Sample <- LETTERS[seq( from = 1, to = 20 )]
run_1 <- rep(1:4, each=5)
run_2 <- c(rep(1:2, each=4),rep(3:4,6))
run_3 <- rep(4:1, each=5)
run_4 <- c(rep(4:3, each=4),rep(1:2,6))

df <- data.frame(cbind(Sample, run_1,run_2,run_3,run_4))


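library(dplyr)
library(tibble)

# move the sample IDs into the row names
# (the ID column is called "Sample" in the dummy data above)
df1 <- df %>% remove_rownames() %>%
  column_to_rownames(var="Sample")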


# replace NA with some value outside the cluster ID range

df1[is.na(df1)] <- 10



# define a similarity function

row_cf <- function(x, y, df){
  sum(df[x,]==df[y,])/ncol(df)
}


#calculate the similarity

Sim <- expand.grid(1:nrow(df1), 1:nrow(df1)) %>%
  rename(row_1 = Var1, row_2 = Var2) %>%
  rowwise() %>%
  mutate(similarity = row_cf(row_1, row_2, df1)) %>%
  filter(row_1 != row_2) %>%
  group_by(row_1) %>%
  slice(which.max(similarity))

 #join to known data table

 df1 %>% mutate(row_1 = 1:n()) %>%
   left_join(Sim)

This is my modified attempt, but it doesn't quite do the job. If I join the tables, I still lose the row names.

My ideal output would be:

    Row_1   Row_2  Similarity

    A        C       90%
    B        E       90%
    C        J       88%
    D        N       80%
    E        Y       70%
    F        G       60%

The reason I want to keep the IDs is that eventually I want to see which samples are most similar, as the post above did, but I also want to sort them into 8 clusters based on that similarity, to arrive at a final, stable set of 8 clusters of samples. How could I do this segmentation? Run a hierarchical clustering?

Solution

I don't think it's a good idea to replace the NAs with a code, as that assumes all the NAs are the same, which I don't think is appropriate. Your choice of similarity metric is good, but as it's symmetric we can avoid half the comparisons.
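
To make the NA point concrete, here is a minimal sketch with two made-up rows. The vectors `a` and `b` and the helper name `match_fraction` are assumptions for illustration only, not part of the question. Within a single run two samples carry the same label exactly when they landed in the same cluster, so the proportion of matching columns is the proportion of runs in which the pair stayed together. Recoding NA as a fixed value makes a shared dropout count as agreement, whereas `na.rm = TRUE` simply skips it.

```r
# Two hypothetical rows of cluster labels across 5 runs; NA = sample dropped out
a <- c(1, 2, NA, 4, NA)
b <- c(1, 3, NA, 4, 2)

# Proportion of runs in which the two labels match; NA comparisons are skipped
match_fraction <- function(x, y) sum(x == y, na.rm = TRUE) / length(x)

# Keeping the NAs: only runs 1 and 4 agree
sim_na <- match_fraction(a, b)

# Recoding NA as 10: run 3 (both samples dropped out) now counts as a match too
a10 <- ifelse(is.na(a), 10, a)
b10 <- ifelse(is.na(b), 10, b)
sim_coded <- match_fraction(a10, b10)

# The measure is symmetric, so only choose(n, 2) pairs need to be compared
n <- 18
n_pairs <- ncol(combn(n, 2))

print(c(sim_na = sim_na, sim_coded = sim_coded, n_pairs = n_pairs))
```

Here the recoded version scores higher purely because both samples were dropped in the same run, which says nothing about whether they cluster together.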

Example data

set.seed(1)

Sample <- LETTERS[1:18]
r <- sort(rep(1:6, 3))

df <- replicate(20, {
    ix <- sample(1:length(r), 7)
    r[ix] <- sample(r[ix], 7, rep=TRUE)
    r
})

df[sample(1:length(df), 40)] <- NA
df <- cbind(Sample, data.frame(df), stringsAsFactors=FALSE)

Calculating pairwise similarity (the proportion of matching runs, i.e. 1 minus the normalized Hamming distance)

pair <- t(combn(1:nrow(df), 2))
similarity <- numeric(nrow(pair))
id <- matrix("", nrow(pair), 2)

m <- matrix(NA, nrow(df), nrow(df))
dimnames(m) <- list(df[,1], df[,1])

hamming <- function(a, b) {
    sum(a == b, na.rm=TRUE)/length(a)
}

for (i in 1:nrow(pair)) {
    r <- pair[i,]
    similarity[i] <- hamming(df[r[1], -1], df[r[2], -1]) 
    id[i, ] <- df[r, 1]
    m[id[i, , drop=FALSE]] <- similarity[i]
}

out <- data.frame(id, similarity, stringsAsFactors=FALSE)
out <- out[order(similarity, decreasing=TRUE), ]
rownames(out) <- NULL

head(out)
#   X1 X2 similarity
# 1  B  C       0.60
# 2  A  B       0.50
# 3  M  N       0.45
# 4  P  R       0.45
# 5  A  C       0.40
# 6  G  H       0.40
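
The question asked for one row per sample showing its most similar partner and the score. One possible way to get that from a pairwise matrix is sketched below on a small made-up matrix; the names `m_up`, `sim` and `best` and all the values are hypothetical. It assumes, as in the loop above, that only the upper triangle has been filled, so the matrix is mirrored first.

```r
# Hypothetical similarity matrix for four samples, upper triangle only
ids <- c("A", "B", "C", "D")
m_up <- matrix(NA_real_, 4, 4, dimnames = list(ids, ids))
# upper.tri() fills column by column: A-B, A-C, B-C, A-D, B-D, C-D
m_up[upper.tri(m_up)] <- c(0.6, 0.4, 0.5, 0.1, 0.2, 0.3)

# Mirror the upper triangle into the lower one so every row is complete
sim <- m_up
sim[lower.tri(sim)] <- t(sim)[lower.tri(sim)]

# For each row, index of the most similar other sample
# (the diagonal stays NA, and which.max() ignores NA)
best_col <- unname(apply(sim, 1, which.max))

best <- data.frame(
  Row_1 = ids,
  Row_2 = ids[best_col],
  Similarity = sim[cbind(seq_along(ids), best_col)],
  stringsAsFactors = FALSE
)
print(best)
```

Note that `which.max()` returns only the first maximum, so ties are resolved in favour of the earlier sample.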

kmeans(as.dist(t(m)), 4)$cluster
# A B C D E F G H I J K L M N O P Q R 
# 1 1 1 2 2 2 2 4 2 2 2 2 4 4 4 3 3 3 

plot(hclust(1-as.dist(t(m))))
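
To turn the dendrogram into a fixed number of groups, `cutree()` can cut the tree at a chosen `k`. The following is only a sketch on simulated data: the 12 samples, 3 underlying groups, 30 runs and the "average" linkage are all assumptions made for illustration. With the real table you would use your own similarity matrix and `k = 8`.

```r
set.seed(42)

# Hypothetical data: 12 samples in 3 true groups, clustered over 30 runs
truth <- rep(1:3, each = 4)
runs <- replicate(30, {
  lab <- truth
  flip <- sample(seq_along(lab), 2)            # two samples get a noisy label
  lab[flip] <- sample(1:3, 2, replace = TRUE)
  sample(1:3)[lab]                             # label numbers are arbitrary per run
})
rownames(runs) <- LETTERS[1:12]

# Pairwise similarity: proportion of runs in which two samples share a label
n <- nrow(runs)
sim <- matrix(1, n, n, dimnames = list(rownames(runs), rownames(runs)))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    sim[i, j] <- sim[j, i] <- mean(runs[i, ] == runs[j, ])
  }
}

# Hierarchical clustering on 1 - similarity, then cut into k groups
hc <- hclust(as.dist(1 - sim), method = "average")
groups <- cutree(hc, k = 3)

# Cross-tabulate the recovered groups against the simulated truth
print(table(groups, truth))
```

Because the per-run labels are permuted, the example also shows that the similarity only depends on which samples share a label within a run, not on the label values themselves.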
