How to perform clustering without removing rows where NA is present in R


Question

I have a dataset in which some elements are NA. I want to perform clustering without removing the rows that contain NA.

I understand that the Gower distance measure in daisy allows for this situation. So why doesn't my code below work? I also welcome alternatives to daisy.

# Plot a heat map together with a dendrogram.

library("gplots")
library("cluster")


# Arbitrarily assigning NA to some elements
mtcars[2,2] <- "NA"
mtcars[6,7]  <- "NA"

 mydata <- mtcars

hclustfunc <- function(x) hclust(x, method="complete")

# Initially I wanted to use this but it didn't take NA
#distfunc <- function(x) dist(x,method="euclidean")

# Try using daisy with the Gower metric,
# which is supposed to work with NA values
distfunc <- function(x) daisy(x,metric="gower")

d <- distfunc(mydata)
fit <- hclustfunc(d)

# Perform clustering heatmap
heatmap.2(as.matrix(mydata),dendrogram="row",trace="none", margin=c(8,9), hclust=hclustfunc,distfun=distfunc);

The error message I got is:

    Error in which(is.na) : argument to 'which' is not logical
Calls: distfunc.g -> daisy
In addition: Warning messages:
1: In data.matrix(x) : NAs introduced by coercion
2: In data.matrix(x) : NAs introduced by coercion
3: In daisy(x, metric = "gower") :
  binary variable(s) 8, 9 treated as interval scaled
Execution halted

Ultimately, I'd like to perform hierarchical clustering on data that contains NA values.

Update

Converting with as.numeric works for the example above. But why does this code fail when the data is read from a text file?

library("gplots")
library("cluster")

# This time read from file
mtcars <- read.table("http://dpaste.com/1496666/plain/",na.strings="NA",sep="\t")

# Following suggestion convert to numeric
mydata <- apply( mtcars, 2, as.numeric )

hclustfunc <- function(x) hclust(x, method="complete")
#distfunc <- function(x) dist(x,method="euclidean")
# Try using daisy GOWER function 
distfunc <- function(x) daisy(x,metric="gower")

d <- distfunc(mydata)
fit <- hclustfunc(d)

heatmap.2(as.matrix(mydata),dendrogram="row",trace="none", margin=c(8,9), hclust=hclustfunc,distfun=distfunc);

The error I get is:

  Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
Error in hclust(x, method = "complete") : 
  NA/NaN/Inf in foreign function call (arg 11)
Calls: hclustfunc -> hclust
Execution halted

Recommended answer

The error is due to the presence of non-numeric variables in the data (numbers encoded as strings). Assigning the quoted string "NA" instead of the missing value NA turns the affected columns into character columns. You can convert them to numbers:

mydata <- apply( mtcars, 2, as.numeric )
d <- distfunc(mydata)
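As a minimal sketch of the conversion above, the following reproduces the problem on a copy of mtcars, finds the columns that became character, converts them, and then runs daisy and hclust. It converts column by column with lapply rather than apply, on the assumption that you want to keep the data frame and its row names (apply with as.numeric returns a matrix without row names). The names char_cols and all_na_cols are illustrative only. The final check for columns that are entirely NA is a guess at what to look for in the file-based variant, not a confirmed diagnosis.

```r
library(cluster)  # provides daisy()

# Reproduce the problem on a copy: assigning the string "NA"
# coerces the whole column to character.
mydata <- mtcars
mydata[2, 2] <- "NA"
mydata[6, 7] <- "NA"

# Find which columns are no longer numeric.
char_cols <- names(mydata)[sapply(mydata, is.character)]
print(char_cols)

# Convert every column to numeric, keeping the data frame and its row names.
# as.numeric("NA") yields a real NA and warns about coercion, hence suppressWarnings().
mydata[] <- lapply(mydata, function(col) suppressWarnings(as.numeric(as.character(col))))

# Assumption: a column that is entirely NA after conversion (for example a
# column of text labels) cannot be range-scaled and should be dropped first.
all_na_cols <- names(mydata)[colSums(!is.na(mydata)) == 0]
if (length(all_na_cols) > 0) {
  mydata <- mydata[, setdiff(names(mydata), all_na_cols), drop = FALSE]
}

# The Gower dissimilarity skips missing entries pair by pair,
# so rows containing NA are kept.
d <- daisy(mydata, metric = "gower")
fit <- hclust(d, method = "complete")
plot(fit)
```

Because the result is still a data frame with the original row names, the dendrogram leaves are labelled with the car names.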
