Basic clustering with R

Views: 111  Published: 2020/10/3 2:15:19  Tags: r cluster-analysis hierarchical-clustering

Question

I'm new to R and data analysis. I'm trying to create a simple custom recommendation system for a website. As input I have the user/session-id, item-id, and item-price of each item a user clicked on.

c165c2ee-81cf-48cf-ba3f-83b70204c00c    161785  124.0
a886fdd5-7cee-4152-b1b7-77a2702687b0    643339  42.0
5e5fd670-b104-445b-a36d-b3798cd43279    131332  38.0
888d736f-99bc-49ca-969d-057e7d4bb8d1    1032763 39.0

I would like to apply cluster analysis to this data.

If I try to apply k-means clustering to my data:

> q <- kmeans(dat, centers=25)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In kmeans(dat, centers = 25) : NAs introduced by coercion

If I try to apply hierarchical clustering to the data:

> m <- as.matrix(dat)
> d <- dist(m)   # find distance matrix
Warning message:
In dist(m) : NAs introduced by coercion

The "NAs introduced by coercion" warning seems to occur because the first column is not numeric. So I tried running the code against dat[-1], but the result is the same.

What am I missing or doing wrong?

Thanks a lot.

=== UPDATE #1 ===

Output of str and factor:

> str(dat)
'data.frame':   14634 obs. of  3 variables:
 $ V3 : Factor w/ 10062 levels "000880bf-6cb7-4c4a-9a9d-1c0a975b52ba",..: 7548 6585 3670 5336 9181 6429 62 410 7386 9409 ...
 $ V8 : Factor w/ 5561 levels "1000120","1000910",..: 835 3996 443 65 1289 2084 582 695 3666 4787 ...
 $ V12: Factor w/ 395 levels "100.0","101.0",..: 25 278 249 256 352 249 1 88 361 1 ...

> dat[,1] = factor(dat[,1])
> str(dat)
'data.frame':   14634 obs. of  3 variables:
 $ V3 : Factor w/ 10062 levels "000880bf-6cb7-4c4a-9a9d-1c0a975b52ba",..: 7548 6585 3670 5336 9181 6429 62 410 7386 9409 ...
 $ V8 : Factor w/ 5561 levels "1000120","1000910",..: 835 3996 443 65 1289 2084 582 695 3666 4787 ...
 $ V12: Factor w/ 395 levels "100.0","101.0",..: 25 278 249 256 352 249 1 88 361 1 ...

> dd <- dist(dat)
Warning message:
In dist(dat) : NAs introduced by coercion
> hc <- hclust(dd)                # apply hierarchical clustering
Error in hclust(dd) : NA/NaN/Inf in foreign function call (arg 11)

=== UPDATE #2 ===

I would prefer not to remove the first column, because there could be multiple clicks from the same user, which I consider important for the analysis.

Recommended answer

It sounds like you want to retain the first column (even though 10062 levels for 14634 observations is quite high). One way to convert a factor to numeric values is the model.matrix function. Before converting the factor:

data(iris)
head(iris)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

After model.matrix:

head(model.matrix(~.+0, data=iris))
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa Speciesversicolor Speciesvirginica
# 1          5.1         3.5          1.4         0.2             1                 0                0
# 2          4.9         3.0          1.4         0.2             1                 0                0
# 3          4.7         3.2          1.3         0.2             1                 0                0
# 4          4.6         3.1          1.5         0.2             1                 0                0
# 5          5.0         3.6          1.4         0.2             1                 0                0
# 6          5.4         3.9          1.7         0.4             1                 0                0
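
The same expansion can be sketched on data shaped like the click log in the question. This is only an illustration under stated assumptions: the `clicks` data frame, its column names and its values are made up, and treating the item id as a plain number is a simplification. Because `str(dat)` shows every column stored as a factor, the numeric-looking columns are converted back to numbers first.

```r
# Toy click log with made-up values. Every column is a factor,
# mirroring the str(dat) output in the question.
clicks <- data.frame(
  user  = factor(c("u1", "u2", "u1", "u3")),
  item  = factor(c("161785", "643339", "131332", "1032763")),
  price = factor(c("124.0", "42.0", "38.0", "39.0"))
)

# as.numeric() on a factor returns the internal level codes, so go
# through as.character() first to recover the printed values.
clicks$item  <- as.numeric(as.character(clicks$item))
clicks$price <- as.numeric(as.character(clicks$price))

# An NA at this point would indicate an entry that is not a valid number.
stopifnot(!anyNA(clicks$item), !anyNA(clicks$price))

# Expand the remaining factor (user) into one indicator column per level.
mm <- model.matrix(~ . + 0, data = clicks)
dim(mm)       # 4 rows, 5 columns: 3 user indicators + item + price
colnames(mm)
```

With many distinct users the matrix gets one column per user, which is why the high number of levels is worth keeping in mind.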

As you can see, it expands the factor into one indicator column per level. You could then run k-means clustering on the expanded version of your data:

kmeans(model.matrix(~.+0, data=iris), centers=3)
# K-means clustering with 3 clusters of sizes 49, 50, 51
# 
# Cluster means:
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa Speciesversicolor Speciesvirginica
# 1     6.622449    2.983673     5.573469    2.032653             0         0.0000000       1.00000000
# 2     5.006000    3.428000     1.462000    0.246000             1         0.0000000       0.00000000
# 3     5.915686    2.764706     4.264706    1.333333             0         0.9803922       0.01960784
# ...
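
Since the question also attempts hierarchical clustering, the expanded matrix can be passed to `dist` and `hclust` in the same way. The following is a minimal sketch on iris rather than on the asker's data; the choice of three groups and of the default linkage are assumptions made for illustration.

```r
data(iris)

# Numeric matrix with the Species factor expanded into indicator columns.
mm <- model.matrix(~ . + 0, data = iris)

# Euclidean distances between rows; no coercion warning because every
# column is already numeric.
d <- dist(mm)

# Hierarchical clustering (hclust uses complete linkage by default).
hc <- hclust(d)

# Cut the tree into 3 groups and compare them with the known species.
groups <- cutree(hc, k = 3)
table(groups, iris$Species)
```

Note that `dist` builds all pairwise distances, so memory use grows quadratically with the number of rows.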
