Kmeans clustering identifying knowledge in R


Question

I am new to R and to clustering. I am using a shopping dataset and extracting features from it in order to identify something meaningful.

So far I have managed to learn how to merge files, remove NA values, compute the within-groups sum of squares, work out the mean values, summarise by group, run k-means clustering and plot the results as X against Y.

However, I am very confused about how to read these results and how to identify what would be a useful cluster. Am I repeating something or missing something? I also get confused about plotting the X and Y variables.

Below is my code; it may well be wrong. Could you please help? Any help would be great.

# Read file
mydata = read.csv(file.choose(), TRUE)

#view the file
View(mydata)

#create new data set
mydata.features = mydata

mydata.features <- na.omit(mydata.features)

wss <- (nrow(mydata.features)-1)*sum(apply(mydata.features,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(mydata.features, centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

# K-Means Cluster Analysis
fit <- kmeans(mydata.features, 3) 
# get cluster means 
aggregate(mydata.features,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata.features <- data.frame(mydata.features, fit$cluster)

results <- kmeans(mydata.features, 3)

plot(mydata[c("DAY","WEEK_NO")], col= results$cluster)

Sample data variables: below are all the variables in my dataset. It is a shopping dataset collected over 2 years.

PRODUCT_ID - uniquely identifies each product
household_key - uniquely identifies each household
BASKET_ID - uniquely identifies a purchase occasion
DAY - day when the transaction occurred
QUANTITY - number of products purchased during the trip
SALES_VALUE - amount of dollars the retailer receives from the sale
STORE_ID - identifies unique stores
RETAIL_DISC - discount applied due to a manufacturer coupon
TRANS_TIME - time of day when the transaction occurred
WEEK_NO - week in which the transaction occurred (1-102)
MANUFACTURER - code that links products from the same manufacturer together
DEPARTMENT - groups similar products together
BRAND - indicates private or national label brand
COMMODITY_DESC - groups similar products together at the lower level
SUB_COMMODITY_DESC - groups similar products together at the lowest level

Recommended answer

Sample data

I put together some sample data so that I can help you better:

#generate sample data
sampledata <- matrix(data=rnorm(200,0,1),50,4)

#add ID to data
sampledata <-cbind(sampledata, 1:50)

#show data:
head(sampledata)
            [,1]       [,2]       [,3]       [,4] [,5]
[1,]  0.72859559 -2.2864943 -0.5408501  0.1564730    1
[2,]  0.34852943  0.3100891  0.6007349 -0.5985266    2
[3,] -0.04605026  0.5067896 -0.2911211 -1.1617171    3
[4,] -1.88358617  1.3739440 -0.5655383  0.9518367    4
[5,]  0.35528650 -1.7482304 -0.3871520 -0.7837712    5
[6,]  0.38057682  0.1465488 -0.6006462  1.3827544    6

I have a matrix of data points. Each data point has 4 variables (columns 1 to 4) and an ID (column 5).

Applying k-means

After that I apply the kmeans function (but only to columns 1:4, since it does not make much sense to cluster on the ID):

#kmeans (4 centers)
result <- kmeans(sampledata[,1:4], 4)
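
For a data frame like the one in the question, the same idea could be sketched roughly as follows: keep only the numeric measurement columns, leave the identifiers out, and pass those columns to kmeans. The data frame `shop` and all of its values are made up for illustration; only the column names are borrowed from the question. Scaling with scale() and the nstart argument are extra steps that this answer does not use, added here as an assumption about what is sensible when columns have different units.

```r
# Hypothetical stand-in for the shopping data (all values are made up)
set.seed(1)
shop <- data.frame(
  household_key = 1:60,
  QUANTITY      = rpois(60, 3),
  SALES_VALUE   = round(runif(60, 1, 50), 2),
  RETAIL_DISC   = round(runif(60, 0, 5), 2)
)

# Cluster on the measurement columns only, not on the identifier
feature_cols <- c("QUANTITY", "SALES_VALUE", "RETAIL_DISC")
features <- scale(shop[, feature_cols])  # assumption: bring columns to a comparable scale

# nstart = 10 reruns k-means from 10 random starts and keeps the best fit
fit <- kmeans(features, centers = 3, nstart = 10)

# Keep the cluster label next to the original rows
shop$cluster <- fit$cluster
head(shop)
```

Storing the label as an extra column keeps the identifier available for looking up which households ended up in which cluster, without letting it influence the clustering itself.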

Analysing the output

If I want to see which data point belongs to which cluster, I can type:

result$cluster

The result will look something like this:

[1] 4 3 2 2 1 2 4 4 3 3 3 3 2 1 4 4 4 2 4 4 4 1 1 1 3 3 3 3 1 3 2 2 4 4 2 4 2 3 1 2 2 2 1 2 1 1 4 1 1 1

This means that data point 1 belongs to cluster 4, the second data point belongs to cluster 3, and so on. If I want to retrieve all data points that are in cluster 1, I can do the following:

sampledata[result$cluster==1,]

This outputs a matrix with all the values, and the data point ID in the last column:

            [,1]         [,2]       [,3]        [,4] [,5]
 [1,]  0.3552865 -1.748230422 -0.3871520 -0.78377121    5
 [2,]  0.5806156  0.479576142  1.1314052  1.60730796   14
 [3,]  1.1871472  1.280881477 -1.7227361 -0.89045074   22
 [4,]  0.8482060  0.726470349  0.6851352 -0.78526581   23
 [5,] -0.5324139 -1.745802580  0.6779943  0.99915708   24
 [6,]  0.2472263 -0.006298136 -0.1457003 -0.44789364   29
 [7,]  0.1412812 -0.247076976  0.9181507 -0.58570904   39
 [8,]  0.1859786 -1.768692166  0.5681229 -0.80618157   43
 [9,] -1.1577178 -0.179886998  1.5183880  0.40014071   45
[10,]  1.0667566 -1.602875994  0.6010581 -0.49514049   46
[11,]  0.2464646  1.226129859 -1.3628096 -0.37666716   48
[12,]  1.2660358  0.282688323  0.7650636  0.23442255   49
[13,] -0.2499337  0.855327072  0.2290221  0.03492807   50

If I want to know how many data points are in cluster 1, I can type:

sum(result$cluster==1)

This returns 13, which corresponds to the number of rows in the matrix above.
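
The same kind of inspection can be done for all clusters at once. The sketch below is one possible way, using table() for the cluster sizes and aggregate() for the per-cluster means (the latter is the call already used in the question). It regenerates its own sample data with a seed chosen here for illustration, so the numbers will not match the output printed above.

```r
# Regenerate sample data like the one above (values differ from the printed output)
set.seed(42)
sampledata <- matrix(rnorm(200, 0, 1), 50, 4)
sampledata <- cbind(sampledata, 1:50)
result <- kmeans(sampledata[, 1:4], 4)

# Number of data points in every cluster at once
sizes <- table(result$cluster)
print(sizes)

# IDs (column 5) of the data points that fall in cluster 1
ids_cluster1 <- sampledata[result$cluster == 1, 5]
print(ids_cluster1)

# Mean of each variable per cluster; this should agree with result$centers
cluster_means <- aggregate(as.data.frame(sampledata[, 1:4]),
                           by = list(cluster = result$cluster),
                           FUN = mean)
print(cluster_means)
```

Comparing the per-cluster means is usually the first step in deciding whether a cluster is useful: clusters whose means barely differ on any variable say little about the data.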

Finally, some plotting:

First, let's plot the data. Since you have a multidimensional data frame and a standard plot can only show two dimensions, you have to select the two variables you want to plot, for example variables 2 and 3 (columns 2 and 3). This corresponds to:

sampledata[,2:3]

To plot this data, simply write:

plot(sampledata[,2:3], col=result$cluster ,main="Affiliation of observations") 

Use the argument col (short for colours) to colour the data points according to their cluster affiliation by typing col=result$cluster.

If you also want to see the cluster centers in the plot, add the following line:

# use the same columns of the centers (2 and 3) as in the plot above
points(result$centers[,2:3], col=1:4, pch="x", cex=3)

The plot should now look like this (for variable 2 vs variable 3):

(The dots are the data points, the X's are the cluster centers)
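
Because a single plot only shows two of the variables, it can help to look at every pair of variables at once. One way to do that, which is an addition and not part of the answer above, is the base R function pairs(). The sketch regenerates its own sample data with an arbitrary seed.

```r
# Regenerate sample data like the one above (4 variables, no ID column)
set.seed(42)
sampledata <- matrix(rnorm(200, 0, 1), 50, 4)
result <- kmeans(sampledata, 4)

# Scatterplot matrix: one panel per pair of variables, coloured by cluster
pairs(sampledata, col = result$cluster, main = "Affiliation of observations")
```

With purely random data like this, the colours will overlap heavily in most panels; on real data, a pair of variables in which the colours separate cleanly is a good candidate for the single X/Y plot.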
