How to read Mahout clustering output

Question

I have run the k-Means clustering algorithm on the synthetic control data from the Mahout tutorial, and was wondering if someone could explain how to interpret the output. I ran clusterdump and received output that looks something like this (truncated to save space):

CL-592{n=57 c=[30.726, 29.813...] r=[3.528, 3.597...]}
Weight : [props - optional]: Point:
1.0 : [distance=27.453962995925863]: [24.672, 35.261, 30.486...]
1.0 : [distance=27.675053294846002]: [25.592, 29.951, 34.188...]
1.0 : [distance=28.97727289419493]: [30.696, 32.667, 34.223...]
1.0 : [distance=21.999685652862784]: [32.702, 35.219, 30.143...]
...
CL-598{n=50 c=[29.611, 29.769...] r=[3.166, 3.561...]}
Weight : [props - optional]:  Point:
1.0 : [distance=27.266203490250472]: [27.679, 33.506, 23.594...]
1.0 : [distance=28.749781351838173]: [34.727, 28.325, 30.331...]
1.0 : [distance=32.635136046420186]: [27.758, 33.859, 29.879...]
1.0 : [distance=29.328974057024624]: [29.356, 26.793, 25.575...]

Could someone explain to me how to read this? From what I understand, CL-__ is a cluster ID, followed by n = the number of points in the cluster, c = the centroid as a vector, r = the radius as a vector, and then each point in the cluster. Is this correct? Furthermore, how do I know which clustered point matches which input point? That is, are the points described as key-value pairs, where the key is some kind of ID for the point and the value is the vector? If not, is there some way I can set it up so that they are?

Solution

I believe your interpretation of the data is correct (I've only been working with Mahout for ~3 weeks, so someone more seasoned should probably weigh in on this).
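To make that reading of the header line concrete, here is a minimal sketch that pulls the cluster ID, n, c and r out of a line shaped like the ones in the question. It is plain Java and not part of Mahout: the class name ClusterHeader and the regular expression are my own assumptions, based only on the dump shown above. The pattern accepts any uppercase prefix before the number because, as far as I know, Mahout prints VL- instead of CL- for clusters that have converged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Mahout class): one parsed clusterdump header line. */
final class ClusterHeader {
    // Matches e.g. "CL-592{n=57 c=[30.726, 29.813...] r=[3.528, 3.597...]}"
    private static final Pattern HEADER = Pattern.compile(
        "^\\s*([A-Z]+-\\d+)\\s*\\{\\s*n\\s*=\\s*(\\d+)"
        + "\\s+c\\s*=\\s*\\[(.*?)\\]"
        + "\\s+r\\s*=\\s*\\[(.*?)\\]\\s*\\}\\s*$");

    final String id;        // cluster identifier, e.g. "CL-592"
    final int n;            // number of points in the cluster
    final String centroid;  // raw text between the brackets after c=
    final String radius;    // raw text between the brackets after r=

    private ClusterHeader(String id, int n, String centroid, String radius) {
        this.id = id;
        this.n = n;
        this.centroid = centroid;
        this.radius = radius;
    }

    /** Parses a header line, or throws if the line does not have the expected shape. */
    static ClusterHeader parse(String line) {
        Matcher m = HEADER.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a cluster header line: " + line);
        }
        return new ClusterHeader(
            m.group(1), Integer.parseInt(m.group(2)), m.group(3), m.group(4));
    }
}
```

Calling ClusterHeader.parse on the second header line from the question gives an object whose id is CL-598 and whose n is 50. The centroid and radius are kept as raw text because the dump truncates them with "...", so they cannot be turned back into full vectors anyway.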

As far as linking points back to the input that created them, I've used NamedVector, where the name is the key for the vector. When you read one of the generated points files (clusteredPoints), you can convert each row (point vector) back into a NamedVector and retrieve the name using .getName().

Update in response to comment

When you initially read your data into Mahout, you convert it into a collection of vectors, which you then write to a file (points) for use in the clustering algorithms later. Mahout gives you several Vector types to choose from, and it also provides a Vector wrapper class called NamedVector, which lets you identify each vector.

For example, you could create each NamedVector as follows:

NamedVector nVec = new NamedVector(
    new SequentialAccessSparseVector(vectorDimensions), 
    vectorName
    );

Then you write your collection of NamedVectors to a file with something like:

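SequenceFile.Writer writer = new SequenceFile.Writer(...);
VectorWritable writable = new VectorWritable();

// the next two lines will be in a loop, but I'm omitting it for clarity
writable.set(nVec);
// append the VectorWritable wrapper (not the raw vector), keyed by the vector's name
writer.append(new Text(nVec.getName()), writable);

// after the loop, close the writer so the file is flushed to disk
writer.close();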

You can now use this file as input to one of the clustering algorithms.

After you run one of the clustering algorithms on your points file, it generates another points file, this time in a directory named clusteredPoints.

You can then read in this points file and extract the name you associated to each vector. It'll look something like this:

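// reader is a SequenceFile.Reader opened on a points file under clusteredPoints
IntWritable clusterId = new IntWritable();
WeightedPropertyVectorWritable vector = new WeightedPropertyVectorWritable();

while (reader.next(clusterId, vector))
{
    // this cast only works if the input vectors were written as NamedVectors
    NamedVector nVec = (NamedVector)vector.getVector();
    // you now have access to the original name using nVec.getName()
}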
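
Once that loop gives you a cluster ID and a name for each point, you will probably want to collect them somewhere. Below is a small sketch of one way to do it, grouping names by cluster ID in a map. It is plain Java with no Hadoop or Mahout dependency, and the class ClusterMembership and its methods are hypothetical names of mine rather than anything Mahout provides. Inside the loop you would call add(clusterId.get(), nVec.getName()).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper (not a Mahout class): remembers which named points landed in which cluster. */
final class ClusterMembership {
    // TreeMap keeps the cluster IDs sorted, which makes printed output easier to scan
    private final Map<Integer, List<String>> namesByCluster = new TreeMap<>();

    /** Records that the vector called vectorName was assigned to clusterId. */
    void add(int clusterId, String vectorName) {
        namesByCluster.computeIfAbsent(clusterId, k -> new ArrayList<>()).add(vectorName);
    }

    /** Returns the names recorded for clusterId, in insertion order (empty if none). */
    List<String> namesIn(int clusterId) {
        List<String> names = namesByCluster.get(clusterId);
        return names == null
            ? Collections.<String>emptyList()
            : Collections.unmodifiableList(names);
    }

    /** Returns the number of distinct clusters seen so far. */
    int clusterCount() {
        return namesByCluster.size();
    }
}
```

After the loop finishes, namesIn(592) would list the names of every input vector assigned to the cluster printed as CL-592 in the dump, which answers the "which clustered point matches which input point" part of the question.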
