Dumping clustering results with vector names

Question

Since I'm using Mahout 0.7, the clusterdump command didn't work as described in Mahout in Action, but I got it to work like this:

    export HADOOP_CLASSPATH=/path/to/mahout-distribution-0.7/core/target/mahout-core-0.7-job.jar:/path/to/mahout-distribution-0.7/integration/target/mahout-integration-0.7.jar
    hadoop jar core/target/mahout-core-0.7-job.jar org.apache.mahout.utils.clustering.ClusterDumper -i /clustering/out/clusters-20-final -o textout -of TEXT

and I get lines like this one:

VL-1383471{n=192 c=[0.180, -0.087, 0.281, 0.512, 0.678, 1.833, 2.613, 0.313, 0.226, 1.023, 0.229, -0.104, -0.461, -0.553, -0.318, 0.315, 0.658, 0.245, 0.635, 0.220, 0.660, 0.193, 0.277, -0.182, 0.497, 0.346, 0.658, 0.660, 0.191, 0.660, 0.636, 0.018, 0.519, 0.335, 0.535, 0.008, -0.028, 0.461, 0.229, 0.287, 0.619, 0.509, 0.566, 0.389, -0.075, -0.180, -0.461, 0.381, -0.108, 0.126, -0.728] r=[0.983, 0.890, 0.384, 0.823, 0.702, 0.000, 0.000, 1.132, 0.605, 0.979, 0.897, 0.862, 0.438, 0.546, 0.390, 0.171, 0.257, 0.234, 0.251, 0.106, 0.257, 0.093, 0.929, 0.077, 0.204, 0.218, 0.257, 0.257, 0.258, 0.257, 0.249, 0.112, 0.217, 0.157, 0.284, 0.197, 0.228, 0.229, 0.323, 0.401, 0.248, 0.217, 0.269, 1.002, 0.819, 0.706, 0.412, 0.964, 0.787, 0.872, 0.172]}

which is not yet useful to me, since I need the names of the vectors in each cluster. I saw that a dictionary file is created for text documents. How would I create a dictionary for my data?

Also, using -of CSV gives me an empty file. Am I doing something wrong?

Another thing I tried was to read the clusters-20-final/part-m-00000 file directly, as is done in listing 7.2 of Mahout in Action. It turns out the file doesn't contain WeightedVectorWritable but ClusterWritable, from which I can get the Cluster instance but none of the vectors that belong to it.
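For reference, the ClusterDumper line quoted earlier can be read field by field: the VL- prefix marks a converged cluster (CL- an unconverged one), the number after it is the cluster id, n is the number of points assigned, c is the center and r is the radius. The sketch below is a plain-Java illustration of that reading and is not part of Mahout; the class name `ClusterDumpLine` and its fields are hypothetical, and it assumes the dense numeric form shown in the question (no `index:value` or `term:value` pairs).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Mahout class: parses one TEXT-format dump line.
class ClusterDumpLine {
    // Matches e.g. "VL-1383471{n=192 c=[0.180, -0.087] r=[0.983, 0.890]}"
    private static final Pattern LINE = Pattern.compile(
        "^(VL|CL)-(\\d+)\\{n=(\\d+) c=\\[([^\\]]*)\\] r=\\[([^\\]]*)\\]\\}$");

    final boolean converged;   // true for the "VL-" prefix
    final int id;              // cluster id
    final long n;              // number of points in the cluster
    final List<Double> center; // values from c=[...]
    final List<Double> radius; // values from r=[...]

    private ClusterDumpLine(boolean converged, int id, long n,
                            List<Double> center, List<Double> radius) {
        this.converged = converged;
        this.id = id;
        this.n = n;
        this.center = center;
        this.radius = radius;
    }

    static ClusterDumpLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized cluster line: " + line);
        }
        return new ClusterDumpLine(
            "VL".equals(m.group(1)),
            Integer.parseInt(m.group(2)),
            Long.parseLong(m.group(3)),
            parseList(m.group(4)),
            parseList(m.group(5)));
    }

    // Splits a comma-separated list of numbers, skipping empty entries.
    private static List<Double> parseList(String csv) {
        List<Double> out = new ArrayList<>();
        for (String s : csv.split(",")) {
            String t = s.trim();
            if (!t.isEmpty()) {
                out.add(Double.parseDouble(t));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ClusterDumpLine c = parse("VL-1383471{n=192 c=[0.180, -0.087] r=[0.983, 0.890]}");
        System.out.println("cluster " + c.id + " has " + c.n + " points, "
            + c.center.size() + " dimensions");
    }
}
```

As the parsed fields make clear, the line summarizes the cluster only; none of them identifies the member vectors, which is why the names have to come from somewhere else.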

Recommended answer

A bit late, but this might help someone somewhere, sometime.

When running

    KMeansDriver.run(input, clustersIn, outputPath, measure, convergenceDelta, maxIterations, true, 0.0, false);

one of the outputs is a directory called clusteredPoints. It contains a part file with all the clustered vectors, keyed by cluster. This means that something like this

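    // Imports (package names as of Mahout 0.7/0.8):
    // org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
    // org.apache.hadoop.fs.Path, org.apache.hadoop.io.IntWritable,
    // org.apache.hadoop.io.SequenceFile, org.apache.mahout.clustering.Cluster,
    // org.apache.mahout.clustering.classify.WeightedVectorWritable,
    // org.apache.mahout.math.NamedVector

    IntWritable key = new IntWritable();                          // cluster id
    WeightedVectorWritable value = new WeightedVectorWritable();  // clustered point

    // outputPath is the same output path that was passed to KMeansDriver.run
    Path clusteredPoints = new Path(outputPath + "/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000");

    FileSystem fs = FileSystem.get(clusteredPoints.toUri(), new Configuration());

    // try-with-resources closes the reader; any IOException propagates to the caller
    try (SequenceFile.Reader reader = new SequenceFile.Reader(fs, clusteredPoints, fs.getConf())) {
        while (reader.next(key, value)) {
            // Do something useful here.
            // The cast only works if the input vectors were written as NamedVector.
            String name = ((NamedVector) value.getVector()).getName();
            System.out.println(key.get() + "\t" + name);
        }
    }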

seems to do the trick. Using something like this, I was able to get a good sense of what was clustered where when running my tests with k-means clustering and Mahout.

I was using Mahout 0.8 when I did this.
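To turn the reader loop above into a per-cluster listing, one option is to collect each vector name into a map keyed by cluster id. The following is a minimal sketch in plain Java collections, so it runs without Hadoop or Mahout on the classpath; the class `ClusterMembers`, its methods, and the sample ids and names are all hypothetical. In the real loop, the two arguments to `add` would presumably be `key.get()` and `((NamedVector) value.getVector()).getName()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: groups vector names by the cluster they were assigned to.
class ClusterMembers {
    // TreeMap keeps cluster ids sorted, which makes the printed listing stable.
    private final Map<Integer, List<String>> namesByCluster = new TreeMap<>();

    // Record that the vector called vectorName belongs to cluster clusterId.
    void add(int clusterId, String vectorName) {
        namesByCluster.computeIfAbsent(clusterId, k -> new ArrayList<>()).add(vectorName);
    }

    Map<Integer, List<String>> asMap() {
        return namesByCluster;
    }

    public static void main(String[] args) {
        ClusterMembers members = new ClusterMembers();
        // Made-up sample pairs standing in for the (key, value) records
        // read from clusteredPoints/part-m-00000.
        members.add(7, "item-a");
        members.add(3, "item-b");
        members.add(7, "item-c");

        for (Map.Entry<Integer, List<String>> e : members.asMap().entrySet()) {
            System.out.println("cluster " + e.getKey() + ": " + e.getValue());
        }
    }
}
```

Holding every name in memory is fine for test-sized data; for large outputs, writing each pair out as it is read may be the safer choice.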
