How to perform k-means clustering in Mahout with vector data stored as CSV?

Question

I have a file containing vectors of data, where each row is a comma-separated list of values. How can I perform k-means clustering on this data using Mahout? The example in the wiki mentions creating SequenceFiles, but I am not sure whether I need to do some kind of conversion to obtain them.

Recommended answer

I would recommend manually reading in the entries from the CSV file, creating NamedVectors from them, and then using a sequence file writer to write the vectors to a sequence file. From there on, the KMeansDriver run method should know how to handle these files.

Sequence files encode key-value pairs, so the key would be an ID of the sample (it should be a string), and the value would be a VectorWritable wrapper around the vector.

Here is a simple code sample showing how to do this:

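    import java.util.LinkedList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.VectorWritable;

    // Build the list of named vectors (in practice, one per CSV row).
    List<NamedVector> vectors = new LinkedList<NamedVector>();
    NamedVector v1 = new NamedVector(new DenseVector(new double[] {0.1, 0.2, 0.5}), "Item number one");
    vectors.add(v1);

    Configuration config = new Configuration();
    FileSystem fs = FileSystem.get(config);

    Path path = new Path("datasamples/data");

    // Write a SequenceFile from the vectors: key = vector name, value = VectorWritable.
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path, Text.class, VectorWritable.class);
    VectorWritable vec = new VectorWritable();
    for (NamedVector v : vectors) {
        vec.set(v);
        // Append the VectorWritable wrapper, not the raw NamedVector.
        writer.append(new Text(v.getName()), vec);
    }
    writer.close();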
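
The sample above hard-codes a single vector. The CSV-reading step that the answer describes could be sketched roughly as follows, using only the Java standard library. The class name `CsvVectorReader`, the helpers `parseRow` and `readRows`, and the generated `row-N` IDs are all hypothetical names chosen for illustration; the sketch also assumes a plain numeric CSV with no header row and no quoted fields.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CsvVectorReader {

    // Parses one comma-separated row into an array of doubles.
    static double[] parseRow(String line) {
        String[] fields = line.split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    // Converts all non-blank lines into vectors keyed by a generated string ID
    // ("row-0", "row-1", ...). The ID scheme is an assumption for this sketch.
    static Map<String, double[]> readRows(List<String> lines) {
        Map<String, double[]> rows = new LinkedHashMap<>();
        int index = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            rows.put("row-" + index, parseRow(line));
            index++;
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Read the file given as the first argument, or fall back to inline sample rows.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList("0.1,0.2,0.5", "1.0, 2.0, 3.0");
        for (Map.Entry<String, double[]> row : readRows(lines).entrySet()) {
            System.out.println(row.getKey() + " -> " + Arrays.toString(row.getValue()));
        }
    }
}
```

Each entry of the resulting map could then be turned into a `NamedVector` (the key as the name, the array wrapped in a `DenseVector`) and appended to the sequence file as in the sample above.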

Also, I would recommend reading chapter 8 of Mahout in Action. It gives more details on data representation in Mahout.
