Mahout: CSV to vector and running the program


Question

I'm analysing the k-means algorithm with Mahout. I'm going to run some tests, observe performance, and compute some statistics on the results I get.

I can't figure out how to run my own program within Mahout. However, the command-line interface might be enough.

To run the sample program I do:

$ mahout seqdirectory --input uscensus --output uscensus-seq
$ mahout seq2sparse -i uscensus-seq -o uscensus-vec
$ mahout kmeans -i uscensus-vec/tfidf-vectors -o uscensus-kmeans-clusters -c uscensus-kmeans-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25

The dataset is one large CSV file. Each line is a record. Features are comma-separated. The first field is an ID. Because of the input format I cannot use seqdirectory right away. I'm trying to implement the answer to this similar question, "How to perform k-means clustering in mahout with vector data stored as CSV?" (http://stackoverflow.com/questions/8785392/how-to-perform-k-means-clustering-in-mahout-with-vector-data-stored-as-csv), but I still have two questions:

  1. How do I convert from CSV to SeqFile? I guess I can write my own program using Mahout to make this conversion and then use its output as input for seq2sparse. I guess I can use CSVIterator (https://cwiki.apache.org/confluence/display/MAHOUT/File+Format+Integrations). What class should I use to read and write?
  2. How do I build and run my new program? I couldn't figure it out with the book Mahout in Action or with other questions here.

Solution

For getting your data in SequenceFile format, you have a couple of strategies you can take. Both involve writing your own code -- i.e., not strictly command-line.

Strategy 1: Use Mahout's CSVVectorIterator class. You pass it a java.io.Reader and it will read in your CSV file, turning each row into a DenseVector. I've never used this, but saw it in the API. It looks straightforward enough if you're OK with DenseVectors.

Strategy 2: Write your own parser. This is really easy, since you just split each line on "," and you have an array you can loop through. For each array of values in each line, you instantiate a vector using something like this:

new DenseVector(<your array here>);

and add it to a List (for example).
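As a rough illustration of Strategy 2, here is a minimal sketch of the parsing step using only the Java standard library. The class and method names (CsvLineParser, Record, parseLine, parseAll) are hypothetical and not part of Mahout. It assumes the layout described in the question (first field is an ID, every remaining field is numeric) and that no field is quoted or contains a comma. With Mahout on the classpath, each record could then presumably be wrapped as new NamedVector(new DenseVector(record.values), record.id).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout: parses "id,f1,f2,..." lines.
class CsvLineParser {

    // One parsed CSV record: the ID from the first field plus the numeric features.
    static final class Record {
        final String id;
        final double[] values;

        Record(String id, double[] values) {
            this.id = id;
            this.values = values;
        }
    }

    // Splits one line on commas. Assumes no quoted fields or embedded commas.
    static Record parseLine(String line) {
        String[] fields = line.split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException(
                "Expected an ID and at least one feature: " + line);
        }
        double[] values = new double[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            values[i - 1] = Double.parseDouble(fields[i].trim());
        }
        return new Record(fields[0].trim(), values);
    }

    // Parses every non-blank line and collects the records in input order.
    static List<Record> parseAll(List<String> lines) {
        List<Record> records = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                records.add(parseLine(line));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample rows, only to show the shape of the input.
        List<Record> records = parseAll(Arrays.asList("42, 1.5,2.0,-3", "", "43,0,1,2"));
        for (Record r : records) {
            System.out.println(r.id + " -> " + Arrays.toString(r.values));
        }
    }
}
```

Keeping the ID separate from the feature values means it can serve as the vector name, and therefore as the SequenceFile key, rather than being treated as a feature by the clustering.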

Then, once you have a List of Vectors, you can write them to SequenceFiles using something like this (I'm using NamedVectors in the code below):

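import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

FileSystem fs = null;
SequenceFile.Writer writer;
Configuration conf = new Configuration();

List<NamedVector> vectors = <here's your List of vectors obtained from CSVVectorIterator>;

// Write the data to SequenceFile
try {
    fs = FileSystem.get(conf);

    Path path = new Path(<your path> + <your filename>);
    // Keys are the vector names (Text); values are the vectors (VectorWritable)
    writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);

    // Reuse a single VectorWritable for every record
    VectorWritable vec = new VectorWritable();
    for (NamedVector vector : vectors) {
        vec.set(vector);
        writer.append(new Text(vector.getName()), vec);
    }
    writer.close();

} catch (Exception e) {
    System.out.println("ERROR: " + e);
}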

Now you have a directory of "points" in SequenceFile format that you can use for your k-means clustering. You can point Mahout's command-line commands at this directory as input.

Anyway, that's the general idea. There are probably other approaches as well.
