Interpreting output from mahout clusterdumper


Problem description


I ran a clustering test on crawled pages (more than 25K docs; personal data set). I've done a clusterdump:

    $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-1/ --output clusteranalyze.txt

The output after running the cluster dumper shows 25 elements of the form "VL-xxxxx {...}":

    VL-24130{n=1312 c=[0:0.017, 10:0.007, 11:0.005, 14:0.017, 31:0.016, 35:0.006, 41:0.010, 43:0.008, 52:0.005, 59:0.010, 68:0.037, 72:0.056, 87:0.028, ...] r=[0:0.442, 10:0.271, 11:0.198, 14:0.369, 31:0.421, ...]}
    ...
    VL-24868{n=311 c=[0:0.042, 11:0.016, 17:0.046, 72:0.014, 96:0.044, 118:0.015, 135:0.016, 195:0.017, 318:0.040, 319:0.037, 320:0.036, 330:0.030, ...] r=[0:0.740, 11:0.287, 17:0.576, 72:0.239, 96:0.549, 118:0.273, ...]}

How do I interpret this output?

In short: I am looking for the document ids which belong to a particular cluster.

What is the meaning of:

• VL-x?
• n=y c=[z:z', ...]
• r=[z'':z''', ...]

Does 0:0.017 mean that "0" is the id of a document which belongs to this cluster?

I have already read on the Mahout wiki pages what CL, n, c and r mean, but can someone please explain them to me better, or point me to a resource where they are explained in a bit more detail?

Sorry if I am asking some stupid questions, but I am a newbie with Apache Mahout and am using it as part of my course assignment on clustering.

Solution

1. By default, k-means clustering uses WeightedVector, which does not include the name of the data point. So you would want to make the sequence file yourself using NamedVector. There is a one-to-one correspondence between the number of seq files and the map tasks, so if your map capacity is 12, you want to chop your data into 12 pieces when making the seqfiles of NamedVectors (see the sketch after this snippet):

    vector = new NamedVector(new SequentialAccessSparseVector(Cardinality), arrField[0]);
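
For context, here is a minimal, hypothetical sketch of writing such a seqfile of NamedVectors. The output path, the cardinality, and the "doc-0001" id are illustrative assumptions; the Text key / VectorWritable value layout is Mahout's usual convention for vector sequence files:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class NamedVectorSeqFileWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Illustrative output path; with a map capacity of 12 you would
            // write 12 such files, one per chunk of the data.
            Path path = new Path("vectors/part-00000");
            int cardinality = 10; // assumed dimensionality of the term space

            SequenceFile.Writer writer = new SequenceFile.Writer(
                    fs, conf, path, Text.class, VectorWritable.class);
            try {
                Vector v = new SequentialAccessSparseVector(cardinality);
                v.set(0, 0.017); // term index 0 with weight 0.017
                // The NamedVector wrapper is what carries the document id
                // through clustering and into the clusteredPoints output.
                NamedVector nv = new NamedVector(v, "doc-0001");
                writer.append(new Text(nv.getName()), new VectorWritable(nv));
            } finally {
                writer.close();
            }
        }
    }

Dumping such a file with mahout seqdumper should then show the document id alongside each vector.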

2. Basically, you need to download the clusteredPoints from your HDFS system and write your own code to output the results. Here is the code I wrote to output the cluster point membership:

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Set;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.clustering.WeightedVectorWritable;
    import org.apache.mahout.math.NamedVector;

    public class ClusterOutput {

        /**
         * @param args args[0]: local directory holding the clusteredPoints part files;
         *             args[1]: output file for per-document cluster membership;
         *             args[2]: output file for per-cluster point counts
         */
        public static void main(String[] args) {
            try {
                BufferedWriter bw;
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);
                File pointsFolder = new File(args[0]);
                File[] files = pointsFolder.listFiles();
                bw = new BufferedWriter(new FileWriter(new File(args[1])));
                // cluster id -> number of points assigned to that cluster
                HashMap<String, Integer> clusterIds = new HashMap<String, Integer>(5000);
                for (File file : files) {
                    // only the sequence files written by the map tasks are relevant
                    if (file.getName().indexOf("part-m") < 0)
                        continue;
                    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(file.getAbsolutePath()), conf);
                    IntWritable key = new IntWritable();
                    WeightedVectorWritable value = new WeightedVectorWritable();
                    while (reader.next(key, value)) {
                        // this cast only works if the points were built as NamedVectors (see step 1)
                        NamedVector vector = (NamedVector) value.getVector();
                        String vectorName = vector.getName();
                        bw.write(vectorName + "\t" + key.toString() + "\n");
                        if (clusterIds.containsKey(key.toString())) {
                            clusterIds.put(key.toString(), clusterIds.get(key.toString()) + 1);
                        } else {
                            clusterIds.put(key.toString(), 1);
                        }
                    }
                    bw.flush();
                    reader.close();
                }
                bw.flush();
                bw.close();
                // second output file: how many points ended up in each cluster
                bw = new BufferedWriter(new FileWriter(new File(args[2])));
                Set<String> keys = clusterIds.keySet();
                for (String key : keys) {
                    bw.write(key + " " + clusterIds.get(key) + "\n");
                }
                bw.flush();
                bw.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
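
To use this, first copy the clusteredPoints directory out of HDFS (for example with hadoop fs -get), then run the class with three arguments: the local clusteredPoints directory, an output file that will receive one "document id, cluster id" line per point, and a second output file that will receive the per-cluster point counts.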
