Does Mahout provide a way to determine similarity between content (for content-based recommendations)?


Question

Does Mahout provide a way to determine similarity between content?

I would like to produce content-based recommendations as part of a web application. I know Mahout is good at taking user-ratings matrices and producing recommendations from them, but I am not interested in collaborative (ratings-based) recommendations. I want to score how well two pieces of text match, and then recommend the items that match most closely to the text I store for each user in their profile...

I've read Mahout's documentation, and it looks like it mainly facilitates collaborative (ratings-based) recommendations rather than content-based ones... Is this true?

Answer

That is not entirely true. Mahout does not have a content-based recommender, but it does have algorithms for computing similarities between items based on their content. One of the most popular approaches is TF-IDF with cosine similarity. However, the computation is not done on the fly; it is done offline, and you need Hadoop to compute the pairwise content similarities reasonably fast. The steps below are for Mahout 0.8; I am not sure whether they changed in 0.9.
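To make the idea concrete, here is a minimal, self-contained sketch of cosine similarity over sparse term-weight vectors in plain Java with no Mahout dependency. The two document vectors and their weights are made-up values for illustration, not Mahout output:

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSketch {
    // Cosine similarity between two sparse term-weight vectors:
    // dot(a, b) / (||a|| * ||b||). With non-negative TF-IDF weights
    // the result lies in [0, 1].
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;   // shared terms only
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> doc1 = new HashMap<>();
        doc1.put("mahout", 1.2);
        doc1.put("recommender", 0.8);
        Map<String, Double> doc2 = new HashMap<>();
        doc2.put("mahout", 0.9);
        doc2.put("hadoop", 1.5);
        System.out.printf("%.4f%n", cosine(doc1, doc2)); // prints 0.4281
    }
}
```

This is the same measure Mahout applies offline; the Hadoop jobs in the steps that follow simply distribute this pairwise computation over all items.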

Step 1. You need to convert your text documents into sequence files. I lost the exact command for Mahout 0.8, but in 0.9 it is something like this (please check it against your version of Mahout):

$MAHOUT_HOME/bin/mahout seqdirectory
--input <PARENT DIR WHERE DOCS ARE LOCATED> --output <OUTPUT DIRECTORY>
<-c <CHARSET NAME OF THE INPUT DOCUMENTS> {UTF-8|cp1252|ascii...}>
<-chunk <MAX SIZE OF EACH CHUNK in Megabytes> 64>
<-prefix <PREFIX TO ADD TO THE DOCUMENT ID>>

Step 2. You need to convert the sequence files into sparse vectors, like this:

$MAHOUT_HOME/bin/mahout seq2sparse \
   -i <SEQ INPUT DIR> \
   -o <VECTORS OUTPUT DIR> \
   -ow -chunk 100 \
   -wt tfidf \
   -x 90 \
   -seq \
   -ml 50 \
   -md 3 \
   -n 2 \
   -nv \
   -Dmapred.map.tasks=1000 -Dmapred.reduce.tasks=1000

where:

  • chunk is the chunk size, in megabytes, used when writing the dictionary and frequency files.
  • x is the maximum document frequency, as a percentage, for a term to be kept in the dictionary. A term occurring in more than x percent of the documents is treated as a stop word and discarded.
  • wt is the weighting scheme (here tfidf).
  • md is the minimum number of documents a term must occur in to be part of the dictionary. Any term with a lower document frequency is ignored.
  • n is the normalization value to use in the Lp space. The default is not to normalize the weights; 2 (the L2 norm) works well with the cosine distance and similarity we are using.
  • nv produces named vectors, which makes the resulting data files easier to inspect.
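The effect of -n 2 (L2 normalization) can be sketched as follows: each weight vector is scaled so its Euclidean norm is 1, after which a plain dot product between two vectors equals their cosine similarity. The example vector is arbitrary:

```java
public class L2NormSketch {
    // Scale a weight vector so that sqrt(sum of squares) == 1 (L2 norm).
    static double[] l2Normalize(double[] weights) {
        double sumSq = 0.0;
        for (double w : weights) sumSq += w * w;
        double norm = Math.sqrt(sumSq);
        double[] out = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            out[i] = (norm == 0.0) ? 0.0 : weights[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] v = l2Normalize(new double[] {3.0, 4.0}); // norm is 5
        System.out.println(v[0] + " " + v[1]); // prints 0.6 0.8
    }
}
```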

Step 3. Create a matrix from the vectors:

$MAHOUT_HOME/bin/mahout rowid -i <VECTORS OUTPUT DIR>/tfidf-vectors/part-r-00000 -o <MATRIX OUTPUT DIR>
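Conceptually, rowid assigns each document a dense integer row index and records the mapping back to the original long ids in a docIndex file, which is what we read back later. A minimal stand-in for that mapping (illustrative plain Java, not Mahout code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RowIdSketch {
    // Assign each (long) item id a dense int row index and keep the
    // reverse mapping, mirroring what the docIndex file stores.
    static Map<Integer, Long> buildDocIndex(List<Long> itemIds) {
        Map<Integer, Long> docIndex = new HashMap<>();
        for (int row = 0; row < itemIds.size(); row++) {
            docIndex.put(row, itemIds.get(row));
        }
        return docIndex;
    }

    public static void main(String[] args) {
        Map<Integer, Long> docIndex = buildDocIndex(List.of(101L, 205L, 999L));
        System.out.println(docIndex.get(2)); // prints 999
    }
}
```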

Step 4. Create a collection of similar docs for each row of the matrix above. This generates the 50 most similar documents for each document in the collection:

 $MAHOUT_HOME/bin/mahout rowsimilarity -i <MATRIX OUTPUT DIR>/matrix -o <SIMILARITY OUTPUT DIR> -r <NUM OF COLUMNS FROM THE OUTPUT IN STEP 3> --similarityClassname SIMILARITY_COSINE -m 50 -ess -Dmapred.map.tasks=1000 -Dmapred.reduce.tasks=1000

This produces a file that, for each item, lists its top 50 most similar items based on content.

Now, to use this in your recommendation process, you need to read that file or load it into a database, depending on how much memory you have. I loaded it into main memory as a Collection<GenericItemSimilarity.ItemItemSimilarity>. Here are two simple functions that did the job for me:

// Imports assumed (Hadoop, Mahout 0.8 and Trove on the classpath):
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;

import gnu.trove.map.hash.TIntLongHashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity.ItemItemSimilarity;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.math.VectorWritable;

// Reads the part-r-* similarity files produced by rowsimilarity and turns them
// into Mahout ItemItemSimilarity entries. "logger" is any logging facade
// (e.g. slf4j) assumed to be defined in the enclosing class.
public static Collection<GenericItemSimilarity.ItemItemSimilarity> correlationMatrix(final File folder, TIntLongHashMap docIndex) throws IOException {
    Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix =
            new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();

    ItemItemSimilarity itemItemCorrelation = null;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    int n = 0;
    for (final File fileEntry : folder.listFiles()) {
        if (fileEntry.isFile() && fileEntry.getName().startsWith("part-r")) {

            SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(fileEntry.getAbsolutePath()), conf);

            IntWritable key = new IntWritable();
            VectorWritable value = new VectorWritable();
            while (reader.next(key, value)) {

                // Map the matrix row id back to the original item id
                long itemID1 = docIndex.get(key.get());

                Iterator<Element> it = value.get().nonZeroes().iterator();

                while (it.hasNext()) {
                    Element next = it.next();
                    long itemID2 = docIndex.get(next.index());
                    double similarity = next.get();

                    // Clamp to [-1, 1]; similarities outside this range are invalid
                    if (similarity < -1.0) {
                        similarity = -1.0;
                    } else if (similarity > 1.0) {
                        similarity = 1.0;
                    }

                    itemItemCorrelation = new GenericItemSimilarity.ItemItemSimilarity(itemID1, itemID2, similarity);
                    corrMatrix.add(itemItemCorrelation);
                }
            }
            reader.close();
            n++;
            logger.info("File " + fileEntry.getName() + " read (" + n + "/" + folder.listFiles().length + ")");
        }
    }

    return corrMatrix;
}


// Reads the docIndex sequence file written by the rowid job:
// matrix row id (int) -> original item id (long)
public static TIntLongHashMap getDocIndex(String docIndex) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        TIntLongHashMap map = new TIntLongHashMap();
        SequenceFile.Reader docIndexReader = new SequenceFile.Reader(fs, new Path(docIndex), conf);

        IntWritable key = new IntWritable();
        Text value = new Text();
        while (docIndexReader.next(key, value)) {
            map.put(key.get(), Long.parseLong(value.toString()));
        }

        return map;
    }

At the end, in your recommendation class, you call something like this:

TIntLongHashMap docIndex = ItemPairwiseSimilarityUtil.getDocIndex(filename);
Collection<GenericItemSimilarity.ItemItemSimilarity> correlationMatrix = ItemPairwiseSimilarityUtil.correlationMatrix(folder, docIndex);

Here filename is the path of your docIndex file, and folder is the directory containing the item-similarity files. In the end, this is nothing more than item-item based recommendation.
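The recommendation step itself can be sketched without Mahout: score each candidate item by accumulating its precomputed similarity to the items already in the user's profile, then rank the candidates. The class name and data layout below are illustrative, not Mahout's API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ItemItemRecommenderSketch {
    // similarities.get(a).get(b) = precomputed content similarity between items a and b.
    // Returns up to howMany item ids, ranked by accumulated similarity to the profile.
    static List<Long> recommend(Map<Long, Map<Long, Double>> similarities,
                                Set<Long> userProfile, int howMany) {
        Map<Long, Double> scores = new HashMap<>();
        for (long liked : userProfile) {
            Map<Long, Double> neighbors = similarities.getOrDefault(liked, Map.of());
            for (Map.Entry<Long, Double> e : neighbors.entrySet()) {
                if (userProfile.contains(e.getKey())) continue; // already known to the user
                scores.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        List<Long> ranked = new ArrayList<>(scores.keySet());
        ranked.sort(Comparator.comparingDouble((Long id) -> scores.get(id)).reversed());
        return ranked.subList(0, Math.min(howMany, ranked.size()));
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Double>> sims = new HashMap<>();
        sims.put(1L, Map.of(2L, 0.9, 3L, 0.4));
        sims.put(5L, Map.of(3L, 0.6, 4L, 0.2));
        // Item 3 is similar to both profile items (0.4 + 0.6), so it ranks first.
        System.out.println(recommend(sims, Set.of(1L, 5L), 2)); // prints [3, 2]
    }
}
```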

Hope this helps.
