算法/库文本相似性 [英] algorithm/library for text similarity

查看：104 发布时间：2015/11/30 20:50:58 c++ algorithm text

本文介绍了算法/库文本相似性的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我需要实现算法（或找到一个在开放源码库）文本相似的评价。我需要一个有效的算法在给定任意两个组的文档（相对少量的文本的大块的）以创建它们之间的匹配对 - 这文件是最有可能被从其中之一产生

I need to implement algorithm (or find one in an open source library) for evaluation of text similarities. I need an efficient algorithm for given two arbitrary sets of documents (relatively small number of big chunks of text) it to create a matching pairs between them - which document is most likely to be produced from which one.

我相信我会分裂这两个 - 确定每对的相似系数 - 然后应用一些分配问题的算法。而对于分配算法，我可以找到解决的好一些，我不能找到一个很好的一个用于计算相似系数。

I believe I will split this in two - defining the similarity coefficient of every pair - and then applying some of the assignment problem algorithms. While for the assignment algorithms I can find a good number of solutions I cannot find a good one for the computing the similarity coefficients.

请注意的文档不预先已知的 - 文本的计算索引（如果有）一定要快，以及

Note the documents are not known in advance - computing indexes of the text (if there is) must be fast as well.

我知道汉明距离，Levenshtein距离的其他一些算法串差。这不是我所期待的，但 - 我使用这个词的文本字符串，而不是故意

I am aware of Hamming distance, Levenshtein distance some of the other algorithms for string difference. This is not what I am looking for though - I am using the word text instead string on purpose.

我不是在寻找词组搜索算法，以及什么样的Lucene和Xapian的图书馆是为做（至少看起来是）。

I am not looking for phrase search algorithms as well what libraries like Lucene and Xapian are made for (at least seems to be).

可能是一些基于TF-IDF。

Probably something based on tf–idf.

我想问题是，有一些已经解决了这个问题，或者是有可能的库，如使用lucete做到这一点。

I guess the question is, is there something that already solves this problem or is it possible libraries like lucete to be used to do that.

推荐答案

下面是我会做什么为出发点（只是因为它是简单和快速）：

Here is what I would do as a starting point (just because it is simple and fast):

在使用共享的地图或地图的hash_map的话数
对于每个文本，打造字级卦数对应的图
比较重叠

我们可以假定词典尺寸为＆lt; 1M线（由或21位），因此我们可以只连接在一个Int64 codeA卦。

We can assume that the dictionary size is < 1m (or 21bit), so we can just encode a trigram in an int64.

void CountTrigrams(const vector<string>& words, 
                   map<string, int> * dict, 
                   map<int64, int> * result) {
  int64 trigram = 0;
  for (int i = 0; i < words.size(); i++) {
    const& word = words[i];
    int id;
    auto di = dict->find(word);
    if (di == dict->end()) {
      id = dict.size();
      dict[word] = id;
    } else {
      id = di->second;
    }
    trigram = ((trigram << 21) | id) & 0x7fffffffffffffff;
    if (i > 2) {
      auto ti = result->find(trigram);
      if (ti == result->end()) {
        result[trigram] = 1;
      } else {
        ti->second++;
      }
    }
  }
}

然后比较结果每对：

Then compare the results for each pair:

int Compare(const map<int64, int> & t1, const map<int64, int> & t2) {
  int score = 0;
  for (auto i = t1.first(); i != t1.end(); i++) {
    auto j = t2.find(t1->first);
    if (j != t2.end()) {
      score += MAX(i->second, j->second);
    }
  }
  return score;
}

这可能是有意义的某种方式规范化分数，如通过卦的总数除以

It may make sense to normalize the score somehow, e.g. divide by the total number of trigrams.

这篇关于算法/库文本相似性的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

算法/库文本相似性 [英] algorithm/library for text similarity

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录关闭

算法/库文本相似性 [英] algorithm/library for text similarity

问题描述

推荐答案

相关文章

C/C++开发最新文章

热门教程

热门工具

登录 关闭

登录关闭