How does MongoDB handle document length in a text index and text score?


Problem description

I have a collection whose documents contain widely varying amounts of text, and it appears that documents with more text get significantly higher textScores. Of course, the more text a document has, the more often a keyword shows up in it. That, however, doesn't necessarily mean that it is more or less relevant than a document with less text.

Does anyone know how MongoDB accounts for the length or amount of text in a document when calculating the relevance?

I googled and scoured the MongoDB docs but can't find a descriptive answer.

Solution

Scoring is based on the number of stemmed matches, but there is also a built-in coefficient which adjusts the score for matches relative to total field length (with stopwords removed). If your longer text contains more words that match a query, this will add to the score. Extra text that does not match the query lowers the coefficient and therefore reduces the score.

Snippet from MongoDB 3.2 source code on GitHub (src/mongo/db/fts/fts_spec.cpp):

    for (ScoreHelperMap::const_iterator i = terms.begin(); i != terms.end(); ++i) {
        const string& term = i->first;
        const ScoreHelperStruct& data = i->second;

        // in order to adjust weights as a function of term count as it
        // relates to total field length. ie. is this the only word or
        // a frequently occuring term? or does it only show up once in
        // a long block of text?

        double coeff = (0.5 * data.count / numTokens) + 0.5;

        // if term is identical to the raw form of the
        // field (untokenized) give it a small boost.
        double adjustment = 1;
        if (raw.size() == term.length() && raw.equalCaseInsensitive(term))
            adjustment += 0.1;

        double& score = (*docScores)[term];
        score += (weight * data.freq * coeff * adjustment);
        verify(score <= MAX_WEIGHT);
    }
}
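
To get a feel for how the coefficient in that excerpt responds to field length, here is a minimal JavaScript sketch that evaluates only the `coeff` expression. The function name `lengthCoefficient` and the field lengths are made up for illustration; this is not MongoDB code, just the same arithmetic.

```javascript
// Length coefficient as written in the excerpt:
//   coeff = (0.5 * data.count / numTokens) + 0.5
// `count` is how often the term occurs in the field and `numTokens` is the
// number of tokens in the field. Both names mirror the excerpt; the function
// itself is a hypothetical helper, not part of MongoDB.
function lengthCoefficient(count, numTokens) {
  return (0.5 * count) / numTokens + 0.5;
}

// One occurrence of the term in fields of growing length (hypothetical sizes).
const fieldLengths = [1, 2, 3, 10, 100, 1000];
const coefficients = fieldLengths.map((n) => lengthCoefficient(1, n));

fieldLengths.forEach((n, i) => {
  console.log(`${n} tokens, 1 occurrence: coeff = ${coefficients[i].toFixed(4)}`);
});

// A field that consists only of the term yields 1, whatever its length.
const allMatching = lengthCoefficient(50, 50);
console.log(`50 of 50 tokens match: coeff = ${allMatching}`);
```

The coefficient stays between 0.5 and 1: a single match in a very long field is scaled down to roughly half, but never to zero. The excerpt also multiplies by `data.freq`, whose calculation is not shown, so this sketch says nothing about how repeated occurrences add up.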

Setting up some test data to see the effect of the length coefficient on a very simple example:

db.articles.insert([
    { headline: "Rock" },
    { headline: "Rocks" },
    { headline: "Rock paper" },
    { headline: "Rock paper scissors" },
])

db.articles.createIndex({ "headline": "text"})

db.articles.find(
    { $text: { $search: "rock" }},
    { _id:0, headline:1, score: { $meta: "textScore" }}
).sort({ score: { $meta: "textScore" }})

Annotated results:

// Exact match of raw term to indexed field
// Coefficient is 1, plus 0.1 bonus for identical match of raw term
{
  "headline": "Rock",
  "score": 1.1
}

// Match of stemmed term to indexed field ("rocks" stems to "rock")
// Coefficient is 1
{
  "headline": "Rocks",
  "score": 1
}

// Two terms, one matching
// Coefficient is 0.75: (0.5 * 1 match / 2 terms) + 0.5
{
  "headline": "Rock paper",
  "score": 0.75
}

// Three terms, one matching
// Coefficient is approximately 0.67: (0.5 * 1 match / 3 terms) + 0.5
{
  "headline": "Rock paper scissors",
  "score": 0.6666666666666666
}
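
The four scores above can be reproduced outside the database with a rough JavaScript sketch of the same arithmetic. Everything here is an approximation under stated assumptions: `naiveStem` is a crude stand-in for real stemming, stopword removal is skipped, and `freq` is fixed at 1 because the excerpt does not show how `data.freq` is derived, so the sketch only applies when the search term occurs once in the field.

```javascript
// Crude stand-in for a real stemmer (assumption): lowercase the word and
// strip a trailing "s" from words longer than three characters.
function naiveStem(word) {
  const lower = word.toLowerCase();
  return lower.length > 3 && lower.endsWith("s") ? lower.slice(0, -1) : lower;
}

// Hypothetical helper mirroring the per-term arithmetic in the excerpt.
function scoreField(raw, searchTerm, weight = 1) {
  const tokens = raw.split(/\s+/).filter(Boolean).map(naiveStem);
  const term = naiveStem(searchTerm);
  const count = tokens.filter((t) => t === term).length;
  if (count === 0) return 0;

  // Assumption: the term occurs once, so freq is taken to be 1.
  const freq = 1;
  const coeff = (0.5 * count) / tokens.length + 0.5;

  // Small boost when the raw field is identical to the term, ignoring case.
  let adjustment = 1;
  if (raw.length === term.length && raw.toLowerCase() === term.toLowerCase()) {
    adjustment += 0.1;
  }
  return weight * freq * coeff * adjustment;
}

const headlines = ["Rock", "Rocks", "Rock paper", "Rock paper scissors"];
const scores = headlines.map((h) => scoreField(h, "rock"));
headlines.forEach((h, i) => console.log(h, scores[i]));
```

Under these assumptions the sketch gives the same ranking as the query results: the exact raw match first, then the stemmed single-word match, then the two longer headlines in order of length.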
