How can I do this in painless script Elasticsearch 5.3


Problem description


We're trying to replicate this ES plugin https://github.com/MLnick/elasticsearch-vector-scoring. The reason is that AWS ES doesn't allow any custom plugin to be installed. The plugin just computes a dot product and cosine similarity, so I'm guessing it should be really simple to replicate in a painless script. It looks like Groovy scripting is deprecated in 5.0.

Here's the source code of the plugin.

    /**
     * @param params script parameters; the field and query vector to score with are initialized here.
     */
    @SuppressWarnings("unchecked")
    private PayloadVectorScoreScript(Map<String, Object> params) {
        params.entrySet();
        // get field to score
        field = (String) params.get("field");
        // get query vector
        vector = (List<Double>) params.get("vector");
        // cosine flag
        Object cosineParam = params.get("cosine");
        if (cosineParam != null) {
            cosine = (boolean) cosineParam;
        }
        if (field == null || vector == null) {
            throw new IllegalArgumentException("cannot initialize " + SCRIPT_NAME + ": field or vector parameter missing!");
        }
        // init index
        index = new ArrayList<>(vector.size());
        for (int i = 0; i < vector.size(); i++) {
            index.add(String.valueOf(i));
        }
        if (vector.size() != index.size()) {
            throw new IllegalArgumentException("cannot initialize " + SCRIPT_NAME + ": index and vector array must have same length!");
        }
        if (cosine) {
            // compute query vector norm once
            for (double v: vector) {
                queryVectorNorm += Math.pow(v, 2.0);
            }
        }
    }

    @Override
    public Object run() {
        float score = 0;
        // first, get the ShardTerms object for the field.
        IndexField indexField = this.indexLookup().get(field);
        double docVectorNorm = 0.0f;
        for (int i = 0; i < index.size(); i++) {
            // get the vector value stored in the term payload
            IndexFieldTerm indexTermField = indexField.get(index.get(i), IndexLookup.FLAG_PAYLOADS);
            float payload = 0f;
            if (indexTermField != null) {
                Iterator<TermPosition> iter = indexTermField.iterator();
                if (iter.hasNext()) {
                    payload = iter.next().payloadAsFloat(0f);
                    if (cosine) {
                        // doc vector norm
                        docVectorNorm += Math.pow(payload, 2.0);
                    }
                }
            }
            // dot product
            score += payload * vector.get(i);
        }
        if (cosine) {
            // cosine similarity score
            if (docVectorNorm == 0 || queryVectorNorm == 0) return 0f;
            return score / (Math.sqrt(docVectorNorm) * Math.sqrt(queryVectorNorm));
        } else {
            // dot product score
            return score;
        }
    }
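
Stripped of the Lucene plumbing, the score the `run()` method produces is just a dot product, optionally normalized into a cosine similarity. A minimal Python sketch of the same math (names are mine, not the plugin's):

```python
import math

def dot_product(doc_vector, query_vector):
    """Plain dot product over two equal-length vectors."""
    return sum(d * q for d, q in zip(doc_vector, query_vector))

def cosine_similarity(doc_vector, query_vector):
    """Dot product divided by both vector norms; returns 0 when a norm
    is 0, mirroring the plugin's guard."""
    dot = dot_product(doc_vector, query_vector)
    doc_norm = math.sqrt(sum(v * v for v in doc_vector))
    query_norm = math.sqrt(sum(v * v for v in query_vector))
    if doc_norm == 0 or query_norm == 0:
        return 0.0
    return dot / (doc_norm * query_norm)
```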

I'm trying to start by just getting a field from the index, but I'm getting an error.

Here's the shape of my index.

I've enabled the delimited_payload_filter:

"settings" : {
    "analysis": {
        "analyzer": {
            "payload_analyzer": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": "delimited_payload_filter"
            }
        }
    }
}

And I have a field called @model_factor to store a vector.

{
    "movies" : {
        "properties" : {
            "@model_factor": {
                            "type": "text",
                            "term_vector": "with_positions_offsets_payloads",
                            "analyzer" : "payload_analyzer"
                     }
        }
    }
}

And this is the shape of the document

{
    "@model_factor":"0|1.2 1|0.1 2|0.4 3|-0.2 4|0.3",
    "name": "Test 1"
}
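
The `@model_factor` value is a whitespace-separated list of `index|value` pairs. A small sketch of client-side code that could build it (function name is hypothetical):

```python
def to_payload_field(vector):
    """Encode a vector as 'index|value' tokens for the delimited_payload_filter."""
    return " ".join(f"{i}|{v}" for i, v in enumerate(vector))

doc = {
    "@model_factor": to_payload_field([1.2, 0.1, 0.4, -0.2, 0.3]),
    "name": "Test 1",
}
```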

Here's how I use the script

{
    "query": {
        "function_score": {
            "query" : {
                "query_string": {
                    "query": "*"
                }
            },
            "script_score": {
                "script": {
                    "inline": "def termInfo = doc['_index']['@model_factor'].get('1', 4);",
                    "lang": "painless",
                    "params": {
                        "field": "@model_factor",
                        "vector": [0.1,2.3,-1.6,0.7,-1.3],
                        "cosine" : true
                    }
                }
            },
            "boost_mode": "replace"
        }
    }
}

And this is the error I got.

"failures": [
      {
        "shard": 2,
        "index": "test",
        "node": "ShL2G7B_Q_CMII5OvuFJNQ",
        "reason": {
          "type": "script_exception",
          "reason": "runtime error",
          "caused_by": {
            "type": "wrong_method_type_exception",
            "reason": "wrong_method_type_exception: cannot convert MethodHandle(List,int)int to (Object,String)String"
          },
          "script_stack": [
            "termInfo = doc['_index']['@model_factor'].get('1',4);",
            "              ^---- HERE"
          ],
          "script": "def termInfo = doc['_index']['@model_factor'].get('1',4);",
          "lang": "painless"
        }
      }
    ]

The question is how do I access the index field to get @model_factor in painless scripting?

Solution

Option 1

Since @model_factor is a text field, it is possible to access it from a painless script by setting fielddata=true in the mapping. So the mapping should be:

{
    "movies" : {
        "properties" : {
            "@model_factor": {
                            "type": "text",
                            "term_vector": "with_positions_offsets_payloads",
                            "analyzer" : "payload_analyzer",
                            "fielddata" : true
                     }
        }
    }
}

And then it can be scored by accessing doc values:

{
    "query": {
        "function_score": {
            "query" : {
                "query_string": {
                    "query": "*"
                }
            },
            "script_score": {
                "script": {
                    "inline": "return Double.parseDouble(doc['@model_factor'].get(1)) * params.vector[1];",
                    "lang": "painless",
                    "params": {
                        "vector": [0.1,2.3,-1.6,0.7,-1.3]
                    }
                }
            },
            "boost_mode": "replace"
        }
    }
}

Problems with Option 1

So it is possible to access the field data by setting fielddata=true, but in this case the value returned is the vector index as a term, not the vector value, which is stored in the payload. Unfortunately, it looks like there is no way to access the token payload (where the real vector value is stored) using painless scripting and doc values. See the Elasticsearch source code and another similar question about accessing term info.

So the answer is that it is NOT possible to access the payload using painless scripting.

I also tried to store the vector values with a simple pattern tokenizer, but when accessing the term vector values the order is not preserved. This is probably the reason why the author of the plugin decided to use the term as a string: retrieve position 0 of the vector as the term "0", then find the real vector value in its payload.

Option 2

A very simple alternative is to use n fields in the document, each representing one position in the vector. In your example we have a 5-dim vector, with the values stored directly as doubles in v0...v4:

{
    "@model_factor":"0|1.2 1|0.1 2|0.4 3|-0.2 4|0.3",
    "name": "Test 1",
    "v0" : 1.2,
    "v1" : 0.1,
    "v2" : 0.4,
    "v3" : -0.2,
    "v4" : 0.3
} 
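
Such a document could be generated client-side from the raw vector; a hypothetical sketch:

```python
def to_field_doc(vector, name):
    """One numeric field per vector position (v0, v1, ...), as in Option 2."""
    doc = {f"v{i}": v for i, v in enumerate(vector)}
    doc["name"] = name
    return doc
```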

and then the painless script should be:

{
    "query": {
        "function_score": {
            "query" : {
                "query_string": {
                    "query": "*"
                }
            },
            "script_score": {
                "script": {
                    "inline": "return doc['v0'].getValue() * params.vector[0];",
                    "lang": "painless",
                    "params": {
                        "vector": [0.1,2.3,-1.6,0.7,-1.3]
                    }
                }
            },
            "boost_mode": "replace"
        }
    }
}

It should be easy to iterate over the input vector's length and fetch the fields dynamically to calculate the full dot product, generalizing the doc['v0'].getValue() * params.vector[0] that I wrote for simplicity.
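
One way to do that without writing a loop in painless is to unroll the sum client-side when building the request body; a hypothetical sketch:

```python
def dot_product_script(dim):
    """Build an unrolled painless expression for the dot product
    over fields v0..v{dim-1} against params.vector."""
    terms = [f"doc['v{i}'].getValue() * params.vector[{i}]" for i in range(dim)]
    return "return " + " + ".join(terms) + ";"
```

The generated string would be used as the `inline` value of the `script_score` above.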

Problems with Option2

Option 2 is viable as long as the vector dimension stays small. The default Elasticsearch maximum number of fields per document is 1000, but it can be changed, also in the AWS environment:

curl -X PUT \
  'https://.../indexName/_settings' \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
  "index.mapping.total_fields.limit": 2000
}'

Moreover, the script's speed should also be tested on a large number of documents. It may be a viable solution in re-scoring / re-ranking scenarios.

Option 3

The third option is really an experiment, and the most fascinating one in my opinion. It tries to exploit Elasticsearch's internal Vector Space Model representation, and uses no scripting at all for scoring: it reuses the default tf/idf-based similarity score.

Lucene, which sits at the core of Elasticsearch, already uses internally a modification of the cosine similarity to calculate the similarity score between documents in its Vector Space Model representation of terms, as the formula below, taken from the TFIDFSimilarity javadoc, shows:
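
The referenced formula is an image in the original post and does not survive in this copy; it is Lucene's practical scoring function, which per the TFIDFSimilarity javadoc is approximately:

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
\sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot
t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \Big)
```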

In particular, the weights of the vector representing the field are the tf/idf values of the terms of that field.

So we could index a document with term vectors, using the vector's position index as the term. By repeating a term N times, we represent the value N at that position, exploiting the tf part of the scoring formula. This means that the domain of the vector values should be transformed and rescaled into the {1..Infinite} positive-integer domain. We start from 1 so that we are sure all documents contain all the terms, which makes it easier to exploit the formula.

For example, the vector: [21, 54, 45] can be indexed as a field in a document using a simple whitespace analyzer and the following value:

{
    "@model_factor" : "0<repeated 21 times> 1<repeated 54 times> 2<repeated 45 times>",
    "name": "Test 1"
}
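
The repeated-term value can be generated mechanically from an already-rescaled positive-integer vector; a hypothetical sketch:

```python
def to_tf_field(int_vector):
    """Repeat each position index tf times, e.g. [2, 3] -> '0 0 1 1 1'.
    Option 3 assumes every value is >= 1, so every term appears."""
    return " ".join(" ".join([str(i)] * n) for i, n in enumerate(int_vector))
```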

then, to query, i.e. to calculate the dot product, we boost each single term that represents an index position of the vector.

So, using the same example as above, the input vector [45, 1, 1] will be transformed into the query:

"should": [
        {
          "term": {
            "@model_factor": {
              "value": "0",
              "boost": 45 
            }
          }
        },
        {
          "term": {
            "@model_factor": "1" // boost:1 by default

          }
        },
        {
          "term": {
            "@model_factor": "2"  // boost:1 by default
          }
        }
      ]
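
The should clause can likewise be built from the query vector; a hypothetical sketch (it always sets boost explicitly, which is equivalent to omitting it when the value is 1):

```python
def to_term_query(int_vector, field="@model_factor"):
    """One term per vector position, boosted by the (positive integer)
    query-vector value at that position."""
    should = [
        {"term": {field: {"value": str(i), "boost": boost}}}
        for i, boost in enumerate(int_vector)
    ]
    return {"query": {"bool": {"should": should}}}
```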

norm(t,d) should be disabled in the mapping so that it is not used in the formula above. The idf part is constant across all documents, because all of them contain all the terms (all vectors have the same dimension).
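
Norms can be disabled per field in the mapping; a sketch of what the Option 3 mapping might look like (the whitespace analyzer follows the example above):

```json
{
    "movies" : {
        "properties" : {
            "@model_factor": {
                "type": "text",
                "norms": false,
                "analyzer": "whitespace"
            }
        }
    }
}
```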

queryNorm(q) is the same for all the documents in the formula above so it is not a problem.

coord(q,d) is a constant because all the documents contain all the terms.

Problems with Option 3

Need to be tested.

It works only for vectors of positive numbers; see this question on math stackoverflow for making it work with negative numbers as well.

It is not exactly the same as a dot product, but it is very close for the purpose of finding similar documents based on raw vectors.

Scalability to large vector dimensions can be an issue at query time, because it means we need an N-dim terms query with different boosts.

I will try it in a test index and edit this question with the results.
