Lucene - custom analyzer/tokenizer to index JSON key pair values


Problem description


I'm aiming to store and index JSON key/value pairs. Ideally I would store them under a constant field name (for simplicity's sake, "GRADES").

An example of the incoming JSON object:

    "Data": [{
        "Key": "DP01",
        "Value": "Excellent"
    }, {
        "Key": "DP02",
        "Value": "Average"
    }, {
        "Key": "DP03",
        "Value": "Negative"
    }]

The JSON object would be serialized and stored as it is, but I would like to index it in a way to enable me to search within that same field by key and value. The main idea is to search multiple values within the same Lucene Field.

Any suggestions on how to structure the indexing? Let's imagine, for example, that I would like to search using the following query:

[GRADES: "key:DP01 UNIQUEIDasDELIMITER value:Excellent"]

How would a custom analyzer/tokenizer achieve this?

EDIT: An attempt to depict my goal more accurately.

Think of this typical relational type of structure (for simplicity's sake).

  • Each document is a website.

  • A website can have multiple images (and other important metadata).

  • Each image has multiple sets of free-form key/value pair properties:

    {
        "Key": "Scenery",
        "Value": "Nature"
    }, {
        "Key": "Style",
        "Value": "Vintage"
    }
    

  • Another set:

    {
        "Key": "Scenery",
        "Value": "Industrial"
    }, {
        "Key": "Style",
        "Value": "Vintage"
    }
    

My challenge is to take a similar type of structure and index it in a way that enables me to build queries such as:

A website with an image of scenery:industrial and style:vintage.

I'm probably taking the wrong approach as indicated by Andy Pook. Any ideas how to efficiently flatten out these properties?

Solution

A common "problem" is to think about indexes and documents as having a consistent set of fields. It is not the same as a relational database with tables of a fixed set of columns.

In a previous life I had an entity with a set of "attributes": a key/value collection (much like your grades).

Each document was created with a field named for each attribute, i.e. "attr-thing", with the value added as "NOT_ANALYZED".

So, in your example I'd create fields like

new Field("grade-"+gradeID, grade, Field.Store.NO, Field.Index.NOT_ANALYZED)

Then you can search with a query like "grade-DP01:excellent".

Alternatively, you can just have a fixed field name (similar to @cris-almodovar) and set the value to something like "id=grade", again NOT_ANALYZED. Then search for "grade:DP01=excellent".

Either will work. I've used both approaches with success but typically prefer the first.
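
As a rough illustration of the two layouts, here is a plain-Java sketch that only derives the field names and values; it does not call Lucene. The class and method names (`GradeFieldSketch`, `perKeyFields`, `fixedFieldTerm`, `normalize`) are hypothetical. Lower-casing the value is an assumption made here because a non-analyzed term is matched exactly, so "Excellent" and "excellent" would be different terms.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class GradeFieldSketch {
    // Approach 1: one field per key, e.g. "grade-DP01" -> "excellent".
    static Map<String, String> perKeyFields(Map<String, String> grades) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : grades.entrySet()) {
            fields.put("grade-" + e.getKey(), normalize(e.getValue()));
        }
        return fields;
    }

    // Approach 2: one fixed field name; each value is a single "id=grade" term.
    static String fixedFieldTerm(String key, String value) {
        return key + "=" + normalize(value);
    }

    // Assumption: values are lower-cased so queries can be written in lower case.
    static String normalize(String value) {
        return value.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Map<String, String> grades = new LinkedHashMap<>();
        grades.put("DP01", "Excellent");
        grades.put("DP02", "Average");
        grades.put("DP03", "Negative");

        // Field name/value pairs for approach 1.
        perKeyFields(grades).forEach((name, value) -> System.out.println(name + ":" + value));
        // Values for the fixed "grade" field in approach 2.
        grades.forEach((k, v) -> System.out.println("grade:" + fixedFieldTerm(k, v)));
    }
}
```

Each resulting pair would still have to be added to the document as a non-analyzed field, as in the `new Field(...)` line above.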

Additional notes in response to the edit...

I think I understand the problem... If you had "scenery=industrial style=vintage" and "scenery=nature style=modern" you wouldn't want it to match if you searched "nature AND vintage", right?
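
The cross-matching problem can be shown with a small plain-Java model (no Lucene involved). The class and method names are hypothetical, and the two `matches...` methods are only a simplified stand-in for how an index would behave when all properties are flattened into one bag of terms versus kept per set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CrossSetMatchSketch {
    // Flattened model: every term from every set lands in one bag per document.
    static boolean matchesFlattened(List<List<String>> sets, String... wanted) {
        Set<String> all = new HashSet<>();
        for (List<String> set : sets) {
            all.addAll(set);
        }
        return all.containsAll(Arrays.asList(wanted));
    }

    // Per-set model: all wanted terms must come from the same property set.
    static boolean matchesWithinOneSet(List<List<String>> sets, String... wanted) {
        for (List<String> set : sets) {
            if (set.containsAll(Arrays.asList(wanted))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> sets = Arrays.asList(
                Arrays.asList("scenery=industrial", "style=vintage"),
                Arrays.asList("scenery=nature", "style=modern"));

        // The flattened model matches even though no single image is both nature and vintage.
        System.out.println(matchesFlattened(sets, "scenery=nature", "style=vintage"));
        // The per-set model rejects the same query.
        System.out.println(matchesWithinOneSet(sets, "scenery=nature", "style=vintage"));
    }
}
```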

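You could add an "imageType" field for each set with a value like "scenery=industrial style=vintage abc=xyz", indexed with an analyzer that just splits on whitespace. (That is what WhitespaceAnalyzer does; KeywordAnalyzer would keep the whole value as a single token.)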

Then search with imageType:"scenery=industrial style=vintage"~2. Using a sloppy phrase guarantees that the values are in the same field, and the slop allows the order to differ or extra values to sit in between. You'd have to work out the slop value from the number of properties you expect in each field. Simplistically, if you expect a maximum of N values, then the slop should be N too.
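
A minimal sketch of how the per-set field value and the sloppy phrase query text might be assembled, in plain Java without Lucene calls. The names `ImageTypeSketch`, `fieldValue` and `sloppyQuery` are hypothetical, and lower-casing is an assumption so that the indexed terms line up with a lower-case query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ImageTypeSketch {
    // Joins one property set into a whitespace-separated value,
    // e.g. "scenery=industrial style=vintage".
    static String fieldValue(Map<String, String> propertySet) {
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, String> e : propertySet.entrySet()) {
            terms.add((e.getKey() + "=" + e.getValue()).toLowerCase(Locale.ROOT));
        }
        return String.join(" ", terms);
    }

    // Builds query text of the form field:"term1 term2"~slop.
    static String sloppyQuery(String field, int slop, String... terms) {
        return field + ":\"" + String.join(" ", terms) + "\"~" + slop;
    }

    public static void main(String[] args) {
        Map<String, String> set = new LinkedHashMap<>();
        set.put("Scenery", "Industrial");
        set.put("Style", "Vintage");

        // One such value would be indexed per property set.
        System.out.println(fieldValue(set));
        // Slop chosen as the assumed maximum number of properties per set (2 here).
        System.out.println(sloppyQuery("imageType", 2, "scenery=industrial", "style=vintage"));
    }
}
```

One point worth checking against the Lucene version in use: when several imageType values are added to the same document, their tokens share one position sequence, so the analyzer's position increment gap would presumably need to be larger than the slop to keep a phrase from matching across two sets.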
