Indexing JSON Object Arrays in Lucene.NET

Question

Apologies up front for a TL;DR question; it has evolved a bit over time. Hopefully someone cares to read it, and hopefully I won't be the only one who gets an answer in the end.

I am working on putting arbitrary JSON objects into a Lucene.NET index. Consider an object that might look like:

{
  name: "Tony",
  age: 40,
  address: {
     street: "Weakroad",
     number: 10,
     floor: 2,
     door: "Left"
  },
  skills: [ 
    { name: ".NET", level: 5, experience: 12 },
    { name: "JavaScript", level: 3, experience: 6 },
    { name: "HTML5", level: 4, experience: 6 },
    { name: "Lucene.NET", level: 1, experience: 12 },
    { name: "C#", level: 10, experience: 12 }
  ],
  aliases: [ "Bucks", "SirTalk", "BeemerBoy" ]
}

This would produce the following fields:

"name": "Tony"
"age": "40"
"address.street": "Weakroad"
"address.number": "10"
"address.floor": "2"
"address.door": "Left"
"skills": ???
"aliases": "Bucks SirTalk BeemerBoy" //should turn into 3 tokens.

As you may have noticed, skills has a ???, because right now I am not sure how to deal with it, or whether there even is a "meaningful, generic" way to do it.

Here are some options I have been able to think of:

1) Concatenation: but then, as far as I know, I lose the ability to run more advanced queries against Lucene, such as finding persons with .NET skills above level 4.

For clarification, concatenation could be something like:

"skills": ".NET, JavaScript, HTML5, Lucene.NET, C#"

This discards the numbers, as they wouldn't make much sense in this case. If a child object had additional string properties, those would be gathered as well. An alternative would be to concatenate each field independently:

"skills.name": ".NET, JavaScript, HTML5, Lucene.NET, C#"
"skills.level": "5, 3, 4, 1, 10"
"skills.experience": "12, 6, 6, 12, 12"

Again, the numbers don't make much sense here; I added them only to provide an example.
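To make the per-field concatenation concrete, here is a minimal sketch in plain Python. It only models the field layout and does not touch Lucene.NET; the helper name `concat_fields` is hypothetical and introduced for illustration.

```python
def concat_fields(prefix, items):
    """Join each property of the child objects into one string per field."""
    fields = {}
    for item in items:
        for key, value in item.items():
            # Collect every value under a dotted field name such as "skills.name".
            fields.setdefault(f"{prefix}.{key}", []).append(str(value))
    return {name: ", ".join(values) for name, values in fields.items()}


skills = [
    {"name": ".NET", "level": 5, "experience": 12},
    {"name": "JavaScript", "level": 3, "experience": 6},
]
fields = concat_fields("skills", skills)
print(fields["skills.name"])   # .NET, JavaScript
print(fields["skills.level"])  # 5, 3
```

Once the values are joined, nothing in the fields records that level 5 belongs to .NET rather than JavaScript, which is the loss described in option 1.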

2) Linked Documents: create a new document per array entry with a back reference to this document. This might work, but without newer features such as Nested Documents and BlockJoinQuery, which haven't been ported to the .NET version yet, it sounds really messy, and it sounds like it would tank performance. It would also kill the usefulness of document scoring, though I think that might be less of an issue.

Basically, a child document would contain a stored field acting as a foreign key; whenever a search found that document, we would pick up the referenced document instead.

So if we illustrate the documents, they would be:

//Primary Document - ContentType: Person
"$id": 1
"$doctype": Primary
"name": "Tony"
...etc
"skills": [ 2, 3 ] //Just a stored field for retrieving data

//Child Document - ContentType: Skill
"$id": 2
"$ref": 1
"$doctype": Secondary
"name": ".NET"
"level": 5
"experience": 12

//Child Document - ContentType: Skill
"$id": 3
"$ref": 1
"$doctype": Secondary
"name": "JavaScript"
"level": 3
"experience": 6

etc.

I added some meta fields ($id, $ref, $doctype) to the illustration.
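The linked-document idea can be sketched in plain Python as follows. This is an in-memory model under stated assumptions, not Lucene.NET code: the functions `split_documents` and `find_people` are hypothetical, and only the `$id`, `$ref` and `$doctype` field names come from the illustration above.

```python
from itertools import count


def split_documents(person, ids):
    """Turn one person into a primary document plus one child per skill."""
    primary = {k: v for k, v in person.items() if k != "skills"}
    primary.update({"$id": next(ids), "$doctype": "Primary", "skills": []})
    docs = [primary]
    for skill in person.get("skills", []):
        child = dict(skill)
        child.update({"$id": next(ids), "$ref": primary["$id"],
                      "$doctype": "Secondary"})
        primary["skills"].append(child["$id"])  # stored only for retrieval
        docs.append(child)
    return docs


def find_people(docs, skill_name, min_level):
    """Match child documents, then follow $ref back to the primary."""
    by_id = {d["$id"]: d for d in docs}
    hits = [d for d in docs
            if d["$doctype"] == "Secondary"
            and d["name"] == skill_name and d["level"] >= min_level]
    return [by_id[d["$ref"]]["name"] for d in hits]


tony = {"name": "Tony", "skills": [
    {"name": ".NET", "level": 5, "experience": 12},
    {"name": "JavaScript", "level": 3, "experience": 6},
]}
docs = split_documents(tony, count(1))
print(find_people(docs, ".NET", 4))        # ['Tony']
print(find_people(docs, "JavaScript", 4))  # []
```

Because name and level live in the same child document, the pairing survives. The cost is the extra lookup from child to parent on every hit, which is the messiness and performance concern raised above.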

3) A third option I have found since is to index the properties as multiple fields with the same name, so the example above would result in:

// index: 0
"skills.name": ".NET"
"skills.level": 5
"skills.experience": 12
// index: 1
"skills.name": "JavaScript"
"skills.level": 3
"skills.experience": 6
// index: 2
"skills.name": "HTML5"
"skills.level": 4
"skills.experience": 6
// index: 3
"skills.name": "Lucene.NET"
"skills.level": 1
"skills.experience": 12
// index: 4    
"skills.name": "C#"
"skills.level": 10
"skills.experience": 12

This is supported by Lucene.NET, yet it still leaves me unable to run a query like [skills.name: ".NET" AND skills.level: [3 TO 5]] with the intended meaning, because each clause is matched against all values of its field independently.
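The multi-valued layout of option 3 can be sketched in plain Python as a flattening step that emits repeated field names. This is only a model of the resulting field list, not Lucene.NET code, and the function name `flatten` is hypothetical.

```python
def flatten(value, prefix=""):
    """Yield (field, value) pairs; array items reuse the same field name."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for item in value:
            # No index is added to the name, so fields repeat per array entry.
            yield from flatten(item, prefix)
    else:
        yield prefix, value


person = {
    "name": "Tony",
    "address": {"street": "Weakroad", "number": 10},
    "skills": [
        {"name": ".NET", "level": 5},
        {"name": "JavaScript", "level": 3},
    ],
    "aliases": ["Bucks", "SirTalk"],
}
for field, value in flatten(person):
    print(f"{field}: {value!r}")
```

The same routine handles nested objects, arrays of strings and arrays of objects, which is what makes this option attractive for arbitrary JSON.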

But since this does allow me to search in the fields separately, I might be able to solve the other issue in another way by:

  • a) adding an extra combined field.
  • b) doing post-validation of the results in a collector.
  • c) a combination of the above.
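One possible reading of the "extra combined field" in a) is a field whose tokens keep the name and level of a single skill together. The sketch below is an assumption for illustration, in plain Python rather than Lucene.NET; the token format `name|level` and the helpers `combined_tokens` and `matches` are hypothetical.

```python
def combined_tokens(skills):
    """Build one token per skill that keeps name and level together."""
    return {f"{s['name']}|{s['level']}" for s in skills}


def matches(tokens, name, low, high):
    """Expand the level range into exact tokens and test for any overlap."""
    wanted = {f"{name}|{level}" for level in range(low, high + 1)}
    return bool(tokens & wanted)


tokens = combined_tokens([
    {"name": ".NET", "level": 1},
    {"name": "JavaScript", "level": 3},
    {"name": "HTML5", "level": 5},
])
print(matches(tokens, "JavaScript", 1, 5))  # True
print(matches(tokens, "JavaScript", 4, 5))  # False
```

Expanding a range into exact tokens only works for small integer ranges, so this would suit fields like level but not open-ended numeric data.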

It all depends on the data. Obviously, relying only on post-validation for data like the above would yield really bad results, as I am likely to get a lot of false hits. It will still filter out people without .NET skills, however, which is a good thing.

But at least so far I am a step closer, I think.

Taking the scenario above, we can now have (shortened greatly):

[{
  name: "Tony",
  skills: [ 
    { name: ".NET",       level: 1 },
    { name: "JavaScript", level: 3 },
    { name: "HTML5",      level: 5 }
  ]
 },
 {
  name: "Peter",
  skills: [ 
    { name: ".NET",       level: 5 },
    { name: "HTML5",      level: 3 },
    { name: "Lucene.NET", level: 1 }
  ]
 },
 {
  name: "Marilyn",
  skills: [ 
    { name: "JavaScript", level: 5 },
    { name: "HTML5",      level: 3 },
    { name: "Node",       level: 1 }
  ]
 }]

What we get is 3 documents with repeated fields for skills.name and skills.level, and that's fine. I can actually search for { skills.name: 'JavaScript', skills.level: [1 TO 5] }, which correctly returns Marilyn and Tony.

But if I search for { skills.name: 'JavaScript', skills.level: [4 TO 5] }, I obviously still get both of them with this way of structuring the document, where I should only have gotten Marilyn as a result. Tony matches because his HTML5 skill is at level 5, even though his JavaScript level is 3.

Hence the need for post-filtering that will reject Tony as an actual match.
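The false hit and the post-filter can be reproduced with a small in-memory model in plain Python. It mimics how independent clauses behave over multi-valued fields and is not Lucene.NET code; `naive_search` and `post_filter` are hypothetical names.

```python
people = [
    {"name": "Tony", "skills": [
        {"name": ".NET", "level": 1},
        {"name": "JavaScript", "level": 3},
        {"name": "HTML5", "level": 5}]},
    {"name": "Peter", "skills": [
        {"name": ".NET", "level": 5},
        {"name": "HTML5", "level": 3},
        {"name": "Lucene.NET", "level": 1}]},
    {"name": "Marilyn", "skills": [
        {"name": "JavaScript", "level": 5},
        {"name": "HTML5", "level": 3},
        {"name": "Node", "level": 1}]},
]


def field_values(person, key):
    """All values stored under the repeated field skills.<key>."""
    return [s[key] for s in person["skills"]]


def naive_search(people, name, low, high):
    """Each clause is checked against its field as a whole, as in option 3."""
    return [p for p in people
            if name in field_values(p, "name")
            and any(low <= lvl <= high for lvl in field_values(p, "level"))]


def post_filter(hits, name, low, high):
    """Keep only hits where a single skill satisfies both clauses."""
    return [p for p in hits
            if any(s["name"] == name and low <= s["level"] <= high
                   for s in p["skills"])]


hits = naive_search(people, "JavaScript", 4, 5)
print([p["name"] for p in hits])                                   # ['Tony', 'Marilyn']
print([p["name"] for p in post_filter(hits, "JavaScript", 4, 5)])  # ['Marilyn']
```

The post-filter needs the original skill objects (or an equivalent stored form) for every hit, so its cost grows with the number of false hits the first pass lets through.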

Recommended answer

For now I ended up accepting the limitations of Solution 3. The rationale is that if data needs to be queried in that way, it should be structured differently in the index (in line with Solution 2).

But I have chosen to move that decision outside of a possible framework handling this. As a result I have created https://github.com/dotJEM/json-index
