Adding child documents to existing Solr 6.4 collection documents creates duplicate documents


Question

This question is similar to "Solr doesn't overwrite - duplicated uniqueKey entries", but in my case I have a large body of existing documents that were already added to the collection with no child documents, and I am using Solr 6.4 (standalone, not cloud) rather than 5.3.1. We recently enabled child documents so that we could store richer data.

We use SolrJ to load data into and query Solr, but to isolate the issue we're seeing, I used the command line Solr post tool to upload the following document:

<add>
    <doc>
        <field name="id">1</field>
        <field name="solr_record_type">1</field>
        <field name="title">Fabulous Book</field>
        <field name="author">Angelo Author</field>
    </doc>
</add>

Search results were as expected, using q=id:1 and fl=id,title,index_date,[child parentFilter="solr_record_type:1"]:

 "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"1",
        "title":"Fabulous Book",
        "index_date":"2019-01-16T23:06:57.221Z"}]
  }
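For readers who want to reproduce the query outside the admin UI, the request above can be assembled as a URL. This is only a sketch: the base URL and collection name (`http://localhost:8983/solr/books`) and the helper name `select_url` are assumptions, and `wt=json` is added on the assumption that the JSON output shown was requested explicitly.

```python
from urllib.parse import urlencode


def select_url(base_url: str, doc_id: str, parent_filter: str) -> str:
    """Build a /select URL that fetches one parent and its child documents."""
    params = {
        "q": f"id:{doc_id}",
        # The [child] transformer attaches child documents to each parent hit.
        "fl": f'id,title,index_date,[child parentFilter="{parent_filter}"]',
        "wt": "json",  # assumption: JSON response format requested explicitly
    }
    # urlencode percent-encodes the brackets, quotes and spaces in fl.
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and collection name, for illustration only.
print(select_url("http://localhost:8983/solr/books", "1", "solr_record_type:1"))
```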

Then I updated the document by posting the following:

<add>
    <doc>
        <field name="id">1</field>
        <field name="solr_record_type">1</field>
        <field name="title">Fabulous Book</field>
        <field name="author">Angelo Author</field>
        <doc>
            <field name="id">1-1</field>
            <field name="solr_record_type">2</field>
            <field name="contributor_name">Polly Math</field>
            <field name="contributor_type">3</field>
        </doc>
    </doc>
</add>

Then, repeating my search, I got the following duplicate result, searching on the unique id field, which is undesirable.

    "response":{"numFound":2,"start":0,"docs":[
      {
        "id":"1",
        "title":"Fabulous Book",
        "index_date":"2019-01-16T23:06:57.221Z",
        "_childDocuments_":[
        {
          "id":"1-1",
          "solr_record_type":2,
          "contributor_name":"Polly Math",
          "contributor_type":3,
          "index_date":"2019-01-16T23:09:29.142Z"}]},
      {
        "id":"1",
        "title":"Fabulous Book",
        "index_date":"2019-01-16T23:09:29.142Z",
        "_childDocuments_":[
        {
          "id":"1-1",
          "solr_record_type":2,
          "contributor_name":"Polly Math",
          "contributor_type":3,
          "index_date":"2019-01-16T23:09:29.142Z"}]}]
  }
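With a large collection it may help to check responses for this symptom programmatically. The following is a minimal sketch, assuming the response has already been parsed from JSON into a dict with the usual `response.docs` layout; the helper name `duplicate_ids` is hypothetical.

```python
from collections import Counter


def duplicate_ids(response: dict) -> list:
    """Return the ids that appear more than once among the top-level docs."""
    counts = Counter(doc["id"] for doc in response["response"]["docs"])
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)


# Trimmed version of the duplicate result shown above.
sample = {
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"id": "1", "title": "Fabulous Book"},
            {"id": "1", "title": "Fabulous Book"},
        ],
    }
}
print(duplicate_ids(sample))
```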

Going the other way, if I start with a document that was loaded initially with a child document, like the following:

<add>
    <doc>
        <field name="id">2</field>
        <field name="solr_record_type">1</field>
        <field name="title">Wonderful Book</field>
        <field name="author">Andy Author</field>
        <doc>
            <field name="id">2-1</field>
            <field name="solr_record_type">2</field>
            <field name="contributor_name">Polly Math</field>
            <field name="contributor_type">3</field>
        </doc>
    </doc>
</add>

And then I update it with a document with no children:

<add>
    <doc>
        <field name="id">2</field>
        <field name="solr_record_type">1</field>
        <field name="title">Wonderful Book</field>
        <field name="author">Andy Author</field>
    </doc>
</add>

The result still has the child:

  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"2",
        "title":"Wonderful Book",
        "index_date":"2019-01-16T23:09:39.389Z",
        "_childDocuments_":[
        {
          "id":"2-1",
          "title_id":2,
          "title_instance_id":2,
          "solr_record_type":2,
          "contributor_name":"Polly Math",
          "contributor_type":3,
          "index_date":"2019-01-16T23:07:04.861Z"}]}]
  }

This is strange because if I update a document with 2 child documents with a replacement document with only 1 child document, it does drop one child document. But in this case, it is not dropping the child document.

Updates of documents with no child documents that don't add child documents, and updates of documents with child documents that don't remove all child documents both seem to work as I'd expect.

I have a large body of existing documents that don't have children, which I may be adding children to, and eventually I may have a lot of child-having documents that might drop their children. Given that, what is the best way to update these records without generating duplicate records or losing updates?

Recommended answer

I would strongly advise avoiding Solr parent/child relationships. We decided to use them in Solr 5.3.1, and it turns out that although much of the functionality is there, a number of nasty bugs have been present in Solr since 4.x and remain unfixed, including:

  • SOLR-6096: Support Update and Delete on nested documents
  • SOLR-5211: updating parent as childless makes old children orphans (UPDATE: fixed in 8.0)
  • SOLR-6596: Atomic update and adding child doc not working together
  • SOLR-5772: duplicate documents between solr "block join" documents and "normal" document
  • SOLR-10030: SolrClient.getById() method in Solrj doesn't retrieve child documents

For those reasons, I strongly recommend avoiding child documents if at all possible. Even if those issues don't hit you now, they will at some point in the future, and given that they have not been fixed in 3 to 4 major versions, it's clear that there is no real support in the product for child documents. Sorry to be the bearer of bad news, but hopefully someone can learn from our experience.
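If child documents cannot be avoided, a commonly described workaround for the two cases in the question is to delete the parent and everything in its block before re-adding it. This is not part of the answer above. It is a minimal sketch that assumes the schema defines the `_root_` field, which Solr sets on the documents of a block to the parent's id. The helper names are hypothetical, and the code only builds XML update messages like those in the question; sending them (post tool, SolrJ, HTTP) is left out.

```python
from xml.sax.saxutils import escape, quoteattr


def delete_block_xml(parent_id: str) -> str:
    """Build a <delete> message matching the parent and any docs in its block."""
    # Quote the id as a phrase, escaping backslashes and double quotes.
    quoted = '"' + parent_id.replace("\\", "\\\\").replace('"', '\\"') + '"'
    # id:... matches a stand-alone parent; _root_:... matches block members.
    query = f"id:{quoted} OR _root_:{quoted}"
    return f"<delete><query>{escape(query)}</query></delete>"


def doc_xml(fields: dict, children=()) -> str:
    """Serialize one document, nesting child documents inside the parent."""
    parts = [
        f"<field name={quoteattr(name)}>{escape(str(value))}</field>"
        for name, value in fields.items()
    ]
    parts += [doc_xml(child) for child in children]
    return "<doc>" + "".join(parts) + "</doc>"


def replace_block_messages(parent: dict, children=()) -> list:
    """Return the delete message followed by the add message, in send order."""
    return [
        delete_block_xml(str(parent["id"])),
        "<add>" + doc_xml(parent, children) + "</add>",
    ]


for message in replace_block_messages(
    {"id": "1", "solr_record_type": 1, "title": "Fabulous Book"},
    [{"id": "1-1", "solr_record_type": 2, "contributor_name": "Polly Math"}],
):
    print(message)
```

The delete must be sent before the add, and both should be covered by the same commit so searchers do not see the gap. The two messages are not applied atomically, so this does not remove the underlying limitations listed in the issues above.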
