Is there a way to exclude NULL values from Azure Cognitive Search Indexes


Question

For example, we have fields 1 through 10. I want to index all of the fields in Azure Search so that you can filter and search on them.

My question: is there a way to exclude the fields that are NULL for a specific ID, so that they are not stored in Azure Search? See the example below.

The data itself is initially stored in Azure Cosmos DB, where it looks like this:

  • Id 1
  • field 1: a
  • field 2: b
  • field 5: c
  • field 6: d
  • field 8: e


  • Id 2
  • field 3: a
  • field 2: b
  • field 5: c
  • field 9: d
  • field 10: e

However, in the Azure Search index it looks like this:

  • Id 1
  • field 1: a
  • field 2: b
  • field 3: NULL
  • field 4: NULL
  • field 5: c
  • field 6: d
  • field 7: NULL
  • field 8: e
  • field 9: NULL
  • field 10: NULL


  • Id 2
  • field 1: NULL
  • field 2: b
  • field 3: a
  • field 4: NULL
  • field 5: c
  • field 6: NULL
  • field 7: NULL
  • field 8: NULL
  • field 9: d
  • field 10: e

Solution

The shortest answer to your question is "no", but it's a little deeper than that.

When you add documents to an Azure Cognitive Search index, the values of each field are stored in a data structure called an inverted index. This stores a dictionary of terms found in the field, and each entry contains a list of document IDs containing that term. It is somewhat similar to a column-oriented database in that regard. The null value that you see in document JSON is never actually stored in the inverted index. This can make it expensive to test whether a field is null, since the query needs to look for all document IDs not contained in the inverted index, but it is perfectly efficient in terms of storage (because it doesn't consume any).
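
To make the idea concrete, here is a toy sketch of a per-field inverted index in Python. It is not how Azure Cognitive Search is implemented internally; the helper names `build_inverted_index` and `ids_with_null` and the sample documents are made up for illustration, and each field value is treated as a single term. It only shows why a null leaves no trace in the index and why a "is null" test has to work by exclusion.

```python
from collections import defaultdict


def build_inverted_index(docs, field):
    """Map each term found in `field` to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        value = doc.get(field)
        if value is None:
            continue  # null or missing: nothing is written for this document
        index[value].add(doc_id)
    return dict(index)


def ids_with_null(docs, index):
    """Documents with a null field are those absent from every posting list."""
    seen = set().union(*index.values()) if index else set()
    return set(docs) - seen


# Hypothetical documents modeled on the question's example.
docs = {
    1: {"field1": "a", "field2": "b", "field3": None},
    2: {"field2": "b", "field3": "a"},
}

field3_index = build_inverted_index(docs, "field3")
print(field3_index)                       # only document 2 appears
print(ids_with_null(docs, field3_index))  # document 1 is found by exclusion
```

Note that an explicit `None` (document 1) and a missing key (document 2, `field1`) are treated the same way in this sketch: neither adds an entry, so neither costs any space in the index for that field.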

This article has a few simplified examples of how inverted indexes work, although it's about a different topic than your question.

Your broader concern about having many fields defined in your index is a valid one. There is a tradeoff between schema flexibility and resource utilization as you increase the number of fields in your index. However, this is due to the bookkeeping overhead required for each field, not the "number of nulls in the field" (which doesn't really mean anything since nulls aren't stored).

From your question, it sounds like you're trying to model different "entity types" in the same index, resulting in a sparse index where some subset of the documents have one subset of fields defined, while another subset of documents have different fields defined. This is a scenario that we want to better support in the service. One promising future direction could be supporting multi-index query, so each subset of your schema could have its own index with its own distinct (but perhaps overlapping) set of fields. This is not on our immediate roadmap, but it's something we want to investigate further. Please vote on this User Voice item to help us prioritize.
