将减少的ES查询结果传递给SQL [英] Passing reduced ES query results to SQL

查看:157
本文介绍了将减少的ES查询结果传递给SQL的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这是如何将ElasticSearch查询传递给hadoop的后续问题



基本上,我想在ElasticSearch中进行全文搜索,然后将结果集传递给SQL以运行聚合查询。以下是一个例子:



我们在财务数据库中搜索具有10B记录的终结者。它有以下比赛:




  • 终结者(1M个结果)

  • 终结者2 (10M结果)

  • XJ4-227(1结果==>这里的终结者在标题的摘要中)



而不是传回10 + M ids,我们会传回以下缩减查询 -



<$ p $ ('Terminator','Terminator 2','XJ4-227')

我们如何编写这样的算法来将ES结果集减少到可以发送回SQL的最小可能的过滤器查询? ES有什么样的匹配元数据可以帮助我们吗?

解决方案

如果你知道哪个未分析关键字在5.x)字段将适合您的用例,您可以通过术语汇总 sum_other_doc_count 甚至告诉您搜索是否导致太多不同的值,因为只返回前N个。



您可以在多个字段上运行术语聚合,并使用SQL中具有最少不同值的一个。实际上,首先运行基数聚合,知道您应该运行术语聚合到哪个字段。



如果您的搜索是纯过滤器,那么其结果应该缓存,请您对两个解决方案进行基准测试,因为您的ES群集有相当多的数据。


This is a follow-up question to How to pass ElasticSearch query to hadoop.

Basically, I want to do a full-text-search in ElasticSearch and then pass the result set to SQL to run an aggregation query. Here's an example:

Let's say we search "Terminator" in a financials database that has 10B records. It has the following matches:

  • "Terminator" (1M results)
  • "Terminator 2" (10M results)
  • "XJ4-227" (1 result ==> Here "Terminator" is in the synopsis of the title)

Instead of passing back the 10+M ids, we'd pass back the following 'reduced query' --

...WHERE name in ('Terminator', 'Terminator 2', 'XJ4-227')

How could we write such an algorithm to reduce the ES result set to a smallest possible filter query that we could send back to SQL? Does ES have any sort of match-metadata that would help us in this?

解决方案

If you know that which "not analyzed" (keyword at 5.x) field would be suitable for your use case you could get their distinct values and number of matches by terms aggregation. sum_other_doc_count even tells you if your search resulted in too many distinct values, as only top N are returned.

Naturally you could run terms aggregation on multiple fields and use the one in SQL which had fewest distinct values. And actually it could be more efficient to first run cardinality aggregation to know to which field you should run terms aggregation.

If your search is a pure filter then its result should be cached but please benchmark both solutions as your ES cluster has quite a lot of data.

这篇关于将减少的ES查询结果传递给SQL的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆