Sphinx/Solr/Lucene/Elastic Relevancy


Question

We have an extremely large database of 30+ million products, and we need to query it thousands of times a second to build search results and ad displays. We have been looking into Sphinx, Solr, Lucene, and Elastic as options for this constant, high-volume search load.

Here's what we need to do: take keywords and run them against the database to find the products that match most closely. We're going to use our OWN algorithm to decide which products are most relevant for targeting our advertisements, but we know that these engines already have their own relevancy algorithms.

So our question is: how can we efficiently use our own algorithms on top of the engine's? Is it possible to add them to the engines themselves as a module of some sort? Or would we have to rewrite the engine's relevancy code? I suppose we could implement the algorithm in the application by executing multiple queries, but that would really kill efficiency.

Also, we'd like to know which search solution would work best for us. Right now we're leaning towards Sphinx, but we're really not sure.

Finally, would you recommend running these engines over MySQL, or would it be better to run them over some type of key-value store like Cassandra? Keep in mind that there are 30 million records, and that number is likely to double as we move along.

Thanks for your replies!

Recommended answer

I can't give you a complete answer, as I haven't used all of these products, but I can say a few things that might help.

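  1. Lucene/Solr uses a vector space model. I'm not sure what you mean by using your "own" algorithm, but if it strays too far from the notion of tf/idf (say, by using a neural net), you are going to have difficulty fitting it into Lucene. If by your own algorithm you just mean that you want to weight certain terms more heavily than others, you'll be fine. Basically, Lucene stores information about how important a term is to a document. If you want to redefine how that importance is calculated, that's easily doable. If you want to get away from the whole notion of a term's importance to a document, that's going to be a pain.
  2. Lucene (and, as a result, Solr) stores things in its own custom format. You don't need to use a database. 30 million records is not a remarkably large Lucene index (depending, of course, on how big each record is). If you do want to use a database, use Hadoop.
  3. In general, you will want to use Solr instead of Lucene.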
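
To make the first point concrete, here is a minimal sketch in plain Python of what "weighting certain terms more heavily" means inside a tf/idf-style score. It is a toy illustration, not Lucene's actual scoring formula or API: the function name `tf_idf_scores`, the `boosts` parameter, and the sample product titles are all assumptions made up for this example.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs, boosts=None):
    """Return one boosted tf-idf score per document (toy model, not Lucene's)."""
    boosts = boosts or {}
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # term absent from this document
            # Rarer terms get a higher inverse document frequency.
            idf = math.log(1 + n_docs / df[term])
            # The boost is the hook for a custom notion of term importance.
            score += boosts.get(term, 1.0) * math.sqrt(tf[term]) * idf
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = ["red running shoes", "red leather wallet", "blue running jacket"]
    print(tf_idf_scores(["red", "running"], docs))
    print(tf_idf_scores(["red", "running"], docs, boosts={"running": 3.0}))
```

Without boosts, the wallet and the jacket tie because each matches one query term. With "running" boosted, the jacket outranks the wallet. A custom algorithm that can be expressed as this kind of per-term adjustment fits the model the answer describes; one that ignores term importance altogether does not.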

I have found it very easy to modify Lucene. But as my first point said, if you want to use an algorithm that's not based on some notion of a term's importance to a document, I don't think Lucene will be the way to go.
