How to setup Lucene/Solr for a B2B web app?



Given:

  • 1 database per client (business customer)
  • 5000 clients
  • Clients have between 2 to 2000 users (avg is ~100 users/client)
  • 100k to 10 million records per database
  • Users need to search those records often (it's the best way to navigate their data)

Possibly relevant info:

  • Several new clients each week (any time during business hours)
  • Multiple web servers and database servers (users can login via any web server)
  • Let's stay agnostic of language or sql brand, since Lucene (and Solr) have a breadth of support

For Example:

Joel Spolsky said in Podcast #11 that his hosted web app product, FogBugz On-Demand, uses Lucene. He has thousands of on-demand clients. And each client gets their own database.

They use an index per client and store it in the client's database. I'm not sure on the details. And I'm not sure if this is a serious mod to Lucene.

The Question:

How would you setup Lucene search so that each client can only search within its database?

How would you setup the index(es)?
Where do you store the index(es)?
Would you need to add a filter to all search queries?
If a client cancelled, how would you delete their (part of the) index? (this may be trivial--not sure yet)

Possible Solutions:

Make an index for each client (database)

  • Pro: Search is faster (than one-index-for-all method). Indices are relative to the size of the client's data.
  • Con: I'm not sure what this entails, nor do I know if this is beyond Lucene's scope.
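
A minimal sketch of what the index-per-client option could look like with plain Lucene (the directory layout, field names, and class are hypothetical, just for illustration): each client's index lives under its own path, so a searcher opened for one client physically cannot see another client's data.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PerClientIndex {

    // Hypothetical layout: one index directory per client database.
    private static Path indexPathFor(String clientDbName) {
        return Paths.get("/var/search-indexes", clientDbName);
    }

    public static void addRecord(String clientDbName, String recordText) throws IOException {
        // Opening a writer per call keeps the sketch short; in practice you
        // would keep a long-lived IndexWriter per client index.
        try (FSDirectory dir = FSDirectory.open(indexPathFor(clientDbName));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("body", recordText, Field.Store.YES));
            writer.addDocument(doc);
        }
    }

    public static IndexSearcher searcherFor(String clientDbName) throws IOException {
        // The caller only ever gets a searcher over its own client's directory.
        return new IndexSearcher(DirectoryReader.open(FSDirectory.open(indexPathFor(clientDbName))));
    }
}
```

With this layout, deleting a cancelled client reduces to removing that client's index directory.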

Have a single, gigantic index with a database_name field. Always include database_name as a filter.

  • Pro: Not sure. Maybe good for tech support or billing dept to search all databases for info.
  • Con: Search is slower (than index-per-client method). Flawed security if query filter removed.
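
For comparison, a minimal sketch of the single-shared-index option (field and class names are again hypothetical): every document is tagged with its client's database name at index time, every query is wrapped in a non-scoring filter on that field at query time, and a cancelled client's slice can be dropped by deleting on the same term.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

import java.io.IOException;

public class SharedIndexTenancy {

    // Index time: tag every record with its owning client database.
    public static Document buildDocument(String clientDbName, String recordText) {
        Document doc = new Document();
        doc.add(new StringField("database_name", clientDbName, Field.Store.NO)); // exact-match filter field
        doc.add(new TextField("body", recordText, Field.Store.YES));
        return doc;
    }

    // Query time: AND the user's query with a non-scoring filter on database_name.
    // clientDbName must come from the authenticated session, never from user input.
    public static Query scopeToClient(Query userQuery, String clientDbName) {
        return new BooleanQuery.Builder()
                .add(userQuery, BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("database_name", clientDbName)), BooleanClause.Occur.FILTER)
                .build();
    }

    // If a client cancels, their slice of the shared index can be removed by the same term.
    public static void deleteClient(IndexWriter writer, String clientDbName) throws IOException {
        writer.deleteDocuments(new Term("database_name", clientDbName));
    }
}
```

The filter has to be added server-side on every query, which is exactly the "flawed security if query filter removed" risk noted above.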

One last thing:
I would also accept an answer that uses Solr (the extension of Lucene). Perhaps it's better suited for this problem. Not sure.

Solution


You summoned me from the FogBugz StackExchange. My name is Jude, I'm the current search architect for FogBugz.

Here's a rough outline of how the FogBugz On Demand search architecture is set up[1]:

  • For reasons related to data portability, security, etc., we keep all of our On Demand databases and indices separate.
  • While we do use Lucene (Lucene.NET, actually), we've modded its backend fairly substantially so that it can store its index entirely in the database. Additionally, a local cache is maintained on each webhost so that unnecessary database hits can be avoided whenever possible.
  • Our filters are almost entirely database-side (since they're used by aspects of FogBugz outside of search), so our search parser separates queries into full-text and non-full-text components, executes the lookups, and combines the results. This is a little unfortunate, as it voids many useful optimizations that Lucene is capable of making.
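
To make that last bullet concrete, here is a toy illustration of the split-and-merge idea. It is not FogBugz's actual code: the fields, the stored record_id convention, and the StructuredLookup hook are invented for the example. The structured part of a query is resolved against the database, the free-text part against Lucene, and the two ID sets are intersected.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class SplitSearch {

    // Hypothetical hook: the non-full-text component is answered by SQL,
    // e.g. "SELECT id FROM cases WHERE status = 'open' AND assignee = 42".
    interface StructuredLookup {
        Set<Long> matchingIds(String structuredFilters);
    }

    public static Set<Long> search(Directory index, String fullTextPart,
                                   String structuredFilters, StructuredLookup db)
            throws IOException, ParseException {
        Set<Long> fromDatabase = db.matchingIds(structuredFilters);

        Set<Long> fromLucene = new HashSet<>();
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("body", new StandardAnalyzer());
            ScoreDoc[] hits = searcher.search(parser.parse(fullTextPart), 1000).scoreDocs;
            for (ScoreDoc hit : hits) {
                Document doc = searcher.doc(hit.doc);
                fromLucene.add(Long.valueOf(doc.get("record_id"))); // stored DB primary key
            }
        }

        // "Combine the results" = intersect the database hits with the Lucene hits.
        fromLucene.retainAll(fromDatabase);
        return fromLucene;
    }
}
```

As the bullet notes, doing the merge outside Lucene forfeits optimizations Lucene could otherwise apply.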

There are a few benefits to what we've done. Managing the accounts is quite simple, since client data and their index are stored in the same place. There are some negatives too, though, such as a set of really pesky edge case searches which underperform our minimum standards. Retrospectively, our search was cool and well done for its time. If I were to do it again, however, I would discourage this approach.

Simply, unless your search domain is very special or you're willing to dedicate a developer to blazingly fast search, you're probably going to be outperformed by an excellent product like ElasticSearch, Solr, or Xapian.

If I were doing this today, unless my search domain was extremely specific, I would probably use ElasticSearch, Solr, or Xapian for my database-backed full-text search solution. As for which, that depends on your auxiliary needs (platform, type of queries, extensibility, tolerance for one set of quirks over another, etc.)
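
As one concrete possibility on the Solr side (a sketch only: the SolrJ client, the localhost URL, the shared collection name "records", and the reuse of the question's database_name field are all assumptions), the always-apply-the-tenant-filter idea maps onto a filter query (fq); you could equally run a core or collection per client instead.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

import java.io.IOException;

public class SolrTenantSearch {

    public static SolrDocumentList search(String userQuery, String clientDbName)
            throws SolrServerException, IOException {
        // Hypothetical single shared collection named "records"; the tenant
        // restriction is applied server-side as a filter query. clientDbName
        // should come from the authenticated session and be escaped if needed.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/records").build()) {
            SolrQuery query = new SolrQuery(userQuery);
            query.addFilterQuery("database_name:" + clientDbName);
            QueryResponse response = solr.query(query);
            return response.getResults();
        }
    }
}
```

Filter queries are cached separately from the main query in Solr, so a per-tenant restriction like this is cheap to apply on every request.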

On the topic of one large index versus many(!) scattered indices: Both can work. I think the decision really lies with what kind of architecture you're looking to build, and what kind of performance you need. You can be pretty flexible if you decide that a 2-second search response is reasonable, but once you start saying that anything over 200ms is unacceptable, your options start disappearing pretty quickly. While maintaining a single large search index for all of your clients can be vastly more efficient than handling lots of small indices, it's not necessarily faster (as you pointed out). I personally feel that, in a secure environment, the benefit of keeping your client data separated is not to be underestimated. When your index gets corrupted, it won't bring all search to a halt; silly little bugs won't expose sensitive data; user accounts stay modular- it's easier to extract a set of accounts and plop them onto a new server; etc.

I'm not sure if that answered your question, but I hope that I at least satisfied your curiosity :-)

[1]: In 2013, FogBugz began powering its search and filtering capabilities with ElasticSearch. We like it.
