Implement Lucene on Existing .NET / SQL Server stack with multiple webservers


Question

I want to look at using Lucene as a full-text search solution for a site that I currently manage. The site is built entirely on SQL Server 2008 / C# .NET 4 technologies. The data I'm looking to index is actually quite simple, with only a couple of fields per record and only one of those fields actually searchable.

It's not clear to me which toolset or which architecture I should be using. Specifically:

  1. Where should I put the index? I've seen people recommend putting it on the webserver, but that would seem wasteful for a large number of webservers. Surely centralising would be better here?

  2. If the index is centralised, how would I query it, given that it just lives on the filesystem? Will I have to effectively put it on a network share that all the webservers can see?

  3. Are there any pre-existing tools that will incrementally populate a Lucene index on a schedule, pulling the data from an SQL Server database? Would I be better off rolling my own service here?

  4. When I query the index, should I just pull back a bunch of record IDs and then go back to the DB for the actual records, or should I aim to pull everything I need for the search straight out of the index?

  5. Is there value in trying to implement something like Solr in this flavour of environment? If so, I'd probably give it its own *nix VM and run it within Tomcat on that. But I'm not sure what Solr would buy me in this case.

Solution

I'll answer a bit based on how we chose to implement Lucene.Net here on Stack Overflow, and some lessons I learned along the way:

Where should I put the index? I've seen people recommend putting it on the webserver, but that would seem wasteful for a large number of webservers. Surely centralising would be better here?

  • It depends on your goals. We had a severely under-utilized web tier (~10% CPU) and an overloaded database doing full-text searching (around 60% CPU; we wanted it lower). Loading the same index onto each web server let us use those machines and gave us a ton of redundancy: we can still lose 9 out of 10 web servers and keep the Stack Exchange network up if need be. The downside is that this is very IO (read) intensive for us, and the web tier was not bought with this in mind (as is often the case at most companies). While it works fine, we'll still be upgrading our web tier to SSDs and implementing some other bits left out of the .NET port (NIOFSDirectory, for example) to compensate for this hardware deficiency.
  • The other downside is that we index all our databases n times, once per web server. Luckily we're not starved for network bandwidth, and SQL Server caching the results makes each pass a very fast delta indexing operation. With a large number of web servers, that alone may eliminate this option.

If the index is centralised, how would I query it, given that it just lives on the filesystem? Will I have to effectively put it on a network share that all the webservers can see?

  • You can query it on a file share either way; just make sure only one process is indexing at a time. The directory locking mechanism (write.lock) will enforce this and raise an error when you try to open multiple IndexWriters at once.
  • Keep in mind my notes above: this is IO intensive when a lot of readers are flying around, so you need ample bandwidth to your store. With anything short of iSCSI or a fiber SAN, I'd be cautious of this approach for high-traffic use (hundreds of thousands of searches a day).
  • Another consideration is how you update/alert your web servers (or whatever tier is querying the index). When you finish an indexing pass, you'll need to re-open your IndexReaders to get the updated index with the new documents. We use a Redis messaging channel to alert whoever cares that the index has been updated; any messaging mechanism would work here.
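The re-open step in the last point can be sketched roughly as follows. This is a minimal, stdlib-only Python illustration rather than Lucene.Net code: `SearcherHolder` and the `open_reader` callable are hypothetical names, and `on_index_updated` stands in for whatever handler your messaging mechanism invokes when the indexer announces a finished pass.

```python
import threading


class SearcherHolder:
    """Holds the current index reader and swaps it when the index changes.

    `open_reader` is a hypothetical callable that opens a fresh reader over
    the index directory; in a real application it would wrap the search
    library's own "open reader" call.
    """

    def __init__(self, open_reader):
        self._open_reader = open_reader
        self._lock = threading.Lock()
        self._reader = open_reader()

    def current(self):
        # Searches always go through the most recently opened reader.
        with self._lock:
            return self._reader

    def on_index_updated(self, _message=None):
        # Open the new reader before swapping so searches never see a gap.
        fresh = self._open_reader()
        with self._lock:
            old, self._reader = self._reader, fresh
        # The caller can close the old reader once in-flight searches finish.
        return old
```

Each web server would subscribe to the "index updated" channel and call `on_index_updated` from the subscriber callback; request handlers only ever call `current()`.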

Are there any pre-existing tools that will incrementally populate a Lucene index on a schedule, pulling the data from an SQL Server database? Would I be better off rolling my own service here?

  • Unfortunately there are none that I know of, but I can share with you how I approached this.
  • When indexing a specific table (akin to a document in Lucene), we added a rowversion to that table. When we index, we select based on the last rowversion (a timestamp datatype, pulled back as a bigint). I chose to store the last index date and last indexed rowversion on the file system in a simple .txt file for one reason: everything else in Lucene is stored there. This means that if there's ever a large problem, you can just delete the folder containing the index, and the next indexing pass will recover and produce a fully up-to-date index. Just add some code so that nothing being there means "index everything".
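The checkpoint logic described above could look something like this. It is a hedged Python sketch, not the code used at Stack Overflow: the function names, the in-memory `rows` list (standing in for a `WHERE rowversion > @last` query against SQL Server) and the `index_row` callback are all assumptions made for illustration, and only the rowversion is stored, not the last index date.

```python
import os


def read_checkpoint(path):
    """Return the last indexed rowversion, or 0 when no checkpoint exists."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        # Nothing there (or unreadable) means "index everything".
        return 0


def write_checkpoint(path, rowversion):
    """Write the checkpoint via a temp file so a crash cannot truncate it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(str(rowversion))
    os.replace(tmp, path)


def indexing_pass(rows, checkpoint_path, index_row):
    """Index every row changed since the last pass; return how many were indexed."""
    last = read_checkpoint(checkpoint_path)
    # Stand-in for: SELECT ... WHERE rowversion > @last ORDER BY rowversion
    changed = sorted(
        (r for r in rows if r["rowversion"] > last),
        key=lambda r: r["rowversion"],
    )
    for row in changed:
        index_row(row)
        last = row["rowversion"]
    if changed:
        write_checkpoint(checkpoint_path, last)
    return len(changed)
```

Keeping the checkpoint file inside the index folder is what makes "delete the folder and let the next pass rebuild" work: the missing file resets the starting rowversion to 0.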

When I query the index, should I be looking to just pull back a bunch of record id's which I then go back to the DB for the actual record, or should I be aiming to pull everything I need for the search straight out of the index?

  • This really depends on your data. For us it's not really feasible to store everything in the index (nor is this recommended). What I suggest is that you store in the index the fields needed for your search results, by which I mean what you need to present the results in a list, before the user clicks through to the full [insert type here].
  • Another consideration is how often your data is changing. If a lot of fields you're not searching on are changing rapidly, you'll need to re-index those rows (documents) to update your index, not only when the field you're searching on changes.
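To make the trade-off concrete, here is a small Python sketch under assumed names: the field lists and row shape are hypothetical, not taken from any real schema. It shows that only the list-display fields go into the index document, and that a change to any of those stored fields, searched or not, forces a re-index of the row.

```python
# Hypothetical field choices for illustration only.
SEARCHED_FIELDS = ("title",)               # fields the query actually matches against
STORED_FIELDS = ("id", "title", "score")   # fields needed to render the results list


def to_index_document(row):
    """Keep only what the results list needs; the full record stays in the DB."""
    return {name: row[name] for name in STORED_FIELDS}


def needs_reindex(old_row, new_row):
    """A row must be re-indexed when any field stored in the index changes."""
    return any(old_row[name] != new_row[name] for name in STORED_FIELDS)
```

With this split, the results page renders straight from the index, and the database is only hit (by `id`) when the user opens a single record. The fewer volatile fields you store, the less often `needs_reindex` fires.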

Is there value in trying to implement something like Solr in this flavour environment? If so, I'd probably give it it's own *nix VM and run it within Tomcat on that. But I'm not sure what Solr would buy me in this case.

  • Sure there is; it's the centralized search you're talking about (with a high number of searches you may again hit a limit with a VM setup, so keep an eye on this). We didn't do this because it introduced a lot of (we feel) unwarranted complexity into our technology stack and build process, but for a larger number of web servers it makes much more sense.
  • What does it buy you? Performance mainly, and dedicated indexing server(s). Instead of n servers crawling a network share (and competing for IO), they can hit a single server that only exchanges requests and results over the network, rather than crawling the index, which is a lot more data going back and forth. The index would be local on the Solr server(s). Also, you're not hitting your SQL Server as much, since fewer servers are indexing.
  • What it doesn't buy you is as much redundancy, but it's up to you how important this is. If you can operate fine on degraded search or without it, simply have your app handle that. If you can't, then one or more backup Solr servers may also be a valid solution, though it is possibly another software stack to maintain.
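"Have your app handle that" might look like the following Python sketch. The `backends` list is an assumption: each entry is a hypothetical callable that queries one search server (primary first, then any backups) and raises `ConnectionError` when that server is unreachable.

```python
def search_with_fallback(query, backends):
    """Try each search backend in order; degrade gracefully if all are down."""
    for backend in backends:
        try:
            return {"results": backend(query), "degraded": False}
        except ConnectionError:
            # This server is unavailable; try the next one, if any.
            continue
    # No backend answered: return an empty result the UI can explain to the user.
    return {"results": [], "degraded": True}
```

The `degraded` flag lets the page show a "search is temporarily unavailable" notice instead of failing the whole request.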
