Lucene and SQL Server - best practice

Question

I am pretty new to Lucene, so I would like to get some help from you guys :)

BACKGROUND: Currently I have documents stored in SQL Server, and I want to use Lucene for full-text and tag searches on those documents.

Q1) In this case, in order to do keyword searches on the documents, should I insert all of those documents into the Lucene index? Does this mean the data will be duplicated (one copy in SQL Server and another in the Lucene index)? That could be a problem, since we have a massive amount of documents (about 100GB). Is it inevitable?

Q2) Also, each document has a set of tags (up to 3). Is Lucene also a good choice for tag search? If so, how should I do it?

Thanks

Recommended answer

Yes, providing full-text search through Lucene and data storage through a traditional database is a well-supported architecture. Take a look here for a brief introduction. A typical implementation is to index everything you want to be searchable, store only a unique identifier in the Lucene index, and pull any records found by a search from the database by that ID. If you want to reduce DB load, you can store some information in Lucene to display a list of search results, and query the database only to fetch the full document.
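
The split described above can be illustrated with a toy model in plain Java. This is only a sketch under stated assumptions, not Lucene or SQL Server code: the class IdOnlyIndexSketch, its maps, and its method names are all hypothetical, a HashMap stands in for the database table, and a term-to-IDs map stands in for the inverted index. It shows the three roles: indexed terms that point to IDs, a small stored title kept beside the index for result lists, and the full text fetched from the database by ID.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the index/database split. None of this is Lucene or
// SQL Server API; all class and method names are hypothetical.
class IdOnlyIndexSketch {
    // Stand-in for the SQL Server table: document ID -> full text.
    private final Map<Integer, String> database = new HashMap<>();
    // "Indexed" data: term -> sorted IDs of the documents containing it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // "Stored" data kept beside the index: just enough to render a result list.
    private final Map<Integer, String> storedTitles = new HashMap<>();

    void addDocument(int id, String title, String body) {
        database.put(id, body);       // the full text lives only in the "database"
        storedTitles.put(id, title);  // small value duplicated on purpose
        for (String term : body.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    // Search step: returns matching document IDs only, never the text.
    List<Integer> search(String term) {
        Set<Integer> ids = postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
        return new ArrayList<>(ids);
    }

    // Result-list step: answered from the index side, no database hit.
    String storedTitle(int id) {
        return storedTitles.get(id);
    }

    // Detail step: the full document comes from the database, by ID.
    String fetchFromDatabase(int id) {
        return database.get(id);
    }

    public static void main(String[] args) {
        IdOnlyIndexSketch sketch = new IdOnlyIndexSketch();
        sketch.addDocument(1, "Intro", "Lucene builds an inverted index");
        sketch.addDocument(2, "Storage", "SQL Server stores the documents");
        for (int id : sketch.search("index")) {
            System.out.println(id + ": " + sketch.storedTitle(id));
            System.out.println("  full text: " + sketch.fetchFromDatabase(id));
        }
    }
}
```

In this model the only data held twice is the ID and the short title; whether that trade-off suits a 100GB corpus depends on how much you decide to store beside the index.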

As for saving space, there will be some measure of duplication. This is true even if you only use Lucene, though: Lucene stores the inverted index used for searching entirely separately from stored data. To save space, I'd recommend being very deliberate about which data you choose to index, and which data you need to store and be able to retrieve later. What you store matters most for space in Lucene, since indexed-only values tend to be very space-efficient in most cases.

Lucene can certainly implement a tag search. The simple way to do it is to add each tag to a field of your choosing (I'll call it "tags", which seems to make sense) while building the document, such as:

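// Each add() appends one more value to the "tags" field:
// indexed for search (Field.Index.ANALYZED) but not stored (Field.Store.NO).
document.add(new Field("tags", "widget", Field.Store.NO, Field.Index.ANALYZED));
document.add(new Field("tags", "forkids", Field.Store.NO, Field.Index.ANALYZED));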

Then I could simply add a required term to any query to restrict the search to a particular tag. For instance, if I wanted to search for "some stuff", but only among documents tagged "forkids", I could write a query like:

some stuff +tags:forkids
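
To make the effect of the + prefix concrete, here is a toy model in plain Java. It is a sketch, not Lucene code: RequiredTagSketch, Doc, and search are hypothetical names, and the match-count score is a crude stand-in for Lucene's real relevance scoring. It assumes the classic query parser's default OR operator, under which the unprefixed terms (some, stuff) are optional and only influence ranking, while the + clause must match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of a query such as: some stuff +tags:forkids
// Hypothetical names throughout; this does not use or imitate Lucene's API.
class RequiredTagSketch {
    static final class Doc {
        final int id;
        final Set<String> words;
        final Set<String> tags;

        Doc(int id, String text, String... tags) {
            this.id = id;
            this.words = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
            this.tags = new HashSet<>(Arrays.asList(tags));
        }
    }

    // Returns the IDs of documents carrying requiredTag, best match first.
    static List<Integer> search(List<Doc> docs, String requiredTag, String... optionalTerms) {
        List<int[]> hits = new ArrayList<>(); // each entry is {id, score}
        for (Doc doc : docs) {
            if (!doc.tags.contains(requiredTag)) {
                continue; // required clause: no tag, no match
            }
            int score = 0;
            for (String term : optionalTerms) {
                if (doc.words.contains(term)) {
                    score++; // optional clause: only raises the rank
                }
            }
            hits.add(new int[] {doc.id, score});
        }
        // Highest score first; ties broken by ascending ID for a stable order.
        hits.sort((a, b) -> a[1] != b[1] ? Integer.compare(b[1], a[1]) : Integer.compare(a[0], b[0]));
        List<Integer> ids = new ArrayList<>();
        for (int[] hit : hits) {
            ids.add(hit[0]);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc(1, "some stuff for children", "forkids"),
                new Doc(2, "some stuff for adults", "widget"),
                new Doc(3, "other things", "forkids", "widget"));
        System.out.println(search(docs, "forkids", "some", "stuff"));
    }
}
```

Under that assumption, a document that carries the tag but contains neither search term still matches, just with a lower rank. If every term should be mandatory, each one would need its own + prefix, as in +some +stuff +tags:forkids.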
