How to search for text fragments in a database


Question

Are there any open source or commercial tools available that allow for text fragment indexing of database contents and can be queried from Java?

Background of the question is a large MySQL database table with several hundred thousand records, containing several VARCHAR columns. In these columns people would like to search for fragments of the contents, so a fulltext index (which is based on word boundaries) would not help.

EDIT: [Added to make clear why these first suggestions would not solve the problem:]

This is why MySQL's built-in fulltext index will not do the job, and neither will Lucene or Sphinx, all of which were suggested in the answers. I already looked at both of those, but as far as I can tell they are based on indexing words, excluding stop words and doing all sorts of sensible things for a real fulltext search. However, this is not suitable, because I might be looking for a search term like "oison" which must match "Roisonic Street" as well as "Poison-Ivy". The key difference here is that the search term is just a fragment of the column content, which need not be delimited by any special characters or white space.

EDIT2: [Added some more background info:]

The requested feature that is to be implemented based on this is a very loose search for item descriptions in a merchandise management system. Users often do not know the correct item number, but only part of the name of the item. Unfortunately the quality of these descriptions is rather low; they come from a legacy system and cannot be changed easily. If, for example, people were searching for a sledge hammer they would enter "sledge". With a word/token based index this would not find matches that are stored as "sledgehammer", but only those listed as "sledge hammer". There are all kinds of weird variations that need to be covered, making a token based approach impractical.

Currently the only thing we can do is a LIKE '%searchterm%' query, effectively disabling any index use and requiring lots of resources and time.
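For reference, this is roughly what that looks like from Java via JDBC (connection details, table and column names are made up for this sketch); the leading % wildcard is what keeps MySQL from using any B-tree index, forcing a full scan:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LikeSearch {
    public static void main(String[] args) throws Exception {
        // Database, table and column names are hypothetical.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/merchandise", "user", "password");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT id, description FROM items WHERE description LIKE ?")) {
            // The leading % is what defeats the index: MySQL must scan every row.
            ps.setString(1, "%" + "sledge" + "%");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + ": " + rs.getString("description"));
                }
            }
        }
    }
}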

Ideally any such tool would create an index that allowed me to get results for such queries very quickly, so that I could implement a Spotlight-like search, only retrieving the "real" data from the MySQL table via the primary key when a user picks a result record.

If possible the index should be updatable (without needing a full rebuild), because data might change and should be available for search immediately by other clients.

I would be glad to get recommendations and/or experience reports.

EDIT3: Commercial solution found that "just works"

Even though I got a lot of good answers for this question, I wanted to note here that, in the end, we went with a commercial product called "QuickFind", made and sold by a German company named "HMB Datentechnik". Please note that I am not affiliated with them in any way; I mention this because it might appear like that when I go on to describe what their product can do. Unfortunately their website looks rather bad and is German only, but the product itself is really great. I currently have a trial version from them - you will have to contact them, no downloads - and I am extremely impressed.

As there is no comprehensive documentation available online, I will try and describe my experiences so far.

What they do is build a custom index file based on database content. They can integrate via ODBC, but from what I am told customers rarely do that. Instead - and this is what we will probably do - you generate a text export (like CSV) from your primary database and feed that to their indexer. This allows you to be completely independent of the actual table structure (or any SQL database at all); in fact we export data joined together from several tables. Indexes can be incrementally updated later on the fly.

Based on that, their server (a mere 250kb or so, running as a console app or Windows service) listens for queries on a TCP port. The protocol is text based and looks a little "old", but it is simple and works. Basically you just pass on which of the available indexes you want to query and the search terms (fragments), space-delimited. There are three output formats available: HTML/JavaScript array, XML, or CSV. Currently I am working on a Java wrapper for the somewhat "dated" wire protocol. But the results are fantastic: I currently have a sample data set of approximately 500,000 records with 8 columns indexed, and my test application triggers a search across all 8 columns for the contents of a JTextField on every keystroke as it is edited, and can update the results display (JTable) in real time! This happens without going to the MySQL instance the data originally came from. Based on the columns you get back, you can then ask for the "original" record by querying MySQL with the primary key of that row (which needs to be included in the QuickFind index, of course).
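To give an idea of what such a Java wrapper amounts to, here is a hypothetical sketch of a client for a line-based TCP protocol like the one described. The host, port, and the exact request/response format below are pure assumptions for illustration; the actual QuickFind wire format is not publicly documented:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

public class FragmentSearchClient {
    // Sends "indexName fragment1 fragment2 ..." and reads CSV rows back.
    // This message format is illustrative only, not the real QuickFind protocol.
    public static List<String> search(String host, int port,
                                      String index, String... fragments) throws Exception {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println(index + " " + String.join(" ", fragments));
            List<String> rows = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                rows.add(line); // each line: one CSV result row
            }
            return rows;
        }
    }
}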

The index is about 30-40% the size of the text export version of the data. Indexing was mainly bound by disk I/O speed; my 500,000 records took about a minute or two to process.

This is hard to describe; I found it hard to believe myself when I saw an in-house product demo. They presented a 10 million row address database, searched for fragments of names, addresses and phone numbers, and on hitting the "Search" button results came back in under a second - all done on a notebook! From what I am told, they often integrate with SAP or CRM systems to improve search times when call center agents catch only fragments of a caller's name or address.

So anyway, I probably won't get much better in describing this. If you need something like this, you should definitely go check this out. Google Translate does a reasonably good job translating their website from German to English, so this might be a good start.

Solution

I haven't had this specific requirement myself, but my experience tells me Lucene can do the trick, though perhaps not standalone. I'd definitely use it through Solr as described by Michael Della Bitta in the first answer. The link he gave was spot on - read it for more background.

Briefly, Solr lets you define custom FieldTypes. These consist of an index-time Analyzer and a query-time Analyzer. Analyzers figure out what to do with the text, and each consists of a Tokenizer and zero to many TokenFilters. The Tokenizer splits your text into chunks and then each TokenFilter can add, subtract, or modify tokens.

The field can thus end up indexing something quite different from the original text, including multiple tokens if necessary. So what you want is a multiple-token copy of your original text, which you query by sending Lucene something like "my_ngram_field:sledge". No wildcards involved :-)

Then you follow a model similar to the prefix searching offered up in the solrconfig.xml file:

<fieldType name="prefix_token" class="solr.TextField" positionIncrementGap="1">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldType>

The EdgeNGramFilterFactory is how they implement prefix matching for search box autocomplete. It takes the tokens coming from the previous stages (single whitespace-delimited words transformed into lower case) and fans them out into every substring on the leading edge. sledgehammer = s,sl,sle,sled,sledg,sledge,sledgeh, etc.
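To see concretely what that fan-out produces, and why leading-edge grams alone cannot match an interior fragment like "oison", here is a small standalone Java sketch (illustration only, not Solr code) generating both edge n-grams and full n-grams:

import java.util.LinkedHashSet;
import java.util.Set;

public class NGramDemo {
    // Leading-edge grams only, as EdgeNGramFilterFactory produces them.
    static Set<String> edgeNGrams(String token, int min, int max) {
        Set<String> grams = new LinkedHashSet<>();
        for (int len = min; len <= Math.min(max, token.length()); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    // Every substring of every length: what a full n-gram filter produces.
    static Set<String> allNGrams(String token, int min, int max) {
        Set<String> grams = new LinkedHashSet<>();
        for (int len = min; len <= Math.min(max, token.length()); len++) {
            for (int start = 0; start + len <= token.length(); start++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("sledgehammer", 1, 20));
        // [s, sl, sle, ..., sledgehammer] -- no "ledge", no "hammer"
        System.out.println(allNGrams("roisonic", 2, 20).contains("oison"));
        // true -- interior fragments become index terms
    }
}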

You need to follow this pattern, but replace the EdgeNGramFilterFactory with your own which does all NGrams in the field. The default org.apache.solr.analysis.NGramFilterFactory is a good start, but it does letter transpositions for spell checking. You could copy it and strip that out - it's a pretty simple class to implement.
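As a starting point, here is a sketch of what such a fieldType could look like, following the pattern above with the stock NGramFilterFactory standing in for the custom filter described. The gram sizes are arbitrary here; small minGramSize values inflate the index quickly, so tune them against your data:

<fieldType name="ngram_text" class="solr.TextField" positionIncrementGap="1">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldType>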

Once you have your own FieldType (call it ngram_text) using your own MyNGramFilterFactory, just create your original field and the ngram field like so:

    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="title_ngrams" type="ngram_text" indexed="true" stored="false"/>

Then tell it to copy the original field into the fancy one:

<copyField source="title" dest="title_ngrams"/>

Alright, now when you search "title_ngrams:sledge" you should get a list of documents that contain this. Then in your field list for the query you just tell it to retrieve the field called title rather than the field title_ngrams.
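Since the original requirement was querying from Java, here is a minimal SolrJ sketch of that query (using a later SolrJ client API; the core URL and the id field are assumptions for this sketch). Escaping the user's fragment matters because it is pasted into query syntax:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;

public class FragmentSearch {
    public static void main(String[] args) throws Exception {
        // URL of the Solr core is an assumption for this sketch.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/items").build()) {
            String fragment = ClientUtils.escapeQueryChars("sledge");
            SolrQuery query = new SolrQuery("title_ngrams:" + fragment);
            query.setFields("id", "title"); // retrieve the stored fields, not the ngram copy
            query.setRows(20);
            QueryResponse response = solr.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id") + ": " + doc.getFieldValue("title"));
            }
        }
    }
}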

That should be enough of a nudge to allow you to fit things together and tune it to astonishing performance levels rather easily. At an old job we had a database with over ten million products with large HTML descriptions and managed to get Lucene to do both the standard query and the spellcheck in under 200ms on a mid-sized server handling several dozen simultaneous queries. When you have a lot of users, caching kicks in and makes it scream!

Oh, and incremental (though not real-time) indexing is a cinch. It can even do it under high loads since it creates and optimizes the new index in the background and autowarms it before swapping it in. Very slick.

Good luck!
