Text indexing algorithm


Question

I am writing a C# WinForms application for an archiving system. The system has a huge database in which some tables will hold more than 1.5 million records. What I need is an algorithm that indexes the content of these records. The files are mainly Microsoft Office, PDF and TXT documents. Can anyone help? Ideas, links, books or code would all be appreciated :)

Example: if I search for the word "international" in a certain folder in the database, I get all the files that contain that word, ordered by some criterion such as relevance, modification date, etc.

Recommended answer

You need to create what is known as an inverted index, a structure that maps each term to the documents containing it, which is at the core of how search engines work (à la Google). Apache Lucene is arguably the best library for inverted indexing. You have 2 options:


  1. Lucene.net - a .NET port of the Java Lucene library.

  2. Apache Solr - a full-fledged search server built on the Lucene libraries and easy to integrate into your .NET application because it has a RESTful API. It comes out of the box with several features such as caching, scaling, spell-checking, etc. You can make your app-to-Solr interaction easier by using the excellent SolrNet library.
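To give a feel for the RESTful option, here is a minimal sketch (in Python, only to keep it short and dependency-free) of how the example search from the question might be expressed as a request to Solr's `/select` handler. The core name `archive` and the field names `content`, `folder` and `modified` are assumptions made up for illustration; they would have to match whatever schema you actually define. The sketch only builds the URL and sends nothing.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, core, term, folder, sort="score desc", rows=20):
    """Build a URL for Solr's /select handler. No request is sent."""
    params = {
        # 'content' and 'folder' are hypothetical field names, not Solr defaults.
        "q": f"content:{term}",        # main query: documents containing the term
        "fq": f'folder:"{folder}"',    # filter query: restrict to one folder
        "sort": sort,                  # "score desc" ranks by relevance
        "rows": rows,                  # number of results to return
        "wt": "json",                  # ask for a JSON response
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Relevance-ordered search for "international" in one folder.
url = build_solr_query("http://localhost:8983", "archive",
                       "international", "/contracts/2012")

# Same search, newest documents first ('modified' is a hypothetical date field).
url_by_date = build_solr_query("http://localhost:8983", "archive",
                               "international", "/contracts/2012",
                               sort="modified desc")
print(url)
print(url_by_date)
```

From a .NET application the same request could be issued with any HTTP client, or through SolrNet as mentioned above. A real implementation would also need to escape query-syntax characters in user input.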

Apache Tika offers a very extensive data/metadata extraction toolkit that works with PDFs, HTML, MS Office docs, etc. A simpler option would be to use the IFilter API. See this article for more details.
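Once plain text has been extracted from each file, the inverted index idea itself is small enough to sketch. The following is a toy illustration in Python, not how Lucene is implemented: it maps each term to the documents containing it along with a term count, then answers the question's example query (a word, within a folder, ordered by relevance or by modification date). The `Doc` fields and the use of raw term frequency as "relevance" are simplifying assumptions; Lucene uses more refined scoring and on-disk structures.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: int
    folder: str
    modified: str  # ISO date string, so string order matches date order
    text: str      # plain text already extracted from the original file


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.docs = {}                     # doc_id -> Doc

    def add(self, doc):
        self.docs[doc.doc_id] = doc
        for term in tokenize(doc.text):
            counts = self.postings[term]
            counts[doc.doc_id] = counts.get(doc.doc_id, 0) + 1

    def search(self, word, folder=None, order_by="relevance"):
        """Return ids of documents containing word, best match first."""
        hits = self.postings.get(word.lower(), {})
        ids = [d for d in hits if folder is None or self.docs[d].folder == folder]
        if order_by == "modified":
            return sorted(ids, key=lambda d: self.docs[d].modified, reverse=True)
        # Default: raw term frequency stands in for relevance.
        return sorted(ids, key=lambda d: hits[d], reverse=True)


index = InvertedIndex()
index.add(Doc(1, "reports", "2012-07-01", "International sales report"))
index.add(Doc(2, "reports", "2012-05-20", "International trade and international law"))
index.add(Doc(3, "memos", "2012-04-11", "Memo about international travel"))
index.add(Doc(4, "reports", "2012-06-02", "Local budget"))

by_relevance = index.search("international", folder="reports")
by_date = index.search("international", folder="reports", order_by="modified")
print(by_relevance)  # [2, 1]: document 2 mentions the word twice
print(by_date)       # [1, 2]: document 1 was modified most recently
```

The key point is that a search looks up one dictionary entry instead of scanning every record, which is what makes this approach workable at the 1.5 million record scale mentioned in the question.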
