Text indexing algorithm


Question

I am writing a C# WinForms application for an archiving system. The system has a huge database in which some tables will have more than 1.5 million records. What I need is an algorithm that indexes the content of these records. The files are mainly Microsoft Office, PDF and TXT documents. Can anyone help? Ideas, links, books or code would all be appreciated :)

Example: if I search for the word "international" in a certain folder in the database, I get all the files that contain that word, ordered by some criterion such as relevance or modification date.

Recommended answer

You need to create what is known as an inverted index, which is at the core of how search engines work (à la Google). An inverted index maps each term to the list of documents that contain it, so a search looks up the term instead of scanning every document. Apache Lucene is arguably the best library for inverted indexing. You have two options:


  1. Lucene.net - a .NET port of the Java Lucene library.

  2. Apache Solr - a full-fledged search server built on the Lucene libraries and easy to integrate into your .NET application because it has a RESTful API. It comes out of the box with several features such as caching, scaling and spell-checking. You can simplify your app-to-Solr interaction with the excellent SolrNet library.
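As a rough illustration of the RESTful API mentioned for Solr, the sketch below builds the URL for a search against Solr's standard `/select` request handler. It is written in Python only to keep it short and language-neutral; the same URL can be requested from C#. The core name `archive` and the field names `content`, `folder` and `modified` are assumptions made up for this example and depend entirely on how your schema is defined. The sketch also does not escape Solr query syntax characters in the search term.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, core, term, folder, sort_by_date=False):
    """Build a URL for a query against Solr's /select request handler.

    The field names used here (content, folder, modified) are hypothetical
    and must match whatever your own Solr schema defines.
    """
    params = {
        # Main query: documents whose (assumed) content field contains the term.
        "q": f"content:{term}",
        # Filter query: restrict the results to one (assumed) folder value.
        "fq": f'folder:"{folder}"',
        # Order by relevance score, or by the (assumed) modification date field.
        "sort": "modified desc" if sort_by_date else "score desc",
        "rows": 20,
        "wt": "json",
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # 8983 is Solr's default port; "archive" is a made-up core name.
    print(build_solr_query_url("http://localhost:8983", "archive",
                               "international", "/contracts"))
```

Keeping the folder restriction in `fq` rather than in `q` means it narrows the result set without influencing the relevance score.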

Apache Tika offers a very extensive data/metadata extraction toolkit that works with PDF, HTML, MS Office documents and more. A simpler option would be to use the IFilter API. See this article for more details.
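To make the inverted index idea concrete, here is a minimal in-memory sketch in Python. It assumes the text has already been extracted from each file (for example with Tika or IFilter) and that each document carries a folder and a modification date stored as an ISO date string. The class and method names are hypothetical, and ranking by raw term count is a simplification; Lucene's actual scoring is more sophisticated.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: maps each term to the documents that contain it."""

    def __init__(self):
        # term -> {doc_id: number of occurrences of the term in that document}
        self.postings = defaultdict(dict)
        # doc_id -> (folder, modified); modified is an ISO date string
        self.docs = {}

    def add(self, doc_id, folder, modified, text):
        """Index already-extracted text. Each doc_id should be added only once."""
        self.docs[doc_id] = (folder, modified)
        for token in re.findall(r"\w+", text.lower()):
            counts = self.postings[token]
            counts[doc_id] = counts.get(doc_id, 0) + 1

    def search(self, term, folder=None, order_by="relevance"):
        """Return ids of documents containing the term, optionally in one folder."""
        hits = self.postings.get(term.lower(), {})
        results = [
            doc_id for doc_id in hits
            if folder is None or self.docs[doc_id][0] == folder
        ]
        if order_by == "modified":
            # Newest first; ISO date strings sort chronologically.
            results.sort(key=lambda d: self.docs[d][1], reverse=True)
        else:
            # Crude relevance: more occurrences of the term rank higher.
            results.sort(key=lambda d: hits[d], reverse=True)
        return results


if __name__ == "__main__":
    index = InvertedIndex()
    index.add("a.txt", "/legal", "2020-01-01", "International law and international trade")
    index.add("b.txt", "/legal", "2021-06-01", "An international treaty")
    print(index.search("international", folder="/legal"))
    print(index.search("international", folder="/legal", order_by="modified"))
```

A lookup touches only the postings for the searched term, which is why this structure scales to millions of records far better than scanning each document at query time.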
