Writing an Inverted Index in C# for an information retrieval application


Question

I am writing an in-house application that holds several pieces of text, along with a number of pieces of data about each of them. This data will be held in a database (SQL Server, although this could change) in order of entry.

I'd like to be able to search these pieces of information by relevance, with the most relevant results at the top. I originally looked into SQL Server Full-Text Search, but it isn't as flexible for my other needs as I had hoped, so it seems I'll need to develop my own solution.

From what I understand, what I need is an inverted index, whose contents would then be restored and modified based on the additional information held (although this can be left for a later date, as for now I just want the inverted index to index the main text from the database table or strings provided).

I've had a crack at writing this in Java using a Hashtable, with the words as keys and a list of each word's occurrences as values, but in all honesty I'm still rather new to C# and have only really used things like DataSets and DataTables when handling information. If requested, I'll upload the Java code once I've cleared this laptop of viruses.

Given a set of entries from a table or from a List of Strings, how could one create an inverted index in C# that preferably saves into a DataSet/DataTable?

Edit: I forgot to mention that I have already tried Lucene and Nutch, but I require my own solution, as modifying Lucene to meet my needs would take far longer than writing an inverted index. I'll be handling a lot of metadata that will also need processing once the basic inverted index is completed, so all I require for now is a basic full-text search on one area using the inverted index. Finally, working on an inverted index isn't something I get to do every day, so it'd be great to have a crack at it.

Recommended answer

Here's a rough overview of an approach I've used successfully in C# in the past:

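 using System.Collections.Generic;

 struct WordInfo
 {
     public int position;  // position of the word within the text field
     public int fieldID;   // ID of the database field containing the word
 }

 // Maps each word to the list of places where it occurs.
 Dictionary<string, List<WordInfo>> invertedIndex = new Dictionary<string, List<WordInfo>>();

 public void BuildIndex()
 {
     foreach (int fieldID in GetDatabaseFieldIDS())
     {
         string textField = GetDatabaseTextFieldForID(fieldID);
         string word;
         int position = 0;

         while (GetNextWord(textField, out word, ref position))
         {
             // Fetch the occurrence list for this word, creating it the first time the word is seen.
             List<WordInfo> occurrences;
             if (!invertedIndex.TryGetValue(word, out occurrences))
             {
                 occurrences = new List<WordInfo>();
                 invertedIndex.Add(word, occurrences);
             }

             WordInfo wi = new WordInfo();
             wi.position = position;
             wi.fieldID = fieldID;
             occurrences.Add(wi);
         }
     }
 }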
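Once the index is built, a lookup with a very simple relevance ordering might look like the sketch below. It is not part of the original answer: the IndexSearch class and its Search method are hypothetical names, and ranking fields by raw occurrence count is only an assumed stand-in for whatever scoring the application really needs.

```csharp
using System.Collections.Generic;
using System.Linq;

struct WordInfo
{
    public int position;
    public int fieldID;
}

static class IndexSearch
{
    // Hypothetical helper: returns the IDs of the fields containing `word`,
    // ordered by how often the word occurs in each field (most occurrences first).
    // Ties are broken by ascending field ID so the result is deterministic.
    public static List<int> Search(Dictionary<string, List<WordInfo>> index, string word)
    {
        List<WordInfo> occurrences;
        if (!index.TryGetValue(word, out occurrences))
        {
            // Unknown word: no matching fields.
            return new List<int>();
        }

        return occurrences
            .GroupBy(wi => wi.fieldID)
            .OrderByDescending(g => g.Count())
            .ThenBy(g => g.Key)
            .Select(g => g.Key)
            .ToList();
    }
}
```

Because each WordInfo records the field it came from, grouping by fieldID is enough to count occurrences per field; the stored positions are not used here but would matter for phrase or proximity queries.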

Notes:

GetNextWord() iterates through the field and returns the next word and its position. To implement it, look at using string.IndexOf() and the char character-type checking methods (char.IsLetter() etc.).
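As a rough illustration, GetNextWord() could be sketched as follows. This is only one possible reading of the signature used in BuildIndex(): the Tokenizer class name is hypothetical, a word is assumed to be any run of letters or digits, words are lower-cased so lookups are case-insensitive, and only the char checks are used (no string.IndexOf()).

```csharp
using System;

static class Tokenizer
{
    // Scans `text` from `position` for the next run of letters or digits.
    // On success, `word` holds the lower-cased token, `position` is advanced
    // to the index just past the token, and the method returns true.
    // Returns false (with `word` set to null) when no further word exists.
    public static bool GetNextWord(string text, out string word, ref int position)
    {
        // Skip separators such as spaces and punctuation.
        while (position < text.Length && !char.IsLetterOrDigit(text[position]))
        {
            position++;
        }

        int start = position;

        // Consume the characters of the word itself.
        while (position < text.Length && char.IsLetterOrDigit(text[position]))
        {
            position++;
        }

        if (position == start)
        {
            word = null;
            return false;
        }

        word = text.Substring(start, position - start).ToLowerInvariant();
        return true;
    }
}
```

Under this sketch the value left in position is the offset just past the word, so that is what BuildIndex() would record; subtracting word.Length gives the start offset if that is preferred.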

GetDatabaseTextFieldForID() and GetDatabaseFieldIDS() are self-explanatory; implement them as required.
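The question also asks about saving the index into a DataSet/DataTable, which the answer above does not cover. One possible sketch, assuming a flat layout with one row per occurrence, is shown below; the IndexExport class and the table and column names are hypothetical.

```csharp
using System.Collections.Generic;
using System.Data;

struct WordInfo
{
    public int position;
    public int fieldID;
}

static class IndexExport
{
    // Hypothetical helper: flattens the in-memory index into a DataTable
    // with one row per (word, field, position) occurrence.
    public static DataTable ToDataTable(Dictionary<string, List<WordInfo>> index)
    {
        DataTable table = new DataTable("InvertedIndex");
        table.Columns.Add("Word", typeof(string));
        table.Columns.Add("FieldID", typeof(int));
        table.Columns.Add("Position", typeof(int));

        foreach (KeyValuePair<string, List<WordInfo>> entry in index)
        {
            foreach (WordInfo wi in entry.Value)
            {
                table.Rows.Add(entry.Key, wi.fieldID, wi.position);
            }
        }

        return table;
    }
}
```

A flat table like this is easy to persist or add to a DataSet, but lookups by word are slower than in the Dictionary, so it is probably better treated as a storage format than as the structure to search against.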
