Inverted index for multiple text files by tokenization

Problem description

I have multiple text files in a folder. I need to create an alphabetically sorted index text file containing all tokens from those files. For each term, the index should store the name of every file the term occurs in and the term's frequency in that file. For example:
one.txt: I am doing my work.
two.txt: I am having the reason (work) to do* this work,
three.txt: Please help me, in doing this work.

posting_file.txt should then look like this:

am    -> <one.txt,1>,<two.txt,1>
doing -> <one.txt,1>,<three.txt,1>
i     -> <one.txt,1>,<two.txt,1>
...
work  -> <one.txt,1>,<two.txt,2>,<three.txt,1>
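A minimal C# sketch of how such a posting file could be built. It is only one possible approach under some assumptions: tokens are lowercase runs of ASCII letters and digits (so "(work)" and "do*" become "work" and "do"), the files match `*.txt`, and the names `PostingBuilder`, `Tokenize`, `Build` and `Format` are made up for this illustration. It uses value tuples, so it needs C# 7 or later.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class PostingBuilder
{
    // Assumption: a token is a run of ASCII letters or digits, compared case-insensitively.
    public static IEnumerable<string> Tokenize(string text)
    {
        foreach (Match m in Regex.Matches(text.ToLowerInvariant(), "[a-z0-9]+"))
            yield return m.Value;
    }

    // Maps each term to its postings (file name, frequency in that file).
    // Terms are kept in ordinal order; postings keep the order of the input documents.
    public static SortedDictionary<string, List<(string File, int Count)>> Build(
        IEnumerable<(string File, string Text)> documents)
    {
        var index = new SortedDictionary<string, List<(string File, int Count)>>(StringComparer.Ordinal);
        foreach (var (file, text) in documents)
        {
            // Count every token of this document first.
            var counts = new Dictionary<string, int>();
            foreach (string token in Tokenize(text))
            {
                counts.TryGetValue(token, out int n);
                counts[token] = n + 1;
            }

            // Then append one posting per distinct term.
            foreach (var pair in counts)
            {
                if (!index.TryGetValue(pair.Key, out var postings))
                {
                    postings = new List<(string File, int Count)>();
                    index[pair.Key] = postings;
                }
                postings.Add((file, pair.Value));
            }
        }
        return index;
    }

    // Renders one line per term, e.g. "work -> <one.txt,1>,<two.txt,2>".
    public static IEnumerable<string> Format(
        SortedDictionary<string, List<(string File, int Count)>> index)
    {
        foreach (var entry in index)
            yield return entry.Key + " -> " +
                string.Join(",", entry.Value.Select(p => "<" + p.File + "," + p.Count + ">"));
    }

    public static void Main(string[] args)
    {
        string folder = args.Length > 0 ? args[0] : ".";
        var documents = Directory.GetFiles(folder, "*.txt")
            // Skip the output file so a second run does not index it.
            .Where(path => Path.GetFileName(path) != "posting_file.txt")
            .OrderBy(path => path, StringComparer.Ordinal)
            .Select(path => (Path.GetFileName(path), File.ReadAllText(path)))
            .ToList();
        File.WriteAllLines(Path.Combine(folder, "posting_file.txt"), Format(Build(documents)));
    }
}
```

When the three sample files are passed to `Build` in the order one.txt, two.txt, three.txt, the last line produced by `Format` is `work -> <one.txt,1>,<two.txt,2>,<three.txt,1>`. `Main` processes files in ordinal path order, so three.txt would come before two.txt there; change the `OrderBy` if a different order is wanted.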

A user should then be able to search for a term, say "work", through a text box, and the result should be displayed like this:

File Name            Frequency
One.txt              1
Two.txt              2
Three.txt            1
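The lookup behind such a search box could look roughly like the sketch below. It assumes the index is already in memory as a dictionary from term to a list of (file name, frequency) pairs and that terms were lowercased when the index was built; `PostingSearch` and `Search` are hypothetical names, and the text box itself is left out so the code stays independent of any UI framework.

```csharp
using System;
using System.Collections.Generic;

public static class PostingSearch
{
    // Returns the rows of the result table for one query term.
    // Only the header row is returned when the term is not in the index.
    public static List<string> Search(
        IDictionary<string, List<(string File, int Count)>> index, string query)
    {
        var rows = new List<string> { string.Format("{0,-20} {1}", "File Name", "Frequency") };

        // Assumption: index terms are lowercase, so the query is normalized the same way.
        string term = query.Trim().ToLowerInvariant();

        if (index.TryGetValue(term, out var postings))
        {
            foreach (var p in postings)
                rows.Add(string.Format("{0,-20} {1}", p.File, p.Count));
        }
        return rows;
    }
}
```

In a Windows Forms or WPF application, the string passed as `query` would come from the text box, and the returned rows could be shown in a list or grid control.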



I think the problem is now clearly described. Could anyone please help me with C# code for it?

Regards!

[edit]Code blocks fixed - OriginalGriff[/edit]

Recommended answer

You could, for instance, also create an "inverted index" file while you are creating posting_file.txt; this would speed up the search.
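One way to read this suggestion, sketched in C# under assumptions: if the file written to disk already maps each term to its postings, the search code can load it once into a dictionary and answer every query with a single lookup instead of rescanning the text files. The line format `term -> <file,count>,<file,count>` is taken from the question; `PostingFileReader` and `Parse` are hypothetical names, and file names containing `<`, `>` or `,` are not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PostingFileReader
{
    // Matches one posting such as "<two.txt,2>".
    private static readonly Regex Posting = new Regex(@"<([^<>,]+),(\d+)>");

    // Parses lines such as "work -> <one.txt,1>,<two.txt,2>" into term -> postings.
    public static Dictionary<string, List<(string File, int Count)>> Parse(IEnumerable<string> lines)
    {
        var index = new Dictionary<string, List<(string File, int Count)>>(StringComparer.Ordinal);
        foreach (string line in lines)
        {
            int arrow = line.IndexOf("->", StringComparison.Ordinal);
            if (arrow < 0) continue; // skip blank or malformed lines

            string term = line.Substring(0, arrow).Trim();
            var postings = new List<(string File, int Count)>();

            // Only look for postings to the right of the arrow.
            foreach (Match m in Posting.Matches(line, arrow))
                postings.Add((m.Groups[1].Value, int.Parse(m.Groups[2].Value)));

            index[term] = postings;
        }
        return index;
    }
}
```

A call such as `PostingFileReader.Parse(System.IO.File.ReadLines("posting_file.txt"))` would load the whole file at startup; after that, each search is a dictionary lookup on the term.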

