Identification of the important document
Problem description
I have a set of text documents in Java. I have to identify the most important document (the one an expert would pick) using a computer.
For example, I have 10 books on Java, and the system identifies the Java complete reference as the most important or most relevant document (based on its similarity to the Wikipedia page about Java).
One method would be to take a reference document and measure the similarity between it and each document in the set at hand (as in the previous example), then report the document with the maximum similarity as the most important one.
I want to identify other, more efficient methods of doing this. Please suggest other methods for finding the relevant document (in an unsupervised way, if possible).
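For concreteness, the reference-document approach described in the question could be sketched roughly as follows. This is only a minimal illustration, assuming plain term-frequency vectors and cosine similarity as the similarity measure; the class name `ReferenceSimilarity`, its methods, and the sample strings are hypothetical and not from the original post.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rank candidate documents by cosine similarity
// of their term-frequency vectors to a reference document.
public class ReferenceSimilarity {

    // Count how often each lowercase alphanumeric token occurs in the text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int count : vector.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine similarity of two term-frequency vectors; 0.0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            dot += (double) entry.getValue() * b.getOrDefault(entry.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    // Index of the candidate most similar to the reference, or -1 for an empty list.
    static int mostSimilar(String reference, List<String> candidates) {
        Map<String, Integer> referenceVector = termFrequencies(reference);
        int bestIndex = -1;
        double bestScore = -1.0;
        for (int i = 0; i < candidates.size(); i++) {
            double score = cosine(referenceVector, termFrequencies(candidates.get(i)));
            if (score > bestScore) {
                bestScore = score;
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        // Made-up stand-ins for the reference page and the document set.
        String reference = "java is a programming language that runs on the java virtual machine";
        List<String> documents = List.of(
                "cooking pasta with tomato sauce",
                "java classes objects and the java virtual machine");
        int best = mostSimilar(reference, documents);
        System.out.println("Most similar document: " + documents.get(best));
    }
}
```

Raw term counts favour long documents and common words, so a real system would probably weight terms (for example with TF-IDF) before comparing.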
Recommended answer
You are talking about ranked full-text search. Try looking at Lucene, the full-text search engine:
http://incubator.apache.org/lucene.net/