What is the best way to find the category of an MS Word document in C#?
Question
I am trying to determine the type of MS Word documents and categorize them for a project. The aim of the project is document clustering (i.e., grouping) based on the content of the documents. The objective is semi-supervised learning: grouping documents based on both labelled and unlabelled data. I am reading each document word by word in C#, but I can't find a way to categorize a document based on its content. Can anyone suggest a solution? Thanks.
Solution
This is called semantic analysis of a text. The easiest way is to define a set of words that are common to a specific document category. Then you compute statistics for the document over those word classes and select the best-matching group.
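The word-set idea above can be sketched roughly as follows. This is only a minimal illustration, not the answerer's code: the class name `KeywordCategorizer`, the two categories, and their word lists are hypothetical placeholders, and a real project would derive the lists from its labelled documents.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordCategorizer
{
    // Hypothetical word lists per category (assumption, for illustration only).
    public static readonly Dictionary<string, HashSet<string>> CategoryWords =
        new Dictionary<string, HashSet<string>>
        {
            ["finance"] = new HashSet<string> { "invoice", "payment", "budget", "tax" },
            ["legal"] = new HashSet<string> { "contract", "clause", "liability", "court" },
        };

    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };

    // Counts, for each category, how many words of the text are in its word list.
    public static Dictionary<string, int> Score(string text)
    {
        var words = text.ToLowerInvariant()
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries);

        return CategoryWords.ToDictionary(
            pair => pair.Key,
            pair => words.Count(w => pair.Value.Contains(w)));
    }

    // Returns the category with the highest count, or "uncategorized"
    // when none of the listed words occurs in the text.
    public static string Classify(string text)
    {
        var best = Score(text).OrderByDescending(s => s.Value).First();
        return best.Value > 0 ? best.Key : "uncategorized";
    }
}
```

Calling `KeywordCategorizer.Classify(documentText)` returns the name of the best-matching group. Raw counts favour long documents and long word lists, so dividing each count by the total number of words may be a reasonable refinement.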
If you need a deeper analysis, you have to make use of a thesaurus (a semantic graph of a language). For English you can use this one: http://wordnet.princeton.edu/, but not every language has such a thesaurus readily available :(
If you have to go even deeper, you will have to do research. Start here: http://www.sersc.org/journals/IJSIP/vol1_no1/papers/07.pdf, http://en.wikipedia.org/wiki/Document_classification
Word files have the extension .doc or .docx. Read all the files from your drive in a loop, load the content of each into a string, check it for the values you are looking for (e.g., with string.Contains), and categorize the file accordingly.
Thanks,
Ambesha
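One possible way to get the content of each file into a string is sketched below. It covers only .docx files, which are ZIP packages, and not the older binary .doc format. The class name `DocxTextReader` and the assumption that the main document part sits at the conventional path word/document.xml are mine, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxTextReader
{
    // WordprocessingML main namespace, used by w:p (paragraph) and w:t (text).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Extracts the body text of a .docx stream, one line per paragraph.
    // Nested content such as text boxes is not handled specially.
    public static string ReadText(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the main part is stored at word/document.xml.
            var entry = zip.GetEntry("word/document.xml");
            if (entry == null) return string.Empty;

            using (var part = entry.Open())
            {
                var xml = XDocument.Load(part);
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));
                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }

    // Reads every .docx file in a folder and maps each file path to its text.
    public static Dictionary<string, string> ReadFolder(string folder)
    {
        var texts = new Dictionary<string, string>();
        foreach (var path in Directory.EnumerateFiles(folder, "*.docx"))
        {
            using (var file = File.OpenRead(path))
            {
                texts[path] = ReadText(file);
            }
        }
        return texts;
    }
}
```

Each returned string can then be checked with `string.Contains` or handed to whatever categorization step the project uses.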
It's not clear what you mean by reading it "word by word", but if the document is being read using Open XML, you can simply get the document properties (CoreFilePropertiesPart) and look at the subject, keywords, or category.
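As a rough illustration of reading those properties, the sketch below does not use the Open XML SDK. It opens the .docx package directly with `ZipArchive` and parses the core-properties part. The names `CoreProperties` and `DocxMetadata` are hypothetical, and the code assumes the part is stored at the conventional path docProps/core.xml.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

public sealed class CoreProperties
{
    public string Subject { get; set; } = "";
    public string Keywords { get; set; } = "";
    public string Category { get; set; } = "";
}

public static class DocxMetadata
{
    // Namespaces used by the package core-properties part.
    private static readonly XNamespace Cp =
        "http://schemas.openxmlformats.org/package/2006/metadata/core-properties";
    private static readonly XNamespace Dc = "http://purl.org/dc/elements/1.1/";

    // Reads subject, keywords and category from a .docx stream.
    // Properties that are missing from the file stay empty strings.
    public static CoreProperties Read(Stream docx)
    {
        var props = new CoreProperties();
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            // Assumption: the core properties are stored at docProps/core.xml.
            var entry = zip.GetEntry("docProps/core.xml");
            if (entry == null) return props;

            using (var part = entry.Open())
            {
                var root = XDocument.Load(part).Root;
                props.Subject = (string)root.Element(Dc + "subject") ?? "";
                props.Keywords = (string)root.Element(Cp + "keywords") ?? "";
                props.Category = (string)root.Element(Cp + "category") ?? "";
            }
        }
        return props;
    }
}
```

These fields are only useful when authors actually fill them in; many documents leave them empty, in which case content-based categorization is still needed.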