What is the best way to find the category of an MS Word document in C#?

Problem description

I am trying to determine the type of MS Word documents and categorize them for a project. The aim of the project is document clustering (i.e. grouping) based on document content, and the objective is semi-supervised learning: grouping documents based on both labelled and unlabelled data. I am reading each document word by word in C#, but I can't find a way to categorize a document based on its content. Can anyone suggest a remedy? Thanks.

Solution

That's called semantic analysis of a text. The easiest way is to define a set of words that are common to a specific document category. Then you compute statistics for the document over these word classes and pick the best-matching group.
If you need deeper analysis, you have to make use of a thesaurus (a semantic graph of a language). For English you can use this one:
http://wordnet.princeton.edu/, but not every language has such a thesaurus ready-made :(
If you have to go even deeper, you will have to do research. Start here: http://www.sersc.org/journals/IJSIP/vol1_no1/papers/07.pdf, http://en.wikipedia.org/wiki/Document_classification
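
The word-set approach described above can be sketched roughly as follows. The category names and vocabularies are hypothetical placeholders, not something from the answer, and the tokenizer is deliberately naive (no stemming, no stop-word handling), so treat this as a starting point rather than a finished classifier.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeywordClassifier
{
    // Hypothetical category vocabularies; replace them with words
    // that are typical of your own document categories.
    public static readonly Dictionary<string, string[]> Categories =
        new Dictionary<string, string[]>
        {
            ["finance"] = new[] { "invoice", "payment", "budget", "tax" },
            ["legal"]   = new[] { "contract", "clause", "liability", "court" },
            ["medical"] = new[] { "patient", "diagnosis", "treatment", "dose" },
        };

    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };

    // Counts how many tokens of the text fall into each category's word set
    // and returns the category with the highest count ("unknown" if no word matched).
    public static string Classify(string text)
    {
        var tokens = text.ToLowerInvariant()
            .Split(Separators, StringSplitOptions.RemoveEmptyEntries);

        string best = "unknown";
        int bestScore = 0;
        foreach (var category in Categories)
        {
            var words = new HashSet<string>(category.Value);
            int score = tokens.Count(t => words.Contains(t));
            if (score > bestScore)
            {
                bestScore = score;
                best = category.Key;
            }
        }
        return best;
    }
}
```

Calling `KeywordClassifier.Classify(documentText)` then yields the best-matching group. For the semi-supervised setup in the question, the labelled documents could be used to build these word lists instead of writing them by hand.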


Word files have the extension .doc or .docx. Read all the files from your drive in a loop, put each file's text into a string, check for the values it contains (e.g. with string.Contains), and categorize accordingly.

Thanks,
Ambesha
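
A minimal sketch of that loop, under some assumptions that are not in the answer: it handles only .docx files (which are ZIP packages whose body text normally lives in `word/document.xml`), it skips the older binary .doc format entirely, and the `DocxScanner` class and keyword-to-category map are hypothetical. On .NET Framework, `ZipFile` additionally needs a reference to System.IO.Compression.FileSystem.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxScanner
{
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Extracts the text of a .docx: the <w:t> runs of each paragraph are
    // concatenated, and paragraphs are separated by newlines.
    public static string ReadText(string path)
    {
        using (var zip = ZipFile.OpenRead(path))
        {
            var entry = zip.GetEntry("word/document.xml");
            if (entry == null) return string.Empty;
            using (var stream = entry.Open())
            {
                var xml = XDocument.Load(stream);
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));
                return string.Join("\n", paragraphs);
            }
        }
    }

    // Assigns each .docx in a folder the category of the first keyword found
    // in its text; IndexOf is used as a case-insensitive stand-in for Contains.
    public static Dictionary<string, string> Categorize(
        string folder, IDictionary<string, string> keywordToCategory)
    {
        var result = new Dictionary<string, string>();
        foreach (var file in Directory.EnumerateFiles(folder, "*.docx"))
        {
            var text = ReadText(file);
            var hit = keywordToCategory.FirstOrDefault(
                kv => text.IndexOf(kv.Key, StringComparison.OrdinalIgnoreCase) >= 0);
            result[file] = hit.Key == null ? "uncategorized" : hit.Value;
        }
        return result;
    }
}
```

Plain substring matching like this is crude: "tax" also matches "syntax", so word-level matching is usually a better fit once the basic loop works.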


It's not clear what you mean by reading it "word by word", but if the document is being read using Open XML then you can just get the document properties (CoreFilePropertiesPart) and look for the subject, keywords or category.
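
As a rough illustration, assuming the Open XML SDK (the DocumentFormat.OpenXml NuGet package) is referenced, the core properties can be read through the package's `PackageProperties`, which exposes the values stored in the core file properties part. The `DocumentMetadata` class and the fallback order (category, then subject, then keywords) are assumptions made for this sketch.

```csharp
using System;
using DocumentFormat.OpenXml.Packaging;

public static class DocumentMetadata
{
    // Returns the Category core property if the author filled it in,
    // falling back to Subject and then Keywords; empty string if none is set.
    public static string GetDeclaredCategory(string path)
    {
        // Open read-only; only the package metadata is inspected.
        using (var doc = WordprocessingDocument.Open(path, false))
        {
            var props = doc.PackageProperties;
            if (!string.IsNullOrWhiteSpace(props.Category)) return props.Category;
            if (!string.IsNullOrWhiteSpace(props.Subject)) return props.Subject;
            if (!string.IsNullOrWhiteSpace(props.Keywords)) return props.Keywords;
            return string.Empty;
        }
    }
}
```

These fields are only useful when authors actually fill them in; many documents leave them empty, in which case content-based classification is still needed.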

