C#: how to extract only words from an MS Word document

Question

I am reading an MS Word document using C#. I want only the words (upper case and lower case), not spaces, commas, numbers, special characters, symbols, etc. Kindly help me with a good solution with code. Thanks in advance.

Recommended answer


Hi,
A reliable, professional-grade solution requires a lot of programming and is not a trivial task. One good example can be found online in my free Semantic Analyzer, which extracts words and sentences from arbitrary (and, by the way, multilingual) text and then applies a concordance calculator to compute the frequency of word occurrences.

In general, you must first get a string containing the plain text of interest (no formatting, etc.), then remove all special characters (such as ",", ":", ";") using either String.Replace() or a regular expression, then apply String.Split() with " " as the separator. You will get an array of strings containing the words in the text. In a real-world solution you must do much more string processing, for example collapsing runs of blank spaces "     " into a single one " ". As mentioned above, a complete production-grade solution goes far beyond the scope of a single article and is also subject- and domain-specific. You should probably start with a simple prototype and then adapt it to fit your particular case. For your immediate needs, you can use my free online Semantic Analyzer, which provides reasonable accuracy.

Kind regards,
AB
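
As a rough illustration of the steps described in the answer above (strip unwanted characters with a regular expression, then call String.Split()), here is a minimal sketch. It assumes the plain text has already been read from the Word document into a string; the class name `WordExtractor`, the method name `ExtractWords`, and the sample sentence are hypothetical and not part of the original answer. It also assumes that "words" means runs of ASCII letters only, which is one possible reading of the question.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not the answer author's code.
public static class WordExtractor
{
    // Replaces every run of characters that are not ASCII letters with a
    // single space, then splits on spaces and drops empty entries.
    public static string[] ExtractWords(string plainText)
    {
        if (string.IsNullOrEmpty(plainText))
            return Array.Empty<string>();

        // Assumption: only A-Z and a-z count as word characters.
        string lettersOnly = Regex.Replace(plainText, "[^A-Za-z]+", " ");

        // RemoveEmptyEntries discards the empty strings that leading or
        // trailing spaces would otherwise produce.
        return lettersOnly.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}

public static class Program
{
    public static void Main()
    {
        string sample = "Hello, world! 42 apples; and   3 oranges.";
        string[] words = WordExtractor.ExtractWords(sample);
        Console.WriteLine(string.Join("|", words));
        // prints: Hello|world|apples|and|oranges
    }
}
```

Note that this simple pattern splits contractions such as "don't" into "don" and "t", and it discards non-ASCII letters. For multilingual text, a Unicode letter class such as `\p{L}` in place of `A-Za-z` may be a better starting point.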

