Algorithms or libraries for textual analysis, specifically: dominant words, phrases across text, and collection of text

Question

I'm working on a project where I need to analyze a page of text, and collections of pages of text, to determine the dominant words. I'd like to know if there is a library (preferably C# or Java) that will handle the heavy lifting for me. If not, is there an algorithm, or several, that would achieve my goals below?

What I want to do is similar to the word clouds built from a URL or RSS feed that you find on the web, except I don't want the visualization. They are used all the time for analyzing presidential candidates' speeches to see what the theme or most-used words are.

The complication is that I need to do this on thousands of short documents, and then on collections or categories of these documents.

My initial plan was to parse each document, then filter out common words ("of", "the", "he", "she", etc.), and then count the number of times the remaining words appear in the text (and in the overall collection/category).
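As a rough illustration of that plan, here is a minimal Java sketch (one of the asker's preferred languages). The regex-split tokenizer and the tiny stop list are illustrative placeholders, not a production setup:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class WordCounter {

    // A tiny illustrative stop list; a real one would be much larger.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("of", "the", "he", "she", "a", "an", "and", "to", "in", "is"));

    /** Lowercases, splits on non-letters, drops stop words, and counts the rest. */
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Counts "speech" twice and "theme" once; stop words are dropped.
        System.out.println(countWords("The speech and the theme of the speech"));
    }
}
```

Aggregating per-category counts is then just merging these maps across the documents in a category.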

The problem is that in the future I would like to handle stemming, plural forms, etc. I would also like to see if there is a way to identify important phrases (instead of a count of a single word, the count of a phrase of 2-3 words appearing together).

Any guidance on strategies, libraries, or algorithms that would help is appreciated.

Answer

One option for what you're doing is term frequency-inverse document frequency, or tf-idf. The strongest terms will have the highest weighting under this calculation. Check it out here: http://en.wikipedia.org/wiki/Tf-idf

Another option is to use something like a naive Bayes classifier, using words as features, and find the strongest features in the text to determine the class of the document. This would work similarly with a maximum entropy classifier.

As far as tools to do this, the best tool to start with would be NLTK, a Python library with extensive documentation and tutorials: http://nltk.sourceforge.net/

For Java, try OpenNLP: http://opennlp.sourceforge.net/

For the phrase part, consider the second option I offered, using bigrams and trigrams as features, or even as terms in tf-idf.

Good luck!
