Algorithms or libraries for textual analysis, specifically: dominant words, phrases across text, and collection of text

Problem Description

I'm working on a project where I need to analyze a page of text, and collections of pages of text, to determine the dominant words. I'd like to know if there is a library (preferably C# or Java) that will handle the heavy lifting for me. If not, are there one or more algorithms that would achieve my goals below?

What I want to do is similar to the word clouds built from a URL or RSS feed that you find on the web, except I don't want the visualization. They are used all the time for analyzing presidential candidates' speeches to see what the themes or most-used words are.

The complication is that I need to do this on thousands of short documents, and then on collections or categories of those documents.

My initial plan was to parse the document, then filter out common words such as "of", "the", "he", "she", etc. Then I would count the number of times the remaining words show up in the text (and in the overall collection/category).
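
To make the plan concrete, here is a minimal Java sketch of that filter-and-count step; the tiny stop-word list and the letters-only tokenization are just placeholders for illustration:

```java
import java.util.*;

public class WordCounter {
    // A small stop-word list; a real one would be much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("of", "the", "he", "she", "a", "an", "and", "to", "in", "is"));

    // Tokenize on non-letter characters, lower-case, drop stop words, and count the rest.
    public static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String page = "The quick brown fox jumps over the lazy dog. The dog sleeps.";
        // Prints something like {dog=2, quick=1, brown=1, fox=1, jumps=1, over=1, lazy=1, sleeps=1}
        System.out.println(countWords(page));
    }
}
```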

The problem is that, down the road, I would like to handle stemming, plural forms, etc. I would also like to see if there is a way to identify important phrases (instead of counting a single word, counting a phrase of 2-3 words together).

Any guidance on strategies, libraries, or algorithms that would help is appreciated.

Answer

One option for what you're doing is term frequency-inverse document frequency, or tf-idf. Under this calculation, the strongest terms will have the highest weighting. Check it out here: http://en.wikipedia.org/wiki/Tf-idf
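
As a rough illustration (not a reference implementation), tf-idf can be computed directly on top of per-document word counts like those described in the question; this sketch uses the plain tf · log(N/df) form, and real implementations vary in how they smooth the idf term:

```java
import java.util.*;

public class TfIdf {
    // Computes tf-idf weights for one document against the whole collection.
    // Each document is assumed to already be a term -> raw count map (for example,
    // the output of a word-counting step), and the document being scored is assumed
    // to be part of the collection, so the document frequency is never zero.
    public static Map<String, Double> tfIdf(Map<String, Integer> doc,
                                            List<Map<String, Integer>> collection) {
        int n = collection.size();
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> entry : doc.entrySet()) {
            String term = entry.getKey();
            int df = 0; // document frequency: how many documents contain the term
            for (Map<String, Integer> other : collection) {
                if (other.containsKey(term)) {
                    df++;
                }
            }
            double idf = Math.log((double) n / df);
            weights.put(term, entry.getValue() * idf); // tf * idf
        }
        return weights;
    }
}
```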

Another option is to use something like a naive Bayes classifier, using words as features, and find the strongest features in the text to determine the class of the document. This would work similarly with a maximum entropy classifier.
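
For illustration, a bare-bones multinomial naive Bayes over word counts looks roughly like the sketch below (add-one smoothing, log probabilities); the class and method names are made up for the example, and a library implementation will be more robust:

```java
import java.util.*;

public class NaiveBayesSketch {
    // wordCounts.get(category).get(word) = how often the word appears in training
    // documents of that category; docCounts.get(category) = number of training docs.
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    public void train(String category, List<String> words) {
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    // Returns the category maximizing log P(category) + sum of log P(word | category),
    // with add-one (Laplace) smoothing so unseen words do not zero out a category.
    public String classify(List<String> words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            Map<String, Integer> counts = wordCounts.get(category);
            int totalWordsInCategory = counts.values().stream().mapToInt(Integer::intValue).sum();
            double score = Math.log((double) docCounts.get(category) / totalDocs);
            for (String w : words) {
                int c = counts.getOrDefault(w, 0);
                score += Math.log((c + 1.0) / (totalWordsInCategory + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }
}
```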

As far as tools to do this go, the best one to start with would be NLTK, a Python library with extensive documentation and tutorials: http://nltk.sourceforge.net/

For Java, try OpenNLP: http://opennlp.sourceforge.net/

For the phrase part, consider the second option I offered: use bigrams and trigrams as features, or even as terms in tf-idf.
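
Extracting those bigrams and trigrams from an already-tokenized document is straightforward; in this sketch each n-gram is joined into a single string so it can be counted or weighted exactly like a single word:

```java
import java.util.*;

public class NGrams {
    // Slide a window of size n over the token list and join each window with spaces,
    // so the resulting phrases can be counted (or fed into tf-idf) like single words.
    public static List<String> nGrams(List<String> tokens, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("presidential", "candidate", "speech", "analysis");
        System.out.println(nGrams(tokens, 2)); // [presidential candidate, candidate speech, speech analysis]
        System.out.println(nGrams(tokens, 3)); // [presidential candidate speech, candidate speech analysis]
    }
}
```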

Good luck!
