Word frequency algorithm for natural language processing

Question

Without getting a degree in information retrieval, I'd like to know if there exist any algorithms for counting the frequency with which words occur in a given body of text. The goal is to get a "general feel" of what people are saying over a set of textual comments, along the lines of Wordle.

I'd like to:

  • ignore articles, pronouns, etc. ('a', 'an', 'the', 'him', 'them', etc.)
  • preserve proper nouns
  • ignore hyphenation, except for the soft kind

Reaching for the stars, these would be peachy:

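  • handling stemming and plurals (e.g. like, likes, liked, liking match the same result)
  • grouping of adjectives (adverbs, etc.) with their subjects ("great service" as opposed to "great", "service")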

I've attempted some basic stuff using WordNet, but I'm just tweaking things blindly and hoping it works for my specific data. Something more generic would be great.

Recommended answer

You'll need not one, but several nice algorithms, along the lines of the following.

  • ignoring pronouns is done via a stoplist.
  • preserving proper nouns? You mean detecting named entities, like Hoover Dam, and saying "it's one word", or compound nouns, like programming language? I'll give you a hint: that's a tough one, but there exist libraries for both. Look for NER (named entity recognition) and lexical chunking. OpenNLP is a Java toolkit that does both.
  • ignoring hyphenation? You mean, like at line breaks? Use regular expressions and verify the resulting word via dictionary lookup.
  • handling plurals/stemming: you can look into the Snowball stemmer. It does the trick nicely.
  • "grouping" adjectives with their nouns is generally a task of shallow parsing. But if you are looking specifically for qualitative adjectives (good, bad, shitty, amazing...) you may be interested in sentiment analysis. LingPipe does this, and a lot more.
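
The stoplist and line-break hyphenation points above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the tiny `STOPLIST`, the caller-supplied `dictionary` set, and the function names are all made up for the example and do not come from any of the toolkits mentioned. Lowercasing every token is also a simplification that throws away the capitalization you would need for proper nouns.

```python
import re
from collections import Counter

# Tiny illustrative stoplist (an assumption); real stoplists hold hundreds of entries.
STOPLIST = {"a", "an", "the", "he", "him", "she", "her", "it", "they", "them",
            "i", "we", "you", "is", "was", "and", "of", "to", "in"}


def join_line_break_hyphens(text, dictionary):
    """Rejoin words split by a hyphen at a line break.

    The joined form is kept only if it appears in `dictionary`;
    otherwise the hyphen is preserved and just the line break is dropped.
    """
    def repl(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in dictionary:
            return joined
        return match.group(1) + "-" + match.group(2)

    return re.sub(r"(\w+)-\s*\n\s*(\w+)", repl, text)


def word_frequencies(text, dictionary=frozenset()):
    """Count lowercased tokens that are not on the stoplist."""
    text = join_line_break_hyphens(text, dictionary)
    # A token is a run of letters, optionally with inner hyphens or apostrophes.
    tokens = re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", text)
    return Counter(t.lower() for t in tokens if t.lower() not in STOPLIST)


comments = ("The service was great. The ser-\nvice was quick, "
            "and the well-\nknown chef was great.")
freqs = word_frequencies(comments, dictionary={"service"})
print(freqs.most_common(3))
```

Here "ser-" plus "vice" is rejoined because "service" is in the dictionary, while "well-known" keeps its hyphen because "wellknown" is not.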

I'm sorry, I know you said you wanted to KISS, but unfortunately, your demands aren't that easy to meet. Nevertheless, there exist tools for all of this, and you should be able to just tie them together without having to implement any of these tasks yourself, if you don't want to. If you do want to implement one yourself, I suggest you look at stemming; it's the easiest of all.
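
To give a feel for why stemming is the approachable one, here is a deliberately naive suffix stripper. It is not the Snowball or Porter algorithm, just a hypothetical sketch: the suffix list, the minimum stem length of three, and the name `naive_stem` are assumptions chosen for the example, and it will over-stem plenty of real words (for instance "class" becomes "clas").

```python
from collections import Counter

# Suffixes tried in order, longest first (illustrative choice, not a real stemmer's rule set).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common English suffix, then a trailing 'e', keeping at least 3 letters."""
    word = word.lower()
    stem = word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            break
    # Drop a final 'e' so that "like" and "liking" end up with the same stem.
    if stem.endswith("e") and len(stem) > 3:
        stem = stem[:-1]
    return stem


words = ["like", "likes", "liked", "liking", "service", "services"]
counts = Counter(naive_stem(w) for w in words)
print(counts)
```

All four forms of "like" collapse onto one key, which is the behaviour asked for in the question. A real stemmer handles far more cases (doubled consonants, "-ies", "-ational" and so on), which is the reason to reach for an existing one once the idea is clear.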

If you go with Java, combine Lucene with the OpenNLP toolkit. You will get very good results, as Lucene already has a stemmer built in and a lot of tutorials. The OpenNLP toolkit, on the other hand, is poorly documented, but you won't need too much out of it. You might also be interested in NLTK, written in Python.

I would say you should drop your last requirement, as it involves shallow parsing and will definitely not improve your results.

Ah, by the way: the exact term for that document-term-frequency thing you were looking for is tf-idf. It's pretty much the best way to look for document frequency for terms. To do it properly, you won't get around using multidimensional vector matrices.
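
For orientation, one common textbook variant of tf-idf weights a term by its relative frequency within a document times the logarithm of (number of documents / number of documents containing the term). The sketch below assumes that variant; libraries differ in smoothing and normalization, and the function name `tf_idf` and the toy comments are made up for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute tf-idf weights.

    `documents` is a list of token lists. Returns one {term: weight} dict
    per document, using tf = count / document length and
    idf = log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["great", "service"], ["slow", "service"], ["great", "food"]]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, w)
```

In the second comment, "slow" gets a higher weight than "service" because "service" also appears in another comment. Each per-document dict is one row of a sparse term-document matrix, which is where the vector view comes in.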

... Yes, I know. After taking a seminar on IR, my respect for Google was even greater. After doing some stuff in IR, my respect for them fell just as quickly, though.
