How can I find only 'interesting' words from a corpus?

Question

I am parsing sentences. I want to know the relevant content of each sentence, defined loosely as "semi-unique words" in relation to the rest of the corpus. Something similar to Amazon's "statistically improbable phrases", which seem to (often) convey the character of a book through oddball strings of words.

My first pass was to start making a common words list. This knocks out the easy ones like a, the, from, etc. Obviously, it turns out that this list gets quite long.

One idea is to generate this list: make a histogram of the corpus's word frequencies and lop off the top 10% or something similar (e.g. "the" occurs 700 times and "from" 600 times, but "micropayments" only 50, which is under the cutoff and therefore relevant).
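For reference, the histogram cutoff can be sketched in a few lines of Python. This is only a rough illustration: the `tokenize` and `rare_words` names, the regex tokenizer, and the reading of "top 10%" as the most frequent 10% of distinct words are all assumptions, not something fixed by the question.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def rare_words(corpus, top_fraction=0.10):
    """Return the vocabulary left after dropping the most frequent words.

    `top_fraction` is the share of distinct words (ranked by count) to discard.
    """
    counts = Counter(tokenize(corpus))
    # most_common() ranks words from most to least frequent.
    ranked = [word for word, _ in counts.most_common()]
    cutoff = int(len(ranked) * top_fraction)
    return set(ranked[cutoff:])

corpus = "the the the the from from from a a b c d e f g micropayments"
kept = rare_words(corpus, top_fraction=0.2)
print(sorted(kept))
```

A fixed percentage is a blunt instrument: on a small corpus it may drop too little, and on a large one it may drop words that are frequent because they are the topic.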

Another algorithm I just learned about from Hacker News today is tf-idf, which looks like it could be helpful.
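As a rough sketch of how tf-idf might apply here, the snippet below treats each sentence as a "document" and scores every word by its term frequency times the log of the inverse document frequency. The function names and the tokenizer are assumptions for illustration, and this is the plain textbook weighting; libraries often use smoothed variants.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude, assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(sentences):
    """Return one dict per sentence mapping each word to its tf-idf score."""
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)
    # Document frequency: the number of sentences that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return scores

sentences = ["the cat sat", "the dog ran", "the micropayments system"]
for sentence, score in zip(sentences, tf_idf(sentences)):
    # Print the highest-scoring word of each sentence.
    print(sentence, "->", max(score, key=score.get))
```

A word that appears in every sentence, such as "the" above, gets a score of zero, so no hand-maintained common-words list is needed.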

What other approaches would work better than my two ideas?

Recommended answer

Take a look at this article (Level statistics of words: Finding keywords in literary texts and symbolic sequences, published in Phys. Rev. E).

The picture on the first page, together with its caption, explains the crucial observation. In Don Quixote, the words "but" and "Quixote" appear with similar frequencies, but their spectra are quite different (occurrences of "Quixote" are clustered while occurrences of "but" are more evenly spaced). Therefore, "Quixote" can be classified as an interesting word (keyword) while "but" is ignored.
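To make the clustering idea concrete, here is a simplified sketch, not the article's exact statistic: it measures the gaps between consecutive occurrences of a word and divides their standard deviation by their mean. The `clustering_score` name and the "fewer than three occurrences" cutoff are assumptions made for this example.

```python
import statistics

def clustering_score(tokens, word):
    """Standard deviation of the gaps between occurrences of `word`, divided by their mean.

    Evenly spaced words score near 0; words that arrive in bursts score higher.
    Returns None when the word occurs fewer than three times (too few gaps to compare).
    """
    positions = [i for i, token in enumerate(tokens) if token == word]
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if len(gaps) < 2:
        return None
    return statistics.pstdev(gaps) / statistics.mean(gaps)

# "but" appears at regular intervals; "quixote" appears in one burst plus a stray mention.
even = ["but", "x", "x", "x"] * 5
bursty = ["quixote"] * 4 + ["x"] * 46 + ["quixote"]
print(clustering_score(even, "but"))
print(clustering_score(bursty, "quixote"))
```

Because the score depends on where the word occurs and not on how often, it can separate two words with similar frequencies, which a frequency cutoff cannot do.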

It might or might not be what you're looking for, but I guess it won't hurt to be familiar with this result.
