How to compute letter frequency similarity?

Question

Given this data (relative letter frequency from both languages):

spanish => 'e' => 13.72, 'a' => 11.72, 'o' => 8.44, 's' => 7.20, 'n' => 6.83,
english => 'e' => 12.60, 't' => 9.37, 'a' => 8.34, 'o' => 7.70, 'n' => 6.80,

And then computing the letter frequency for the string "this is a test" gives me:

"t"=>21.43, "s"=>14.29, "i"=>7.14, "r"=>7.14, "y"=>7.14, "'"=>7.14, "h"=>7.14, "e"=>7.14, "l"=>7.14

So, what would be a good approach for matching the letter frequencies of a given string against a language (and so trying to detect the language)? I've seen (and tested) some examples that use Levenshtein distance, and it seems to work fine until you add more languages.

"this is a test" gives (shortest distance:) [:english, 13] ...
"esto es una prueba" gives (shortest distance:) [:spanish, 13] ...

Recommended answer

Have you considered using cosine similarity to determine the amount of similarity between two vectors?

The first vector would be the letter frequencies extracted from the test string (to be classified), and the second vector would be for a specific language.
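As a minimal sketch of that comparison (not a definitive implementation), the following computes the letter frequencies of a test string and ranks the language profiles by cosine similarity. The profiles contain only the five letters listed in the question, so they are truncated; a real profile would cover the whole alphabet. The function names `letter_frequencies`, `cosine_similarity` and `rank_languages` are hypothetical helpers introduced for illustration.

```python
import math
from collections import Counter

# Truncated profiles: only the five letters given in the question.
# A real profile would list every letter of the alphabet.
PROFILES = {
    "spanish": {"e": 13.72, "a": 11.72, "o": 8.44, "s": 7.20, "n": 6.83},
    "english": {"e": 12.60, "t": 9.37, "a": 8.34, "o": 7.70, "n": 6.80},
}

def letter_frequencies(text):
    """Return relative letter frequencies (in percent), ignoring non-letters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    return {ch: 100.0 * n / len(letters) for ch, n in counts.items()}

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_languages(text, profiles=PROFILES):
    """Return (language, similarity) pairs, most similar first."""
    freqs = letter_frequencies(text)
    scores = {lang: cosine_similarity(freqs, prof) for lang, prof in profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(rank_languages("this is a test"))
print(rank_languages("esto es una prueba"))
```

Because cosine similarity depends only on the direction of the vectors, it does not matter whether the frequencies are expressed as percentages or as raw counts. With profiles this short the scores for the two languages are close, so full-alphabet profiles would be needed for anything beyond a demonstration.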

You're currently extracting single letter frequencies (unigrams). I would suggest extracting higher order n-grams, such as bigrams or trigrams (and even larger if you had enough training data). For example, for bigrams you would compute the frequencies of "aa", "ab", "ac" ... "zz", which will allow you to extract more information than if you were just considering single character frequencies.
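A short sketch of n-gram extraction, under the assumption that n-grams are taken within each word and do not span word boundaries (the answer does not say either way). `char_ngrams` and `ngram_frequencies` are hypothetical names used only for this illustration.

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Yield overlapping character n-grams taken within each word of text."""
    for word in text.lower().split():
        letters = "".join(ch for ch in word if ch.isalpha())
        for i in range(len(letters) - n + 1):
            yield letters[i:i + n]

def ngram_frequencies(text, n=2):
    """Return the relative frequency of each n-gram found in text."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}

bigrams = ngram_frequencies("this is a test", n=2)
for gram, freq in sorted(bigrams.items()):
    print(gram, round(freq, 3))
```

The resulting dictionaries can be compared in the same way as single-letter frequency vectors; only the set of keys changes.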

Be careful, though: higher-order n-grams need more training data, otherwise you will end up with many zero values for character combinations you haven't seen before.
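To make the sparsity concrete, here is a rough sketch that measures what share of all possible n-grams never occurs in a training text. It assumes a 26-letter lowercase alphabet and n-grams taken within words; `unseen_ngram_share` is a hypothetical helper, and the sample string stands in for a real training corpus.

```python
from string import ascii_lowercase

def unseen_ngram_share(training_text, n=2):
    """Return the fraction of all possible a-z n-grams absent from training_text."""
    seen = set()
    for word in training_text.lower().split():
        letters = "".join(ch for ch in word if ch in ascii_lowercase)
        seen.update(letters[i:i + n] for i in range(len(letters) - n + 1))
    possible = len(ascii_lowercase) ** n  # 26**n combinations
    return 1.0 - len(seen) / possible

sample = "this is a test"  # stand-in for a real training corpus
for n in (1, 2, 3):
    print(n, unseen_ngram_share(sample, n))
```

The number of possible n-grams grows as 26 to the power n, so for a fixed amount of training text the unseen share rises quickly with n.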

In addition, a second possibility is to use tf-idf (term-frequency inverse-document-frequency) weightings instead of pure letter (term) frequencies.
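One possible reading of this suggestion, sketched below, treats each language profile as a "document" and each letter (or n-gram) as a "term". The unsmoothed formula idf = log(N / df) is an assumption; other variants exist. The profiles are again the truncated five-letter lists from the question, and `idf_weights` and `tfidf` are hypothetical helper names.

```python
import math

# Truncated profiles from the question; a term counts as "present" in a
# language only if it appears in that language's list.
PROFILES = {
    "spanish": {"e": 13.72, "a": 11.72, "o": 8.44, "s": 7.20, "n": 6.83},
    "english": {"e": 12.60, "t": 9.37, "a": 8.34, "o": 7.70, "n": 6.80},
}

def idf_weights(profiles):
    """Return log(N / df) per term, where df counts the profiles containing it."""
    n_docs = len(profiles)
    doc_freq = {}
    for profile in profiles.values():
        for term in profile:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf(frequencies, idf):
    """Scale each term frequency by its idf weight (0 for unknown terms)."""
    return {term: tf * idf.get(term, 0.0) for term, tf in frequencies.items()}

idf = idf_weights(PROFILES)
for lang, profile in PROFILES.items():
    print(lang, tfidf(profile, idf))
```

In this toy setup the letters shared by both lists get a weight of zero, and only "s" and "t" remain as distinguishing features. With full-alphabet unigram profiles nearly every letter occurs in every language, so this weighting is likely to be more informative when applied to n-grams.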

Here is a good slideshow on language identification for (very) short texts, which uses machine learning classifiers (but also has some other good info).

Here is a short paper, "A Comparison of Language Identification Approaches on Short, Query-Style Texts", that you might also find useful.
