Phonetic search for Indian languages

Question

I want to compare strings phonetically in my Android app. The special case here is that I want to compare Indian-language words written in English script. For example, I want to check whether "Edhu", "Adhu", and "Yethu" are phonetically equal; they all mean the same thing in Tamil. People who write Indian languages in English script spell the same word in different ways. How do I compare words in this case?

I tried Levenshtein distance, but I am not sure how to convert the number it returns into an equal/not-equal decision.
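One common way to turn an edit distance into a yes/no answer is to normalize it by the length of the longer word and compare the result against a cut-off. The sketch below is only an illustration: the class name `LevenshteinSimilarity`, the method names, and any particular threshold are assumptions rather than part of a specific library, and a suitable cut-off would have to be tuned on real data.

```java
import java.util.Locale;

public class LevenshteinSimilarity {

    /** Classic dynamic-programming Levenshtein distance, keeping two rows. */
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1.0 means identical, 0.0 means nothing in common. */
    public static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are identical
        }
        return 1.0 - (double) distance(a, b) / longest;
    }

    /** Hypothetical helper: the threshold is an assumption to be tuned. */
    public static boolean roughlyEqual(String a, String b, double threshold) {
        return similarity(a.toLowerCase(Locale.ROOT), b.toLowerCase(Locale.ROOT)) >= threshold;
    }
}
```

With this sketch, "Edhu" and "Adhu" are one edit apart over four letters, giving a similarity of 0.75. Note that edit distance works on spelling rather than sound, so a phonetic encoding may still group such variants more reliably.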

I also tried Soundex. The Soundex codes are not the same when the first letter of the word changes, but it does pick up the similar-sounding parts. I don't understand how it works.

    soundex.encode("Yethu")  ->  Y300
    soundex.encode("Edhu")   ->  E300
    soundex.encode("adhu")   ->  A300

Recommended answer

As I understand it, you want to take words written in English script, reduce them to a phonetic form, and then group together words that are spelled differently but have the same phonetic representation.

For this, SoundEx is a 90% solution, provided that the people spelling the words in English actually use the correct consonants when they transliterate the words from Tamil.

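You should be able to just drop the first character from the SoundEx representation and use the rest as your encoding when the first letter is a vowel. For "Yethu" to match as well, a leading Y has to be treated as a vowel too, which is consistent with SoundEx discarding Y elsewhere in the word.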

The reason is that SoundEx ( https://en.wikipedia.org/wiki/Soundex ) encodes only the consonants in the words it is given. It throws away all the vowels, plus h, w, and y, unless the letter is the first one in the word, which is always kept as-is. That explains why your values all differ slightly, but only in the first character.

As for your zeros: a SoundEx code is by definition one letter followed by three digits. Consonants map to the digits 1 through 6, and 0 appears only as padding. After the first letter, each of your words has only one coded consonant (d or t; the h is ignored), and SoundEx maps both d and t to 3. Since there are no more consonants, it pads the code with two zeros to reach the fixed length. Thus you get <letter>300.
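The encoding and the "drop the leading vowel" idea described above can be sketched as follows. This is a simplified American Soundex written out by hand for illustration, not the source of any particular library's `soundex.encode`; the class name `PhoneticKey` and the method `vowelInsensitiveKey` are hypothetical, and treating a leading Y as a vowel is an assumption made so that "Yethu" groups with the other two spellings.

```java
public class PhoneticKey {

    // Soundex digit for each letter a..z; '0' marks letters that are not
    // coded at all (the vowels, plus y, h and w).
    private static final String CODES = "01230120022455012623010202";

    /** Simplified American Soundex: first letter plus three digits, zero-padded. */
    public static String soundex(String word) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (char raw : word.toCharArray()) {
            char c = Character.toLowerCase(raw);
            if (c < 'a' || c > 'z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(c - 'a');
            if (out.length() == 0) {
                out.append(Character.toUpperCase(c)); // first letter is kept as-is
            } else if (code != '0' && code != lastCode) {
                out.append(code); // new consonant sound
            }
            // h and w do not separate consonants with the same code; vowels do.
            if (c != 'h' && c != 'w') {
                lastCode = code;
            }
            if (out.length() == 4) {
                break;
            }
        }
        if (out.length() == 0) {
            return ""; // no letters at all
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes, e.g. "E3" becomes "E300"
        }
        return out.toString();
    }

    /**
     * Hypothetical grouping key: the Soundex code with its first character
     * dropped when that character is a vowel (Y is counted as one here, which
     * is an assumption).
     */
    public static String vowelInsensitiveKey(String word) {
        String code = soundex(word);
        if (code.isEmpty()) {
            return code;
        }
        boolean leadingVowel = "AEIOUY".indexOf(code.charAt(0)) >= 0;
        return leadingVowel ? code.substring(1) : code;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"Edhu", "Adhu", "Yethu"}) {
            System.out.println(w + " -> " + soundex(w) + " -> " + vowelInsensitiveKey(w));
        }
    }
}
```

Under these assumptions all three spellings end up with the same key, "300", so they could be grouped by comparing keys for equality. Words that begin with a consonant keep their full four-character code.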

If you are going to continue to use SoundEx for your app, you should bear in mind that it can only give you a few thousand distinct encodings: 26*6*6*6 = 5616 under its Letter Number(1-6) Number(1-6) Number(1-6) scheme, plus a comparatively small number of zero-padded codes such as the ones you are seeing. This means that the phonetic encodings will not be unique, and some words that are radically different will have SoundEx encodings that collide.
