Phonetic search for Indian languages

Question

I want to compare strings phonetically in my Android app. The special case here is that I want to compare Indian-language words written in English (Latin) script. For example, I want to check whether "Edhu", "Adhu" and "Yethu" are phonetically equal; they all mean the same thing in Tamil. People who write Indian languages in English script spell the same word in different ways. How do I compare words in this case?

I tried Levenshtein distance, but I am not sure how to turn the number it returns into an equal/not-equal decision.

I also tried Soundex. The Soundex codes differ when the first letter of the word changes, but the rest of the code does pick up the similar-sounding part. I don't understand how it works.

    soundex.encode("Yethu")  ->  Y300
    soundex.encode("Edhu")   ->  E300
    soundex.encode("adhu")   ->  A300

Recommended answer

As I understand it, you want to take words written in English script, decompose them phonetically, and then group together words that are spelled differently but have the same phonetic representation.

For this, Soundex is roughly a 90% solution, provided that the people spelling the words in English script actually use the correct consonants when they transliterate the words from Tamil.

When the first letter is a vowel, you should be able to simply drop the first character of the Soundex code and use the remainder as your encoding.
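The suggestion above could be sketched as follows. This is only an illustration, not a definitive implementation: `PhoneticKey`, `fromSoundex` and `soundAlike` are hypothetical names, and the input is assumed to be a four-character code produced by whatever Soundex encoder you already use. One assumption goes beyond the answer: the sketch also drops a leading Y, H or W, because Soundex ignores those letters everywhere else in a word, and without that "Yethu" (Y300) would not match "Edhu" (E300).

```java
import java.util.Locale;

final class PhoneticKey {
    // Letters Soundex ignores except in first position.
    // Including Y, H and W here is an assumption of this sketch;
    // the answer itself only mentions vowels.
    private static final String IGNORED_LEADING = "AEIOUYHW";

    /** Turns a Soundex code such as "E300" into a comparison key. */
    static String fromSoundex(String soundexCode) {
        if (soundexCode == null || soundexCode.isEmpty()) {
            return "";
        }
        String code = soundexCode.toUpperCase(Locale.ROOT);
        char first = code.charAt(0);
        // Drop the leading letter only when it carries no consonant sound.
        return IGNORED_LEADING.indexOf(first) >= 0 ? code.substring(1) : code;
    }

    /** True when two Soundex codes reduce to the same key. */
    static boolean soundAlike(String codeA, String codeB) {
        return fromSoundex(codeA).equals(fromSoundex(codeB));
    }
}
```

With the codes from the question, `PhoneticKey.soundAlike("Y300", "E300")` and `PhoneticKey.soundAlike("E300", "A300")` both return true, while a consonant-initial code such as "T300" keeps its letter and stays distinct.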

The reason is that Soundex (https://en.wikipedia.org/wiki/Soundex) encodes only the consonants of the word it is given. It throws away all the vowels, plus h, w and y, unless the letter is the first letter of the word, which is kept as-is. That explains why your values all differ, but only in the first character of the code.

As for the zeros: a Soundex code is by definition one letter followed by three digits, and consonants map to the digits 1 through 6. Each of your words has only one encoded consonant (d or t; the h is ignored), and Soundex maps both d and t to 3. Since there are no more consonants, the code is padded with two zeros to reach the fixed length, so you get <letter>300.
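To make the encoding steps concrete, here is a minimal sketch of American Soundex written from the rules on the Wikipedia page linked above. It is not the code of the library used in the question; `SimpleSoundex` is a hypothetical name, and the sketch simply skips any character outside A-Z.

```java
import java.util.Locale;

final class SimpleSoundex {
    // Digit for each letter A..Z; '0' marks letters that carry no code
    // (the vowels plus H, W and Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String word) {
        String s = word.toUpperCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        char last = '0'; // code of the previous significant letter
        for (int i = 0; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not a Latin letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept verbatim
                last = code;
            } else if (c == 'H' || c == 'W') {
                // H and W are dropped and do not separate equal codes
            } else if (code == '0') {
                last = '0'; // a vowel is dropped but separates equal codes
            } else {
                if (code != last) {
                    out.append(code); // runs of the same code count once
                }
                last = code;
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad to one letter plus three digits
        }
        return out.toString();
    }
}
```

For "Yethu", the Y is kept, the e is dropped, t becomes 3, h and u are dropped, and two zeros are appended, which gives Y300. The same steps give E300 for "Edhu" and A300 for "adhu".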

If you are going to keep using Soundex in your app, bear in mind that its letter + digit (1-6) + digit (1-6) + digit (1-6) scheme allows only 26*6*6*6 = 5616 distinct codes with three non-zero digits, plus a smaller number of zero-padded codes such as A300. The phonetic encodings are therefore not unique, and some radically different words will have Soundex codes that collide.
