Get phonemes from any word in Python NLTK or other modules?


Problem description

NLTK for Python includes cmudict, which returns the phonemes of words it recognizes. For example, 'see' -> ['S', 'IY1']. For words it does not recognize, the lookup raises an error. For example, 'seasee' -> error.

import nltk

arpabet = nltk.corpus.cmudict.dict()

for word in ('s', 'see', 'sea', 'compute', 'comput', 'seesea'):
    try:
        # Print the first listed pronunciation of the word.
        print(arpabet[word][0])
    except KeyError as e:
        # The word is not in cmudict.
        print(e)

# Output
['EH1', 'S']
['S', 'IY1']
['S', 'IY1']
['K', 'AH0', 'M', 'P', 'Y', 'UW1', 'T']
'comput'
'seesea'

Is there any module that doesn't have that limitation and can find or guess the phonemes of any real or made-up word?

If there is none, is there a way I can program it myself? I am thinking of looping over increasingly long prefixes of the word. For example, for 'seasee' the first iteration takes 's', the next takes 'se', the third takes 'sea', and so on, looking each one up in cmudict. The problem is that I don't know how to decide which match is the right one to use. For example, both 's' and 'sea' in 'seasee' yield valid phonemes.

Work in progress:

import nltk

arpabet = nltk.corpus.cmudict.dict()

for word in ('s', 'see', 'sea', 'compute', 'comput', 'seesea', 'darfasasawwa'):
    try:
        phone = arpabet[word][0]
    except KeyError:
        # Unknown word: look up every prefix of increasing length.
        for end in range(1, len(word) + 1):
            substring = word[:end]
            try:
                print(substring, arpabet[substring][0])
            except KeyError as e:
                # This prefix is not in cmudict either.
                print(substring, e)

# Output
c ['S', 'IY1']
co ['K', 'OW1']
com ['K', 'AA1', 'M']
comp ['K', 'AA1', 'M', 'P']
compu ['K', 'AA1', 'M', 'P', 'Y', 'UW0']
comput 'comput'
s ['EH1', 'S']
se ['S', 'AW2', 'TH', 'IY1', 'S', 'T']
see ['S', 'IY1']
sees ['S', 'IY1', 'Z']
seese ['S', 'IY1', 'Z']
seesea 'seesea'
d ['D', 'IY1']
da ['D', 'AA1']
dar ['D', 'AA1', 'R']
darf 'darf'
darfa 'darfa'
darfas 'darfas'
darfasa 'darfasa'
darfasas 'darfasas'
darfasasa 'darfasasa'
darfasasaw 'darfasasaw'
darfasasaww 'darfasasaww'
darfasasawwa 'darfasasawwa'

Recommended answer

I encountered the same issue, and I solved it by recursively partitioning unknown words into known parts (see wordbreak):

import nltk
from functools import lru_cache
from itertools import product as iterprod

try:
    arpabet = nltk.corpus.cmudict.dict()
except LookupError:
    nltk.download('cmudict')
    arpabet = nltk.corpus.cmudict.dict()

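@lru_cache()
def wordbreak(s):
    """Return a list of ARPAbet pronunciations for s, or None if no split works."""
    s = s.lower()
    if s in arpabet:
        return arpabet[s]
    middle = len(s) / 2
    # Try split points near the middle first; the -x term shifts the
    # preference slightly toward longer prefixes.
    partition = sorted(range(len(s)), key=lambda x: (x - middle) ** 2 - x)
    for i in partition:
        pre, suf = s[:i], s[i:]
        if pre in arpabet:
            rest = wordbreak(suf)
            if rest is not None:
                # Combine each pronunciation of the prefix with each one of the suffix.
                return [x + y for x, y in iterprod(arpabet[pre], rest)]
    return None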
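
To see what the recursive partition does without downloading cmudict, here is a minimal sketch that runs the same logic against a tiny hand-written dictionary. TOY_ARPABET and toy_wordbreak are hypothetical names used only for illustration, and the toy entries are a stand-in for the real nltk.corpus.cmudict.dict() data, so results for real words will differ.

```python
from functools import lru_cache
from itertools import product

# Hypothetical stand-in for nltk.corpus.cmudict.dict():
# each word maps to a list of pronunciations (lists of ARPAbet symbols).
TOY_ARPABET = {
    "s": [["EH1", "S"]],
    "see": [["S", "IY1"]],
    "sea": [["S", "IY1"]],
    "a": [["AH0"], ["EY1"]],
}


@lru_cache(maxsize=None)
def toy_wordbreak(s):
    """Same recursive split as wordbreak, but against TOY_ARPABET."""
    s = s.lower()
    if s in TOY_ARPABET:
        return TOY_ARPABET[s]
    middle = len(s) / 2
    # Split points near the middle are tried first.
    for i in sorted(range(len(s)), key=lambda x: (x - middle) ** 2 - x):
        pre, suf = s[:i], s[i:]
        if pre in TOY_ARPABET:
            rest = toy_wordbreak(suf)
            if rest is not None:
                # Cartesian product of prefix and suffix pronunciations.
                return [x + y for x, y in product(TOY_ARPABET[pre], rest)]
    return None


print(toy_wordbreak("seesea"))  # 'see' + 'sea'
print(toy_wordbreak("seea"))    # 'see' + both pronunciations of 'a'
print(toy_wordbreak("xyz"))     # no known prefix, so None
```

Because the split is tried near the middle first, 'seesea' resolves to 'see' + 'sea' rather than starting from the single letter 's', which is one way around the ambiguity raised in the question. A word whose pieces are all missing from the dictionary still returns None, so callers should be prepared to handle that case.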
