sphinx-4 aligner skips plain words like `you`, `in` and words with dashes - why?


Question

I'm trying to align simple text. Here are the links to text and audio files:
http://s000.tinyupload.com/?file_id=48044768133759453374
http://s000.tinyupload.com/?file_id=99891199139563396901

Here are the configuration settings:

private static final String ACOUSTIC_MODEL_PATH =
        "resource:/edu/cmu/sphinx/models/en-us/en-us";
private static final String DICTIONARY_PATH =
        "resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict";

The output I get is the following (the ellipses were added by me):

- ï
- ¿in
  a                         [11250:11330]
  standard                  [11330:11920]
  shopping                  [11920:12440]
  centre                    [12440:13020]
- you
  can                       [13380:13730]
  ...
  shops                     [15170:15790]
- you
  can                       [16620:16890]
  buy                       [16890:17140]
  ...
  and                       [26920:27230]
  suits                     [27190:27220]
- there’s
  a                         [29160:29210]
  sportswear                [29210:29980]
  ...
  clothes                   [33330:33360]
- t-shirts
  shorts                    [35560:36320]
  jumpers                   [36630:37410]
  ...
  for                       [41860:42010]

As you can see, for some reason it:

  • didn't recognize `in` before the first `a`
  • gave no timing for multiple instances of `you`
  • didn't recognize `there's`; instead it identified it as `there’s`
  • gave no timing for words with dashes, like `t-shirts`

Is there any way I can configure sphinx to provide timings for these occurrences?

Recommended answer

A few comments:

didn't recognize in before the first a

Your text file starts with a BOM (byte order mark), which is unknown to the aligner. It is better to remove it before alignment.
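
A minimal sketch of stripping the BOM in Java, assuming the transcript has already been decoded as UTF-8, so the BOM shows up as the single character U+FEFF at the start of the string. The names `BomStripper` and `stripBom` are hypothetical helpers for illustration and are not part of sphinx-4.

```java
import java.nio.charset.StandardCharsets;

final class BomStripper {
    // U+FEFF is what a UTF-8 BOM (bytes EF BB BF) decodes to.
    private static final char BOM = '\uFEFF';

    // Returns the text without a leading BOM; other text is returned unchanged.
    static String stripBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // Simulate a transcript file that begins with a UTF-8 BOM.
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'i', 'n', ' ', 'a'};
        String decoded = new String(raw, StandardCharsets.UTF_8);
        System.out.println(stripBom(decoded)); // prints: in a
    }
}
```

If the file is decoded with a single-byte charset such as ISO-8859-1 instead, the three BOM bytes turn into the characters ï»¿, which may be what the `ï` and `¿in` entries in the output reflect. In that case this check would not match, so decoding as UTF-8 first is assumed here.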

didn't recognize there's, instead it identified it as there’s

Your text uses non-ASCII (typographic) apostrophes, which are unknown to the aligner. You should convert them to their ASCII equivalent.
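
One way to do that conversion, sketched in Java. It assumes the only offending characters are the left and right single quotation marks (U+2018 and U+2019); other typographic punctuation would need the same treatment. `ApostropheNormalizer` and `normalize` are hypothetical names, not sphinx-4 API.

```java
final class ApostropheNormalizer {
    // Replaces typographic single quotes with the plain ASCII apostrophe.
    static String normalize(String text) {
        return text
                .replace('\u2019', '\'')  // right single quotation mark
                .replace('\u2018', '\''); // left single quotation mark
    }

    public static void main(String[] args) {
        String line = "there\u2019s a sportswear shop";
        System.out.println(normalize(line)); // prints: there's a sportswear shop
    }
}
```

Running the transcript through such a step before alignment means the aligner sees `there's` rather than the typographic variant.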

no timing for words with dashes, like t-shirts

Those words are missing from the dictionary. You can add them to the dictionary before alignment, or specify a g2p (grapheme-to-phoneme) model to convert them to phonemes.
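
To find out which words need adding, the transcript can be checked against the dictionary's word list before alignment. The following is only a sketch under stated assumptions: `dictionaryWords` is assumed to hold the lowercase words from the first column of the dictionary file, and the tokenization (split on whitespace, trim punctuation at both ends) is a simplification that may differ from what the aligner does internally. `MissingWordFinder` and `findMissing` are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class MissingWordFinder {
    // Returns transcript words that are absent from the dictionary, in order of
    // first appearance.
    static Set<String> findMissing(String transcript, Set<String> dictionaryWords) {
        Set<String> missing = new LinkedHashSet<>();
        for (String token : transcript.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Trim punctuation at both ends; keep inner hyphens and apostrophes.
            String word = token.replaceAll("^[^a-z0-9']+|[^a-z0-9']+$", "");
            if (!word.isEmpty() && !dictionaryWords.contains(word)) {
                missing.add(word);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Tiny stand-in for the real dictionary's word list.
        Set<String> dictionaryWords = Set.of("shorts", "jumpers", "for");
        System.out.println(findMissing("T-shirts, shorts, jumpers", dictionaryWords));
        // prints: [t-shirts]
    }
}
```

Each word reported this way would then need its own dictionary entry (the word followed by its phone sequence), or a g2p model to generate one.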
