How to create text-to-speech with a neural network

Problem description

I am creating a Text to Speech system for a phonetic language called "Kannada" and I plan to train it with a Neural Network. The input is a word/phrase while the output is the corresponding audio.

While implementing the network, I was thinking the input should be the segmented characters of the word/phrase, since the output pronunciation only depends on the characters that make up the word, unlike English, where we have silent words and part of speech to consider. However, I do not know how I should train the output.
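
To make that input concrete, the characters can be mapped to integer ids before being fed to a network. Here is a minimal sketch of such an encoding; build_vocab and encode_word are made-up helper names and the example words are arbitrary:

def build_vocab(words):
    # Map every distinct character in the corpus to an integer id (0 reserved for padding).
    chars = sorted({ch for word in words for ch in word})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode_word(word, char_to_id):
    # Turn a word/phrase into the sequence of character ids the network sees.
    return [char_to_id[ch] for ch in word]

char_to_id = build_vocab(["ನಮಸ್ಕಾರ", "ಕನ್ನಡ"])   # two example Kannada words
print(encode_word("ಕನ್ನಡ", char_to_id))          # [1, 3, 8, 3, 2] with this tiny vocabulary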

Since my dataset is a collection of words/phrases and the corresponding MP3 files, I thought of converting these files to WAV using pydub for all audio files.

from pydub import AudioSegment
sound = AudioSegment.from_mp3("audio/file1.mp3")
sound.export("wav/file1.wav", format="wav")
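
Since the dataset has many recordings, the same conversion can be wrapped in a loop. A possible batch version, assuming the MP3s sit in an audio/ directory and the WAVs should land in wav/ as in the snippet above (the loop itself is just a sketch):

import os
from pydub import AudioSegment

os.makedirs("wav", exist_ok=True)
for name in os.listdir("audio"):
    if name.endswith(".mp3"):
        sound = AudioSegment.from_mp3(os.path.join("audio", name))
        sound.export(os.path.join("wav", name[:-4] + ".wav"), format="wav")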

Next, I open the wav file and convert it to a normalized byte array with values between 0 and 1.

import numpy as np
import wave

f = wave.open('wav/kn3.wav', 'rb')
frames = f.readframes(-1)
f.close()

# Raw samples as integers in the range [0, 255] (assumes an 8-bit WAV;
# np.fromstring is deprecated, so np.frombuffer is used to read the bytes)
data = np.frombuffer(frames, dtype='uint8')

# Normalized samples of the wav, scaled to [0, 1]
arr = data / 255

How should I train this?

From here, I am not sure how to train this with the input text. I would need a variable number of input and output neurons in the first and last layers, because the number of characters (first layer) and the number of bytes in the corresponding wave (last layer) change for every input.

Since RNNs deal with such variable data, I thought they would come in handy here.

Correct me if I am wrong, but the outputs of neural networks are actually probability values between 0 and 1. However, we are not dealing with a classification problem. The audio can be anything, right? In my case, the "output" should be a vector of bytes corresponding to the WAV file. So there would be around 40,000 of these, with values between 0 and 255 (without the normalization step), for every word. How do I train this speech data? Any suggestions are appreciated.
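
To make that point concrete: if the last layer is a plain linear layer trained with a mean-squared-error loss, the network regresses real-valued samples rather than class probabilities. Below is a minimal encoder-decoder sketch of that idea in PyTorch; every name, size, and the fixed frame length are assumptions made only to show the shapes involved, and this is nowhere near a workable TTS architecture.

import torch
import torch.nn as nn

FRAME = 200          # samples per output frame (assumption)
VOCAB = 60           # number of distinct characters + padding (assumption)

class Char2Wave(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 64, padding_idx=0)
        self.encoder = nn.GRU(64, hidden, batch_first=True)
        self.decoder = nn.GRU(FRAME, hidden, batch_first=True)
        self.to_frame = nn.Linear(hidden, FRAME)   # regression head, not probabilities

    def forward(self, char_ids, prev_frames):
        # char_ids: (batch, n_chars), prev_frames: (batch, n_frames, FRAME)
        _, state = self.encoder(self.embed(char_ids))   # summarize the text
        out, _ = self.decoder(prev_frames, state)       # condition the audio decoder on it
        return self.to_frame(out)                       # (batch, n_frames, FRAME)

model = Char2Wave()
chars = torch.randint(1, VOCAB, (2, 10))        # two dummy "words"
frames = torch.rand(2, 40, FRAME)               # dummy normalized target audio, 40 frames each
pred = model(chars, frames[:, :-1])             # teacher forcing: predict the next frame
loss = nn.MSELoss()(pred, frames[:, 1:])        # regression loss, not cross-entropy
loss.backward()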

EDIT 1: In response to Aaron's comment

From what I understand, phonemes are the basic sounds of the language. So why do I need a neural network to map phoneme labels to speech? Can't I just say, "whenever you see this alphabet, pronounce it like this"? After all, this language, Kannada, is phonetic: there are no silent words. All words are pronounced the same way they are spelled. How would a neural network help here, then?

On input of a new text, I just need to break it down into the corresponding alphabets (which are also the phonemes) and retrieve its file (converted from WAV to raw byte data). Then merge the bytes together and convert them to a WAV file.
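
A minimal sketch of that lookup-and-concatenate idea with pydub, assuming one pre-recorded WAV per character stored as units/<char>.wav (the directory layout and file naming are assumptions, not part of the question):

from pydub import AudioSegment

def synthesize(word, unit_dir="units"):
    # Concatenate the pre-recorded clip of each character into one segment.
    out = AudioSegment.empty()
    for ch in word:
        out += AudioSegment.from_wav(f"{unit_dir}/{ch}.wav")
    return out

synthesize("ಕನ್ನಡ").export("word.wav", format="wav")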

Is this too simplistic? Am I missing something here? What would be the point of a neural network for this particular language (Kannada)?

Answer

It is not trivial and requires a special architecture. You can read a description of it in publications from DeepMind and Baidu.

You might also want to look into existing wavenet training implementations.

Overall, pure end-to-end speech synthesis is still not working. If you are serious about text-to-speech, it is better to study conventional systems like merlin.
