How do I get started training a custom voice model with Mozilla TTS on Ubuntu 20.04?


Problem Description

I'd like to create a custom voice in Mozilla TTS using audio samples I have recorded but am not sure how to get started. The Mozilla TTS project has documentation and tutorials, but I'm having trouble putting the pieces together -- it seems like there's some basic information missing that someone starting out needs to know to get going.

I have a few questions:

  1. I see that there is a Docker image for Mozilla TTS, but that the documentation for it covers creating speech and doesn't mention training. Can I use the Docker image for training?
  2. If I can't use the Docker image for training, how do I get a functional copy of Mozilla TTS running on my system with Python 3? I've tried following the commands that the project provides, but I get dependency errors, version conflicts, or errors about not having sufficient permission to install packages.
  3. What information do I need in order to train the model? What audio formats do I need? I see that I need a metadata.csv file -- what do I need to put in that file? What do I customize in the config file?
  4. Most of the configs reference a scale_stats.npy file -- how do I generate this?
  5. How do I run the training?

Answer

After a lot of research and experimentation, I can share my learnings to answer my own questions.

The Mozilla TTS docker image is really geared for playback and doesn't seem equipped to be used for training. At least, even when running a shell inside the container, I could not get training to work. But after figuring out what was causing PIP to be unhappy, the process of getting Mozilla TTS up and running in Ubuntu turns out to be pretty straightforward.

The documentation for Mozilla TTS doesn't mention anything about virtual environments, but IMHO it really should. Virtual environments ensure that dependencies for different Python-based applications on your machine don't conflict.

I'm running Ubuntu 20.04 on WSL, so Python 3 is already installed. Given that, from within my home folder, here are the commands I used to get a working copy of Mozilla TTS:

sudo apt-get install espeak

git clone https://github.com/mozilla/TTS mozilla-tts
python3 -m venv mozilla-tts

cd mozilla-tts
./bin/pip install -e .

This created a folder called ~/mozilla-tts in my home folder that contains the Mozilla TTS code. The folder is set up as a virtual environment, which means that as long as I execute Python commands via ~/mozilla-tts/bin/python and PIP via ~/mozilla-tts/bin/pip, Python will use only the packages that exist in that virtual environment. That eliminates the need to be root when running pip (since we're not affecting system-wide packages), and it ensures no package conflicts. Score!

For the best results when training a model, you will need:

  1. Short recordings (at least 100) that are:
    • In 16-bit, mono PCM WAV format.
    • Between 1 and 10 seconds each.
    • Recorded at a 22050 Hz sample rate.
    • With a minimum of background noise and distortion.
    • With no long pauses of silence at the beginning, middle, or end.
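
To double-check a clip against these constraints, here's a small sketch using only Python's standard-library wave module (the check_clip helper is my own, not part of Mozilla TTS):

```python
import wave

def check_clip(path):
    """Return a list of problems with a WAV clip, or [] if it meets the criteria."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 1:
            problems.append("not mono")
        if w.getsampwidth() != 2:  # 2 bytes per sample = 16-bit PCM
            problems.append("not 16-bit")
        if w.getframerate() != 22050:
            problems.append("sample rate is not 22050 Hz")
        duration = w.getnframes() / w.getframerate()
        if not (1.0 <= duration <= 10.0):
            problems.append("duration %.2fs is outside 1-10s" % duration)
    return problems
```

Note that the wave module can't detect background noise or silence; those you still have to check by ear.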

Preparing the Audio Files

If your source of audio is in a different format than WAV, you will need to use a program like Audacity or SoX to convert the files into WAV format. You should also trim out portions of audio that are just noise, umms, ahs, and other sounds from the speaker that aren't really words you're training on.

If your source of audio isn't perfect (i.e. has some background noise), is in a different format, or happens to be a higher sample rate or different resolution (e.g. 24-bit, 32-bit, etc.), you can perform some clean-up and conversion. Here's a script that is based on an earlier script from the Mozilla TTS Discourse forums:

from pathlib import Path

import os
import subprocess
import sys

import pyloudnorm as pyln
import soundfile as sf

src = sys.argv[1]
rnn = "/PATH/TO/rnnoise_demo"

paths = Path(src).glob("**/*.wav")

for filepath in paths:
    target_filepath = Path(str(filepath).replace("original", "converted"))
    target_dir = os.path.dirname(target_filepath)

    if str(filepath) == str(target_filepath):
        raise ValueError("Source and target path are identical: " + str(target_filepath))

    print("From: " + str(filepath))
    print("To: " + str(target_filepath))

    # Stereo to mono; upsample to 48000 Hz
    subprocess.run(["sox", str(filepath), "48k.wav", "remix", "-", "rate", "48000"])
    # Convert WAV to raw 16-bit signed PCM for RNNoise
    subprocess.run(["sox", "48k.wav", "-c", "1", "-r", "48000", "-b", "16", "-e", "signed-integer", "-t", "raw", "temp.raw"])
    # Apply RNNoise noise reduction
    subprocess.run([rnn, "temp.raw", "rnn.raw"])
    # Convert raw back to WAV
    subprocess.run(["sox", "-r", "48k", "-b", "16", "-e", "signed-integer", "rnn.raw", "-t", "wav", "rnn.wav"])

    subprocess.run(["mkdir", "-p", str(target_dir)])
    # Apply high-pass/low-pass filters and resample to 22050 Hz
    subprocess.run(["sox", "rnn.wav", str(target_filepath), "remix", "-", "highpass", "100", "lowpass", "7000", "rate", "22050"])

    data, rate = sf.read(target_filepath)

    # Peak normalize audio to -1 dB
    peak_normalized_audio = pyln.normalize.peak(data, -1.0)

    # Measure the loudness with a BS.1770 meter
    meter = pyln.Meter(rate)
    loudness = meter.integrated_loudness(peak_normalized_audio)

    # Loudness normalize audio to -25 dB LUFS
    loudness_normalized_audio = pyln.normalize.loudness(peak_normalized_audio, loudness, -25.0)

    sf.write(target_filepath, data=loudness_normalized_audio, samplerate=22050)

    print("")

To use the script above, you will need to check out and build the RNNoise project:

sudo apt update
sudo apt-get install build-essential autoconf automake gdb git libffi-dev zlib1g-dev libssl-dev

git clone https://github.com/xiph/rnnoise.git
cd rnnoise
./autogen.sh
./configure
make

You will also need SoX installed:

sudo apt install sox

And you will need to install the script's Python dependencies, soundfile and pyloudnorm, via ./bin/pip.

Then, just customize the script so that rnn points to the path of the rnnoise_demo command (after building RNNoise, you can find it in the examples folder). Then, run the script, passing the source path -- the folder where you have your WAV files -- as the first command line argument. Make sure that the word "original" appears somewhere in the path. The script will automatically place the converted files in a corresponding path, with original changed to converted; for example, if your source path is /path/to/files/original, the script will automatically place the converted results in /path/to/files/converted.

Mozilla TTS supports several different data loaders, but one of the most common is LJSpeech. To use it, we can organize our data set to follow LJSpeech conventions.

First, organize your files so that you have a structure like this:

- metadata.csv
- wavs/
  - audio1.wav
  - audio2.wav
  ...
  - last_audio.wav

The naming of the audio files doesn't appear to be significant. But, the files must be in a folder called wavs. You can use sub-folders inside wavs though, if so desired.

The metadata.csv file should be in the following format:

audio1|line that's spoken in the first file
audio2|line that's spoken in the second file
last_audio|line that's spoken in the last file

Note:

  • There is no header line.
  • The columns are joined together with a pipe symbol (|).
  • There should be one row per WAV file.
  • The WAV filename is in the first column, without the wavs/ folder prefix, and without the .wav suffix.
  • The textual description of what's spoken in the WAV is written out in the second column, with all numbers and abbreviations spelled-out.
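
As an illustration, a file in this format can be generated with a short helper (write_metadata is a hypothetical name of my own; the clip names and lines are just examples):

```python
def write_metadata(transcripts, out_path):
    """Write an LJSpeech-style metadata.csv: one name|text row per clip,
    with no header line, no wavs/ prefix, and no .wav suffix."""
    with open(out_path, "w", encoding="utf-8") as f:
        for name, text in transcripts.items():
            f.write(name + "|" + text + "\n")

write_metadata(
    {"audio1": "line that's spoken in the first file",
     "audio2": "line that's spoken in the second file"},
    "metadata.csv",
)
```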

(I did observe that steps in the documentation for Mozilla TTS have you then shuffle the metadata file and then split it into a "training" set (metadata_train.csv) and "validation" set (metadata_val.csv), but none of the sample configs provided in the repo are actually configured to use these files. I've filed an issue about that because it's confusing/counter-intuitive to a beginner.)
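
If you do want to reproduce that shuffle-and-split step yourself, here is a minimal sketch (split_metadata is my own helper, not something from the repo):

```python
import random

def split_metadata(metadata_path, val_fraction=0.1, seed=42):
    """Shuffle the metadata rows and split them into
    metadata_train.csv and metadata_val.csv alongside the original."""
    with open(metadata_path, encoding="utf-8") as f:
        rows = f.readlines()
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_val = max(1, int(len(rows) * val_fraction))
    base = metadata_path.rsplit(".csv", 1)[0]
    with open(base + "_val.csv", "w", encoding="utf-8") as f:
        f.writelines(rows[:n_val])
    with open(base + "_train.csv", "w", encoding="utf-8") as f:
        f.writelines(rows[n_val:])
```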

You need to prepare a configuration file that describes how your custom TTS will be configured. This file is used by multiple parts of Mozilla TTS when preparing for training, performing training, and generating audio from your custom TTS. Unfortunately, though this file is very important, the documentation for Mozilla TTS largely glosses over how to customize this file.

To start, create a copy of the default Tacotron config.json file from the Mozilla repo. Then, be sure to customize at least the audio.stats_path, output_path, phoneme_cache_path, and datasets.path settings.

You can customize other parameters if you so choose, but the defaults are a good place to start. For example, you can change the run_name to control the naming of folders containing your datasets.
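
For orientation, here is an abridged fragment showing only the settings discussed above (the key names follow the default Tacotron config.json; the paths are placeholders to replace with your own, and this is not a complete config):

```json
{
  "run_name": "my-custom-voice",
  "audio": {
    "stats_path": "/path/to/your/project/scale_stats.npy"
  },
  "output_path": "/path/to/your/project/output/",
  "phoneme_cache_path": "/path/to/your/project/phoneme_cache/",
  "datasets": [
    {
      "name": "ljspeech",
      "path": "/path/to/your/project/",
      "meta_file_train": "metadata.csv",
      "meta_file_val": null
    }
  ]
}
```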

Do not change the datasets.name parameter (leave it set to "ljspeech"); otherwise you'll get strange errors related to an undefined dataset type. It appears that the dataset name refers to the type of data loader used, rather than what you call your data set. Similarly, I haven't risked changing the model setting, since I don't yet know how that value gets used by the system.

Most of the training configurations rely on a statistics file called scale_stats.npy that's generated from the training set. You can use the ./TTS/bin/compute_statistics.py script inside the Mozilla TTS repo to generate this file. The script takes your config.json file as an input, so running it is also a good sanity check that everything is in order up to this point.

Here's an example of a command you can run if you are inside the Mozilla TTS folder you created at the start of this tutorial (adjust paths to fit your project):

./bin/python ./TTS/bin/compute_statistics.py --config_path /path/to/your/project/config.json --out_path /path/to/your/project/scale_stats.npy

If successful, this will generate a scale_stats.npy file under /path/to/your/project/scale_stats.npy. Be sure that the path in the audio.stats_path setting of your config.json file matches this path.
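
One way to confirm the two paths agree is a quick check with the standard-library json module (stats_path_matches is a hypothetical helper of mine, not part of Mozilla TTS):

```python
import json

def stats_path_matches(config_path, expected_stats_path):
    """Return True if audio.stats_path in config.json equals the given path."""
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return config.get("audio", {}).get("stats_path") == expected_stats_path
```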

It's now time for the moment of truth -- it's time to start training your model!

Here's an example of a command you can run to train a Tacotron model if you are inside the Mozilla TTS folder you created at the start of this tutorial (adjust paths to fit your project):

./bin/python ./TTS/bin/train_tacotron.py --config_path /path/to/your/project/config.json

This process will take several hours, if not days. If your machine supports CUDA and has it properly configured, the process will run more quickly than if you are just relying on CPU alone.

If you get any errors related to a "signal error" or "signal received", this typically indicates that your machine does not have enough memory for the operation. You can run the training with less parallelism but it will run much more slowly.
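
If memory is the limiting factor, lowering the data-loader worker counts in config.json is one way to reduce parallelism. The keys below appear in the default Tacotron config, but verify them against your copy; expect slower epochs with lower values:

```json
{
  "batch_size": 16,
  "num_loader_workers": 1,
  "num_val_loader_workers": 1
}
```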
