How to simultaneously read audio samples while recording in Python for real-time speech-to-text conversion?

Question

Basically, I have trained a few models using Keras to do isolated word recognition. Currently I can record audio with sounddevice's record function for a fixed duration and save it as a WAV file. I have implemented silence detection to trim out unwanted samples, but this all happens only after the whole recording is complete. I would like to get the trimmed audio segments immediately, while recording is still in progress, so that I can do speech recognition in real time. I'm using Python 2 and TensorFlow 1.14.0. Below is a snippet of what I currently have:

import sounddevice as sd
import matplotlib.pyplot as plt
import time
#import tensorflow.keras.backend as K
import numpy as np 
from scipy.io.wavfile import write
from scipy.io.wavfile import read
from scipy.io import wavfile
from pydub import AudioSegment
import cv2
import tensorflow as tf
tf.compat.v1.enable_eager_execution()
tf.compat.v1.enable_v2_behavior()
from contextlib import closing
import multiprocessing 

models=['model1.h5','model2.h5','model3.h5','model4.h5','model5.h5']
loaded_models=[]

for model in models:
    loaded_models.append(tf.keras.models.load_model(model))

def prediction(model_ip):
    # Unpack a (model, input) tuple and return the prediction as a flat
    # list; this runs inside the multiprocessing pool below.
    model, t = model_ip
    return model.predict(t).tolist()[0]

print("recording in 5sec")
time.sleep(5)
fs = 44100  # Sample rate
seconds = 10  # Duration of recording
print('recording')
time.sleep(0.5)
myrecording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
sd.wait()
thresh=0.025
gaplimit=9000
wav_file='/home/nick/Desktop/Endpoint/aud.wav'
write(wav_file, fs, myrecording)
fs, myrecording = read(wav_file)
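# Hedged sketch (not part of the original snippet): end_points is defined
# elsewhere in the asker's project and is not shown in the question. An
# energy-based version, assumed here to split the signal into windows of
# win_ms milliseconds, keep runs whose RMS energy is at or above thresh,
# and return the trimmed segments, might look like this:
def end_points(wav_file, thresh, win_ms):
    rate, samples = read(wav_file)
    if samples.dtype == np.int16:        # int16 WAV -> scale to [-1, 1]
        samples = samples / 32768.0
    samples = samples.astype(np.float32)
    win = int(rate * win_ms / 1000.0)
    segments, current = [], []
    for start in range(0, len(samples), win):
        chunk = samples[start:start + win]
        if np.sqrt(np.mean(chunk ** 2)) >= thresh:
            current.append(chunk)        # window is loud enough: keep it
        elif current:                    # energy dropped: close the segment
            segments.append(np.concatenate(current))
            current = []
    if current:
        segments.append(np.concatenate(current))
    return segments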
# Now the silence-removal function is called; it trims the recording and
# keeps only the useful audio samples (each trimmed segment contains a
# full word that can be recognized).
trimmed_audio = end_points(wav_file, thresh, 50)
final_ans = ''   # accumulates one recognized label per trimmed segment

# The for loop below pairs each loaded model (multiple models are used)
# with the input t in a tuple for the multiprocessing pool
for trimmed_aud in trimmed_audio:
    ...
    ...  # The trimmed audio is processed further here; t is the input
    ...  # that the models can predict on
    modelon = []
    for md in loaded_models:
        modelon.append((md, t))
    start_time = time.time()
    # Run every model on the same input in parallel worker processes
    with closing(multiprocessing.Pool()) as p:
        predops = p.map(prediction, modelon)
    print('Total time taken: {}'.format(time.time() - start_time))
    actops = []
    for predop in predops:
        actops.append(predop.index(max(predop)))   # argmax class per model
    print(actops)
    # Majority vote across the ensemble decides the word for this segment
    max_freqq = max(set(actops), key=actops.count)
    final_ans += str(max_freqq)
print("Output: {}".format(final_ans))

Note that the above code only includes what is relevant to the question and will not run. I wanted to give an overview of what I have so far, and I would really appreciate your input on how I can record and trim audio based on a threshold simultaneously. If multiple words are spoken within the 10-second recording duration (the seconds variable in the code), then whenever the energy of the samples over a 50 ms window drops below a certain threshold, I want to cut the audio at those two points, trim the segment, and use it for prediction. Recording and prediction of the trimmed segments must happen concurrently, so that each output word can be displayed immediately after it is uttered during the 10 seconds of recording. I would really appreciate any suggestions on how I can go about this.

Answer

Hard to say what your model architecture is, but there are models specifically designed for streaming recognition, like Facebook's streaming convnets. You won't be able to implement them in Keras easily, though.
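That said, the capture side of the question (recording and endpointing at the same time) can be handled independently of the model: sounddevice can deliver audio in small blocks through a callback while the main thread does the energy-based trimming. Below is a minimal sketch, assuming the 44100 Hz sample rate, 50 ms window, and 0.025 RMS threshold from the question; predict_word is a hypothetical stand-in for handing a finished segment to the model ensemble.

import numpy as np
import sounddevice as sd
try:
    import queue              # Python 3
except ImportError:
    import Queue as queue     # Python 2, which the question targets

fs = 44100                                # sample rate from the question
win = int(fs * 0.050)                     # one 50 ms window per callback
thresh = 0.025                            # RMS threshold from the question
blocks = queue.Queue()

def callback(indata, frames, time_info, status):
    # Runs on the audio thread for every 50 ms block; keep it fast and
    # just hand a copy of the samples to the main thread.
    blocks.put(indata[:, 0].copy())

def predict_word(samples):
    # Hypothetical stand-in: feed one trimmed segment to the model
    # ensemble (e.g. via the multiprocessing pool from the question).
    print('segment of {} samples ready for prediction'.format(len(samples)))

segment = []
with sd.InputStream(samplerate=fs, channels=1, blocksize=win,
                    callback=callback):
    for _ in range(10 * fs // win):       # stop after ~10 s of blocks
        block = blocks.get()
        if np.sqrt(np.mean(block ** 2)) >= thresh:
            segment.append(block)         # still inside a word
        elif segment:                     # energy dropped: word ended
            predict_word(np.concatenate(segment))
            segment = []
if segment:                               # word still open at the deadline
    predict_word(np.concatenate(segment))

The prediction itself should run in a separate worker (a thread or your existing multiprocessing pool) rather than inline in this loop; otherwise a slow model.predict call blocks the consumer, blocks pile up in the queue, and the trimming lags behind the microphone.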
