How to get Word Level Timestamps using Azure Speech to Text and the Python SDK?


Problem Description


My code is currently able to read an audio file and transcribe it using Azure Speech to Text, with help from an example I found on GitHub. However, I need to include the timestamps for all the words in the transcription. According to the documentation, this functionality was added in version 1.5.0 and is accessed through the method request_word_level_timestamps(). But even after calling it, I get the same response as before. I cannot figure out from the documentation how to use it. Does anyone know how it works?


I'm using Python SDK version 1.5.1.

import azure.cognitiveservices.speech as speechsdk
import time
from allennlp.predictors.predictor import Predictor
import json

inputPath = "(inputlocation)"
outputPath = "(outputlocation)"

# Creates an instance of a speech config with specified subscription key and service region.
# Replace with your own subscription key and service region (e.g., "westus").
speech_key, service_region = "apikey", "region"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
speech_config.request_word_level_timestamps()
speech_config.output_format = speechsdk.OutputFormat.Detailed
#print("VALUE: " + speech_config.get_property(property_id=speechsdk.PropertyId.SpeechServiceResponse_RequestWordLevelTimestamps))
filename = input("Enter filename: ")

print(speech_config)

try:
    audio_config = speechsdk.audio.AudioConfig(filename=inputPath + filename)

    # Creates a recognizer with the given settings
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    def start():
        done = False
        #output = ""
        fileOpened = open(outputPath + filename[0: len(filename) - 4] + "_MS_recognized.txt", "w+")
        fileOpened.truncate(0)
        fileOpened.close()

        def stop_callback(evt):
            print("Closing on {}".format(evt))
            speech_recognizer.stop_continuous_recognition()
            nonlocal done
            done = True

        def add_to_res(evt):
            #nonlocal output
            #print("Recognized: {}".format(evt.result.text))
            #output = output + evt.result.text + "\n"
            fileOpened = open(outputPath + filename[0: len(filename) - 4] + "_MS_recognized.txt", "a")
            fileOpened.write(evt.result.text + "\n")
            fileOpened.close()
            #print(output)

        # Connect callbacks to the events fired by the speech recognizer
        speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
        speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
        speech_recognizer.recognized.connect(add_to_res)
        speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
        speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
        speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))
        # stop continuous recognition on either session stopped or canceled events
        speech_recognizer.session_stopped.connect(stop_callback)
        speech_recognizer.canceled.connect(stop_callback)

        # Start continuous speech recognition
        speech_recognizer.start_continuous_recognition()
        while not done:
            time.sleep(.5)
        # </SpeechContinuousRecognitionWithFile>

        # Starts speech recognition, and returns after a single utterance is recognized. The end of a
        # single utterance is determined by listening for silence at the end or until a maximum of 15
        # seconds of audio is processed.  The task returns the recognition text as result.
        # Note: Since recognize_once() returns only a single utterance, it is suitable only for single
        # shot recognition like command or query.
        # For long-running multi-utterance recognition, use start_continuous_recognition() instead.

    start()

except Exception as e:
    print("File does not exist")
    #print(e)


The results only contain session_id and a result object which includes result_id, text and reason.

Answer


Per a comment asking how this helps for continuous recognition: if you set up the SpeechConfig with request_word_level_timestamps(), you can run this as continuous recognition. You can inspect the JSON results with evt.result.json.

For example,

speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
speech_config.request_word_level_timestamps()


then your speech recognizer:

speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)


When you're connecting callbacks to the events fired by the speech_recognizer, you can see word-level timestamps with:

speech_recognizer.recognized.connect(lambda evt: print('JSON: {}'.format(evt.result.json)))
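The Detailed-format JSON carried in evt.result.json puts the per-word timing in the NBest[0].Words array, with Offset and Duration expressed in 100-nanosecond ticks. A minimal sketch of extracting them follows; the sample payload is a hand-written illustration of the shape, not real service output:

```python
import json

# Illustrative stand-in for evt.result.json (Detailed output format).
sample = '''
{
  "DisplayText": "Hello world.",
  "Offset": 500000,
  "Duration": 13000000,
  "NBest": [
    {
      "Confidence": 0.95,
      "Display": "Hello world.",
      "Words": [
        {"Word": "hello", "Offset": 500000, "Duration": 5000000},
        {"Word": "world", "Offset": 6000000, "Duration": 7000000}
      ]
    }
  ]
}
'''

TICKS_PER_SECOND = 10_000_000  # Offset/Duration are in 100-ns ticks

def word_timestamps(result_json):
    """Return (word, start_seconds, duration_seconds) tuples from a
    Detailed recognition result."""
    payload = json.loads(result_json)
    best = payload["NBest"][0]  # hypotheses are sorted by confidence
    return [
        (w["Word"],
         w["Offset"] / TICKS_PER_SECOND,
         w["Duration"] / TICKS_PER_SECOND)
        for w in best["Words"]
    ]

for word, start, dur in word_timestamps(sample):
    print(f"{word}: starts at {start:.2f}s, lasts {dur:.2f}s")
```

In a callback you would pass evt.result.json instead of the sample string, e.g. `speech_recognizer.recognized.connect(lambda evt: print(word_timestamps(evt.result.json)))`.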


My issue is that the Translation object doesn't contain word-level as it doesn't accept a speech_config.
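One untested avenue for the translation case: in the Python SDK, SpeechTranslationConfig derives from SpeechConfig, so the same service property the question's code references (speechsdk.PropertyId.SpeechServiceResponse_RequestWordLevelTimestamps) can be set on it directly. Whether the translation service honors it may depend on the SDK version; the sketch below is an assumption, not a confirmed fix, and the key/region values are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials; replace with your own.
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription="apikey", region="region")
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("de")

# Request word-level timestamps via the service property directly,
# since SpeechTranslationConfig is itself a SpeechConfig (untested).
translation_config.set_property(
    property_id=speechsdk.PropertyId.SpeechServiceResponse_RequestWordLevelTimestamps,
    value="true")
```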
