Speech recognition with Microsoft Cognitive Speech API and non-microphone real-time audio stream

Problem

My project consists of a desktop application that records audio in real time, for which I intend to receive real-time recognition feedback from the API. With a microphone, a real-time implementation using Microsoft's new Speech-to-Text API is trivial; my scenario differs only in that my data is written to a MemoryStream object.

API support

This article explains how to implement the API's Recognizer (link) with custom audio streams, which invariably requires the implementation of the abstract class PullAudioInputStream (link) in order to create the required AudioConfig object using the CreatePullStream method (link). In other words, to achieve what I require, a callback interface must be implemented.

Implementation attempt

Since my data is written to a MemoryStream (and the library I use will only record to files or Stream objects), in the code below I simply copy the buffer over to the implemented class (in a sloppy way, perhaps?), resolving the divergence in method signatures.

class AudioInputCallback : PullAudioInputStreamCallback
{
    private readonly MemoryStream memoryStream;

    public AudioInputCallback(MemoryStream stream)
    {
        this.memoryStream = stream;
    }

    // Called by the Speech SDK whenever it needs more audio data.
    public override int Read(byte[] dataBuffer, uint size)
    {
        return this.Read(dataBuffer, 0, (int)size);
    }

    // Bridges the SDK's (buffer, size) signature to Stream.Read's
    // (buffer, offset, count) signature.
    private int Read(byte[] buffer, int offset, int count)
    {
        return memoryStream.Read(buffer, offset, count);
    }

    public override void Close()
    {
        memoryStream.Close();
        base.Close();
    }
}

The Recognizer implementation is as follows:

private SpeechRecognizer CreateMicrosoftSpeechRecognizer(MemoryStream memoryStream)
{
    var recognizerConfig = SpeechConfig.FromSubscription(SubscriptionKey, @"westus");
    recognizerConfig.SpeechRecognitionLanguage =
        _programInfo.CurrentSourceCulture.TwoLetterISOLanguageName;

    // Constants are used as constructor params
    var format = AudioStreamFormat.GetWaveFormatPCM(
        samplesPerSecond: SampleRate, bitsPerSample: BitsPerSample, channels: Channels);

    // Implementation of PullAudioInputStreamCallback
    var callback = new AudioInputCallback(memoryStream);
    AudioConfig audioConfig = AudioConfig.FromStreamInput(callback, format);

    // Actual recognizer is created with the required objects
    SpeechRecognizer recognizer = new SpeechRecognizer(recognizerConfig, audioConfig);

    // Event subscriptions. Most handlers are implemented for debugging purposes only.
    // A log window outputs the feedback from the event handlers.
    recognizer.Recognized += MsRecognizer_Recognized;
    recognizer.Recognizing += MsRecognizer_Recognizing;
    recognizer.Canceled += MsRecognizer_Canceled;
    recognizer.SpeechStartDetected += MsRecognizer_SpeechStartDetected;
    recognizer.SpeechEndDetected += MsRecognizer_SpeechEndDetected;
    recognizer.SessionStopped += MsRecognizer_SessionStopped;
    recognizer.SessionStarted += MsRecognizer_SessionStarted;

    return recognizer;
}

How the data is made available to the recognizer (using CSCore):

MemoryStream memoryStream = new MemoryStream(_finalSource.WaveFormat.BytesPerSecond / 2);
byte[] buffer = new byte[_finalSource.WaveFormat.BytesPerSecond / 2];

_soundInSource.DataAvailable += (s, e) =>
{
    int read;
    _programInfo.IsDataAvailable = true;

    // Writes to MemoryStream as event fires
    while ((read = _finalSource.Read(buffer, 0, buffer.Length)) > 0)
        memoryStream.Write(buffer, 0, read);
};

// Creates MS recognizer from MemoryStream
_msRecognizer = CreateMicrosoftSpeechRecognizer(memoryStream);

// Initializes loopback capture instance
_soundIn.Start();

await Task.Delay(1000);

// Starts recognition
await _msRecognizer.StartContinuousRecognitionAsync();

Outcome

When the application is run, I don't get any exceptions, nor any response from the API other than SessionStarted and SessionStopped, as depicted below in the log window of my application.

I could use suggestions of different approaches to my implementation, as I suspect there is some timing problem in tying the recorded DataAvailable event with the actual sending of data to the API, which is making it discard the session prematurely. With no detailed feedback on why my requests are unsuccessful, I can only guess at the reason.

Solution

The Read() callback of PullAudioInputStream should block if there is no data immediately available, and Read() should return 0 only when the stream reaches its end. The SDK will then close the stream after Read() returns 0 (find an API reference doc here).

However, the behavior of Read() on a C# MemoryStream is different: it returns 0 immediately if there is no data available in the buffer. This is why you only see the SessionStarted and SessionStopped events, but no recognition events.
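
As a small illustration (my own sketch, not part of the original answer), this is what the SDK effectively observes when it pulls from an empty MemoryStream:

var memoryStream = new MemoryStream();
var buffer = new byte[4096];

// MemoryStream.Read never blocks: with no unread data in the buffer it
// returns 0 immediately, which the Speech SDK interprets as end-of-stream.
int read = memoryStream.Read(buffer, 0, buffer.Length);
// read == 0 here, even though a producer might write data later.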

In order to fix that, you need to add some kind of synchronization between PullAudioInputStream::Read() and MemoryStream::Write(), so that PullAudioInputStream::Read() waits until MemoryStream::Write() has written some data into the buffer.
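
One way to do that synchronization (a minimal sketch under my own naming; BlockingAudioInputCallback and its Write/Complete helpers are hypothetical, not SDK members or the answer's code) is to back the callback with a BlockingCollection, so that Read() blocks until data arrives and returns 0 only after the producer explicitly completes the stream:

using System;
using System.Collections.Concurrent;
using System.Threading;
using Microsoft.CognitiveServices.Speech.Audio;

class BlockingAudioInputCallback : PullAudioInputStreamCallback
{
    private readonly BlockingCollection<byte[]> chunks = new BlockingCollection<byte[]>();
    private byte[] current = Array.Empty<byte>();
    private int currentOffset;

    // Called from the recording thread, e.g. inside the DataAvailable handler.
    public void Write(byte[] data, int count)
    {
        var copy = new byte[count];
        Buffer.BlockCopy(data, 0, copy, 0, count);
        chunks.Add(copy);
    }

    // Call when recording ends; after the queue drains, Read() returns 0.
    public void Complete() => chunks.CompleteAdding();

    public override int Read(byte[] dataBuffer, uint size)
    {
        if (currentOffset >= current.Length)
        {
            // Blocks until a chunk arrives; returns false only once
            // Complete() has been called and the queue is empty.
            if (!chunks.TryTake(out current, Timeout.Infinite))
                return 0; // genuine end of stream
            currentOffset = 0;
        }

        int toCopy = Math.Min((int)size, current.Length - currentOffset);
        Buffer.BlockCopy(current, currentOffset, dataBuffer, 0, toCopy);
        currentOffset += toCopy;
        return toCopy;
    }
}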

Alternatively, I would recommend using PushAudioInputStream, which allows you to write your data directly into the stream. For your case, in the _soundInSource.DataAvailable event, instead of writing data into the MemoryStream, you can write it directly into the PushAudioInputStream. You can find samples for PushAudioInputStream here.
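
A minimal sketch of that push-based variant, reusing the question's capture fields (_soundInSource, _finalSource, buffer) and format constants; it relies on the SDK's AudioInputStream.CreatePushStream and PushAudioInputStream.Write, but treat it as an outline under those assumptions rather than the answer's exact code:

var format = AudioStreamFormat.GetWaveFormatPCM(
    samplesPerSecond: SampleRate, bitsPerSample: BitsPerSample, channels: Channels);
var pushStream = AudioInputStream.CreatePushStream(format);
var audioConfig = AudioConfig.FromStreamInput(pushStream);

_soundInSource.DataAvailable += (s, e) =>
{
    int read;
    // Push captured bytes straight into the SDK instead of a MemoryStream;
    // the recognizer consumes them as they arrive.
    while ((read = _finalSource.Read(buffer, 0, buffer.Length)) > 0)
        pushStream.Write(buffer, read);
};

// When capture stops, close the stream so the SDK sees end-of-audio:
// pushStream.Close();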

We will update the documentation in order to provide the best practice on how to use Pull and Push AudioInputStream. Sorry for the inconvenience.

Thank you!
