Stream audio from mic to IBM Watson SpeechToText Web service using Java SDK

Problem description

Trying to send a continuous audio stream from microphone directly to IBM Watson SpeechToText Web service using the Java SDK. One of the examples provided with the distribution (RecognizeUsingWebSocketsExample) shows how to stream a file in .WAV format to the service. However, .WAV files require that their length be specified ahead of time, so the naive approach of just appending to the file one buffer at a time is not feasible.

It appears that SpeechToText.recognizeUsingWebSocket can take a stream, but feeding it an instance of AudioInputStream does not seem to work: the connection is established, but no transcripts are returned, even with RecognizeOptions.interimResults(true) set.

public class RecognizeUsingWebSocketsExample {
    private static CountDownLatch lock = new CountDownLatch(1);

    public static void main(String[] args) throws FileNotFoundException, InterruptedException {
        SpeechToText service = new SpeechToText();
        service.setUsernameAndPassword("<username>", "<password>");

        AudioInputStream audio = null;

        try {
            final AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
            DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
            TargetDataLine line;
            line = (TargetDataLine) AudioSystem.getLine(info);
            line.open(format);
            line.start();
            audio = new AudioInputStream(line);
        } catch (LineUnavailableException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        RecognizeOptions options = new RecognizeOptions.Builder()
            .continuous(true)
            .interimResults(true)
            .contentType(HttpMediaType.AUDIO_WAV)
            .build();

        service.recognizeUsingWebSocket(audio, options, new BaseRecognizeCallback() {
            @Override
            public void onTranscription(SpeechResults speechResults) {
                System.out.println(speechResults);
                if (speechResults.isFinal())
                    lock.countDown();
            }
        });

        lock.await(1, TimeUnit.MINUTES);
    }
}

Any help would be greatly appreciated.

-rg

Here's an update based on German's comment below (thanks for that).

I was able to use javaFlacEncode to convert the WAV stream arriving from the mic into a FLAC stream and save it into a temporary file. Unlike a WAV audio file, whose size is fixed at creation, the FLAC file can be appended to easily.

    WAV_audioInputStream = new AudioInputStream(line);
    FileInputStream FLAC_audioInputStream = new FileInputStream(tempFile);

    StreamConfiguration streamConfiguration = new StreamConfiguration();
    streamConfiguration.setSampleRate(16000);
    streamConfiguration.setBitsPerSample(8);
    streamConfiguration.setChannelCount(1);

    flacEncoder = new FLACEncoder();
    flacOutputStream = new FLACFileOutputStream(tempFile);  // write to temp disk file

    flacEncoder.setStreamConfiguration(streamConfiguration);
    flacEncoder.setOutputStream(flacOutputStream);

    flacEncoder.openFLACStream();

    ...
    // convert data
    int frameLength = 16000;
    int[] intBuffer = new int[frameLength];
    byte[] byteBuffer = new byte[frameLength];

    while (true) {
        int count = WAV_audioInputStream.read(byteBuffer, 0, frameLength);
        for (int j1=0;j1<count;j1++)
            intBuffer[j1] = byteBuffer[j1];

        flacEncoder.addSamples(intBuffer, count);
        flacEncoder.encodeSamples(count, false);  // 'false' means non-final frame
    }

    flacEncoder.encodeSamples(flacEncoder.samplesAvailableToEncode(), true);  // final frame
    WAV_audioInputStream.close();
    flacOutputStream.close();
    FLAC_audioInputStream.close();

The resulting file can be analyzed (using curl or recognizeUsingWebSocket()) without any problems after adding an arbitrary number of frames. However, recognizeUsingWebSocket() returns its final result as soon as it reaches the end of the FLAC file, even though the file's last frame may not be final (i.e., it was written with encodeSamples(count, false)).

I would expect recognizeUsingWebSocket() to block until the final frame is written to the file. In practical terms, this means the analysis stops after the first frame, since analyzing the first frame takes less time than collecting the second, so by the time the results come back the end of the file has already been reached.
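
One way to get that blocking behaviour with the temp-file approach would be to hand the recognizer an InputStream that tails the growing FLAC file and only reports end-of-stream after the encoder has written the final frame. The sketch below is purely illustrative; TailingFileInputStream and its markFinished() method are hypothetical helpers of my own, not part of the Watson SDK or javaFlacEncode:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;

// Hypothetical helper: an InputStream over a file that is still growing. Instead of
// reporting end-of-stream at the current end of file, it waits until markFinished()
// is called, so the recognizer only sees EOF once the final FLAC frame is written.
class TailingFileInputStream extends InputStream {
    private final RandomAccessFile file;
    private volatile boolean finished = false;

    TailingFileInputStream(File f) throws IOException {
        this.file = new RandomAccessFile(f, "r");
    }

    // Call this after flacEncoder.encodeSamples(..., true) has written the last frame.
    void markFinished() {
        finished = true;
    }

    @Override
    public int read() throws IOException {
        while (true) {
            if (file.getFilePointer() < file.length()) {
                return file.read();
            }
            if (finished) {
                return -1;  // genuine end of audio
            }
            try {
                Thread.sleep(50);  // wait for the encoder to append more data
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;
            }
        }
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}

The PipedInputStream approach in the update further down achieves the same effect without going through a file at all.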

Is this the right way to implement streaming audio from a mic in Java? Seems like a common use case.

Here's a modification of RecognizeUsingWebSocketsExample, incorporating some of Daniel's suggestions below. It uses the PCM content type (passed as a String, together with the sample rate) and attempts to signal the end of the audio stream, albeit not very successfully.

As before, the connection is made, but the recognize callback is never called. Closing the stream does not seem to be interpreted as an end of audio either. I must be misunderstanding something here...

public static void main(String[] args) throws IOException, LineUnavailableException, InterruptedException {

    final PipedOutputStream output = new PipedOutputStream();
    final PipedInputStream  input  = new PipedInputStream(output);

    final AudioFormat format = new AudioFormat(16000, 8, 1, true, false);
    DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
    final TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
    line.open(format);
    line.start();

    Thread thread1 = new Thread(new Runnable() {
        @Override
        public void run() {
            try {
                final int MAX_FRAMES = 2;
                byte buffer[] = new byte[16000];
                for (int j1 = 0; j1 < MAX_FRAMES; j1++) {  // read two frames from microphone
                    int count = line.read(buffer, 0, buffer.length);
                    System.out.println("Read audio frame from line: " + count);
                    output.write(buffer, 0, buffer.length);
                    System.out.println("Written audio frame to pipe: " + count);
                }
                /** no need to fake end-of-audio;  StopMessage will be sent
                 * automatically by SDK once the pipe is drained (see WebSocketManager)
                // signal end of audio; based on WebSocketUploader.stop() source
                byte[] stopData = new byte[0];
                output.write(stopData);
                **/
            } catch (IOException e) {
            }
        }
    });
    thread1.start();

    final CountDownLatch lock = new CountDownLatch(1);

    SpeechToText service = new SpeechToText();
    service.setUsernameAndPassword("<username>", "<password>");

    RecognizeOptions options = new RecognizeOptions.Builder()
        .continuous(true)
        .interimResults(false)
        .contentType("audio/pcm; rate=16000")
        .build();

    service.recognizeUsingWebSocket(input, options, new BaseRecognizeCallback() {
        @Override
        public void onConnected() {
            System.out.println("Connected.");
        }
        @Override
        public void onTranscription(SpeechResults speechResults) {
            System.out.println("Received results.");
            System.out.println(speechResults);
            if (speechResults.isFinal())
                lock.countDown();
        }
    });

    System.out.println("Waiting for STT callback ... ");

    lock.await(5, TimeUnit.SECONDS);

    line.stop();

    System.out.println("Done waiting for STT callback.");

}


Dani, I instrumented the source for WebSocketManager (comes with SDK) and replaced a call to sendMessage() with an explicit StopMessage payload as follows:

    /**
     * Send input steam.
     *
     * @param inputStream the input stream
     * @throws IOException Signals that an I/O exception has occurred.
     */
    private void sendInputSteam(InputStream inputStream) throws IOException {
      int cumulative = 0;
      byte[] buffer = new byte[FOUR_KB];
      int read;
      while ((read = inputStream.read(buffer)) > 0) {
        cumulative += read;
        if (read == FOUR_KB) {
          socket.sendMessage(RequestBody.create(WebSocket.BINARY, buffer));
        } else {
          System.out.println("completed sending " + cumulative/16000 + " frames over socket");
          socket.sendMessage(RequestBody.create(WebSocket.BINARY, Arrays.copyOfRange(buffer, 0, read)));  // partial buffer write
          System.out.println("signaling end of audio");
          socket.sendMessage(RequestBody.create(WebSocket.TEXT, buildStopMessage().toString()));  // end of audio signal

        }

      }
      inputStream.close();
    }

Neither sendMessage() option (sending 0-length binary content or sending the stop text message) seems to work. The caller code is unchanged from above. The resulting output is:

Waiting for STT callback ... 
Connected.
Read audio frame from line: 16000
Written audio frame to pipe: 16000
Read audio frame from line: 16000
Written audio frame to pipe: 16000
completed sending 2 frames over socket
onFailure: java.net.SocketException: Software caused connection abort: socket write error

REVISED: actually, the end-of-audio call is never reached; the exception is thrown while writing the last (partial) buffer to the socket.

Why is the connection aborted? That typically happens when the peer closes the connection.

As for point 2): would either of these matter at this stage? It appears that the recognition process is not being started at all... The audio is valid (I wrote the stream out to disk and was able to recognize it by streaming it from a file, as noted above).

Also, on further review of the WebSocketManager source code, onMessage() already sends a StopMessage immediately upon return from sendInputSteam() (i.e., when the audio stream, or the pipe in the example above, drains), so there is no need to call it explicitly. The problem is definitely occurring before the audio data transmission completes. The behavior is the same regardless of whether a PipedInputStream or an AudioInputStream is passed as input; the exception is thrown while sending binary data in both cases.

Recommended answer

The Java SDK has an example and supports this.

Update your pom.xml with:

 <dependency>
   <groupId>com.ibm.watson.developer_cloud</groupId>
   <artifactId>java-sdk</artifactId>
   <version>3.3.1</version>
 </dependency>

Here is an example of how to listen to your microphone.

SpeechToText service = new SpeechToText();
service.setUsernameAndPassword("<username>", "<password>");

// Signed PCM AudioFormat with 16kHz, 16 bit sample size, mono
int sampleRate = 16000;
AudioFormat format = new AudioFormat(sampleRate, 16, 1, true, false);
DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);

if (!AudioSystem.isLineSupported(info)) {
  System.out.println("Line not supported");
  System.exit(0);
}

TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
line.open(format);
line.start();

AudioInputStream audio = new AudioInputStream(line);

RecognizeOptions options = new RecognizeOptions.Builder()
  .continuous(true)
  .interimResults(true)
  .timestamps(true)
  .wordConfidence(true)
  //.inactivityTimeout(5) // use this to stop listening when the speaker pauses, i.e. for 5s
  .contentType(HttpMediaType.AUDIO_RAW + "; rate=" + sampleRate)
  .build();

service.recognizeUsingWebSocket(audio, options, new BaseRecognizeCallback() {
  @Override
  public void onTranscription(SpeechResults speechResults) {
    System.out.println(speechResults);
  }
});

System.out.println("Listening to your voice for the next 30s...");
Thread.sleep(30 * 1000);

// closing the WebSocket's underlying InputStream will close the WebSocket itself.
line.stop();
line.close();

System.out.println("Fin.");
