自然语言中断的Split String [英] Split String at natural language breaks

查看:84
本文介绍了自然语言中断的Split String的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

概述

Overview

我将字符串发送到文本到语音服务器,该服务器接受最大长度为300个字符。由于网络延迟,返回的每个语句段之间可能会有延迟,因此我希望尽可能在最自然暂停时打破语音。

I send Strings to a Text-to-Speech server that accepts a maximum length of 300 characters. Due to network latency, there may be a delay between each section of speech being returned, so I'd like to break the speech up at the most 'natural pauses' wherever possible.

每个服务器请求都花了我钱,所以理想情况下我会发送最长的字符串,最多允许的字符数。

Each server request costs me money, so ideally I'd send the longest string possible, up to the maximum allowed characters.

这是我目前的实现:

private static final boolean DEBUG = true;

private static final int MAX_UTTERANCE_LENGTH = 298;
private static final int MIN_UTTERANCE_LENGTH = 200;

private static final String FULL_STOP_SPACE = ". ";
private static final String QUESTION_MARK_SPACE = "? ";
private static final String EXCLAMATION_MARK_SPACE = "! ";
private static final String LINE_SEPARATOR = System.getProperty("line.separator");
private static final String COMMA_SPACE = ", ";
private static final String JUST_A_SPACE = " ";

public static ArrayList<String> splitUtteranceNaturalBreaks(String utterance) {

    final long then = System.nanoTime();

    final ArrayList<String> speakableUtterances = new ArrayList<String>();

    int splitLocation = 0;
    String success = null;

    while (utterance.length() > MAX_UTTERANCE_LENGTH) {

        splitLocation = utterance.lastIndexOf(FULL_STOP_SPACE, MAX_UTTERANCE_LENGTH);

        if (DEBUG) {
            System.out.println("(0 FULL STOP) - last index at: " + splitLocation);
        }

        if (splitLocation < MIN_UTTERANCE_LENGTH) {
            if (DEBUG) {
                System.out.println("(1 FULL STOP) - NOT_OK");
            }

            splitLocation = utterance.lastIndexOf(QUESTION_MARK_SPACE, MAX_UTTERANCE_LENGTH);

            if (DEBUG) {
                System.out.println("(1 QUESTION MARK) - last index at: " + splitLocation);
            }

            if (splitLocation < MIN_UTTERANCE_LENGTH) {
                if (DEBUG) {
                    System.out.println("(2 QUESTION MARK) - NOT_OK");
                }

                splitLocation = utterance.lastIndexOf(EXCLAMATION_MARK_SPACE, MAX_UTTERANCE_LENGTH);

                if (DEBUG) {
                    System.out.println("(2 EXCLAMATION MARK) - last index at: " + splitLocation);
                }

                if (splitLocation < MIN_UTTERANCE_LENGTH) {
                    if (DEBUG) {
                        System.out.println("(3 EXCLAMATION MARK) - NOT_OK");
                    }

                    splitLocation = utterance.lastIndexOf(LINE_SEPARATOR, MAX_UTTERANCE_LENGTH);

                    if (DEBUG) {
                        System.out.println("(3 SEPARATOR) - last index at: " + splitLocation);
                    }

                    if (splitLocation < MIN_UTTERANCE_LENGTH) {
                        if (DEBUG) {
                            System.out.println("(4 SEPARATOR) - NOT_OK");
                        }

                        splitLocation = utterance.lastIndexOf(COMMA_SPACE, MAX_UTTERANCE_LENGTH);

                        if (DEBUG) {
                            System.out.println("(4 COMMA) - last index at: " + splitLocation);
                        }

                        if (splitLocation < MIN_UTTERANCE_LENGTH) {
                            if (DEBUG) {
                                System.out.println("(5 COMMA) - NOT_OK");
                            }

                            splitLocation = utterance.lastIndexOf(JUST_A_SPACE, MAX_UTTERANCE_LENGTH);

                            if (DEBUG) {
                                System.out.println("(5 SPACE) - last index at: " + splitLocation);
                            }

                            if (splitLocation < MIN_UTTERANCE_LENGTH) {
                                if (DEBUG) {
                                    System.out.println("(6 SPACE) - NOT_OK");
                                }

                                splitLocation = MAX_UTTERANCE_LENGTH;

                                if (DEBUG) {
                                    System.out.println("(6 MAX_UTTERANCE_LENGTH) - last index at: " + splitLocation);
                                }

                            } else {
                                if (DEBUG) {
                                    System.out.println("Accepted");
                                }

                                splitLocation -= 1;
                            }
                        }
                    } else {
                        if (DEBUG) {
                            System.out.println("Accepted");
                        }

                        splitLocation -= 1;
                    }
                } else {
                    if (DEBUG) {
                        System.out.println("Accepted");
                    }
                }
            } else {
                if (DEBUG) {
                    System.out.println("Accepted");
                }
            }
        } else {
            if (DEBUG) {
                System.out.println("Accepted");
            }
        }

        success = utterance.substring(0, (splitLocation + 2));

        speakableUtterances.add(success.trim());

        if (DEBUG) {
            System.out.println("Split - Length: " + success.length() + " -:- " + success);
            System.out.println("------------------------------");
        }

        utterance = utterance.substring((splitLocation + 2)).trim();
    }

    speakableUtterances.add(utterance);

    if (DEBUG) {

        System.out.println("Split - Length: " + utterance.length() + " -:- " + utterance);

        final long now = System.nanoTime();
        final long elapsed = now - then;

        System.out.println("ELAPSED: " + TimeUnit.MILLISECONDS.convert(elapsed, TimeUnit.NANOSECONDS));

    }

    return speakableUtterances;
}

由于无法在中使用正则表达式而丑陋lastIndexOf 。除了丑陋之外,它实际上非常快。

It's ugly due to being unable to use regex within lastIndexOf. Ugly aside, it's actually pretty fast.

问题

理想情况下我会喜欢使用正则表达式,允许匹配我的第一个选择分隔符:

Ideally I'd like to use regex that allows for a match on one of my first choice delimiters:

private static final String firstChoice = "[.!?" + LINE_SEPARATOR + "]\\s+";
private static final Pattern pFirstChoice = Pattern.compile(firstChoice);

然后使用匹配器解析仓位:

And then use a matcher to resolve the position:

    Matcher matcher = pFirstChoice.matcher(input);

    if (matcher.find()) {
        splitLocation = matcher.start();
    }

我当前实现的替代方法是存储每个分隔符的位置然后选择最近的 MAX_UTTERANCE_LENGTH

My alternative in my current implementation is to store the location of each delimiter and then select the nearest to MAX_UTTERANCE_LENGTH

我尝试了各种方法来应用 MIN_UTTERANCE_LENGTH & MAX_UTTERANCE_LENGTH 到模式,所以它只捕获这些值并使用lookarounds来反向迭代?< = ,但是这是我的知识开始让我失望的地方:

I've tried various methods to apply the MIN_UTTERANCE_LENGTH & MAX_UTTERANCE_LENGTH to the Pattern, so it only captures between these values and using lookarounds to reverse iterate ?<=, but this is where my knowledge starts to fail me:

private static final String poorEffort =([。!?] {200,298}) \\\\ +);

最后

我想知道你们中的任何一个正则表达式大师是否能达到我所追求的目标并确认实际上是否会更有效率?

I wonder if any of you regex masters can achieve what I'm after and confirm if in actual fact, it will prove more efficient?

我感谢你们提前。

参考文献:

  • Split a string at a natural break (Python)
  • Lookarounds
  • Regex to Split Tokens With Minimum Size and Delimiters

推荐答案

我会做这样的事情:

Pattern p = Pattern.compile(".{1,299}(?:[.!?]\\s+|\\n|$)", Pattern.DOTALL);
Matcher matcher = p.matcher(text);
while (matcher.find()) {
    speakableUtterances.add(matcher.group().trim());
}

正则表达式的说明:

.{1,299}                 any character between 1 and 299 times (matching the most amount possible)
(?:[.!?]\\s+|\\n|$)      followed by either .!? and whitespaces, a newline or the end of the string

您可以考虑将标点符号扩展为 \p {Punct} ,请参阅javadoc 模式

You could consider to extend the punctuation to \p{Punct}, see javadoc for Pattern.

您可以在 ideone

这篇关于自然语言中断的Split String的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆