从长字符串标记化有效字 [英] Tokenize valid words from a long string

查看:150
本文介绍了从长字符串标记化有效字的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

假设你有一个包含有效字字典。

Suppose you have a dictionary that contains valid words.

由于去除了所有的空格输入字符串,确定字符串是否是由有效的话还是不行。

Given an input string with all spaces removed, determine whether the string is composed of valid words or not.

您可以假设该词典是一个哈希表,提供了O(1)查找。

You can assume the dictionary is a hashtable that provides O(1) lookup.

Some examples:

helloworld-> hello world (valid)
isitniceinhere-> is it nice in here (valid)
zxyy-> invalid

如果字符串有多个可能parsings,刚刚返回true就足够了。

If a string has multiple possible parsings, just return true is sufficient.

该字符串可能会很长。因此,认为一种算法,既是空间和放大器;时间效率。

The string can be very long. Hence think an algorithm that is both space & time efficient.

推荐答案

我觉得一套出现为有效字(从有限的字典采取的话)的串联所有字符串的形成规则的语言在字符的字母。然后,您可以建立一个接受正是你想要的琴弦有限自动机;计算时间为O(n)。

I think the set of all strings that occur as the concatenation of valid words (words taken from a finite dictionary) form a regular language over the alphabet of characters. You can then build a finite automaton that accepts exactly the strings you want; computation time is O(n).

例如,我们的词典包含的单词{蝙蝠,包}。然后,我们构建了以下自动机:(0,1,二),(1,2,一),(2,0,吨),(2,0,克):状态由0,1,2边记;其中三元组(X,Y,Z)指的边缘从x导致至y上输入ž。唯一的接受状态为0。在每个步骤中,在读取下一个输入符号,你算算一组指出可达上的投入。鉴于各国在自动机的数量是恒定的,这是一个复杂度为O(N)。至于空间的复杂性,我认为你可以用O(字数)与提示建设上面做的。

For instance, let the dictionary consist of the words {bat, bag}. Then we construct the following automaton: states are denoted by 0, 1, 2. Edges: (0,1,b), (1,2,a), (2,0,t), (2,0,g); where the triple (x,y,z) means an edge leading from x to y on input z. The only accepting state is 0. In each step, on reading the next input sign, you have to calculate the set of states that are reachable on that input. Given that the number of states in the automaton is constant, this is of complexity O(n). As for space complexity, I think you can do with O(number of words) with the hint for construction above.

有关的另一个例子,用的话{袋,蝙蝠,包子,但}自动机是这样的:

For an other example, with the words {bag, bat, bun, but} the automaton would look like this:

假设自动机已经建成(时间做这个事做的长度和字数:-)我们现在认为的时间来决定由自动字符串是否被接受为O( n),其中n是输入字符串的长度。 更确切地讲,我们的算法如下:

Supposing that the automaton has already been built (the time to do this has something to do with the length and number of words :-) we now argue that the time to decide whether a string is accepted by the automaton is O(n) where n is the length of the input string. More formally, our algorithm is as follows:

  1. 设S是一组状态,最初包含起始状态。
  2. 读取下一个输入字符,让我们表示它由一个。
  3. 对于每一个元素S在S,决定了我们进入从s的读取状态;即,状态R,使得与上面的符号(S,R,a)是一个边缘。让我们表示该组这些状态由R.即,R = {Rf | S在S,(S,R,a)是边缘}。
  4. (如果R是空的,该字符串不被接受,算法停止。)
  5. 如果没有更多的输入符号,检查是否有任何接受状态的是在R(在我们的情况下,只有一个接收状态,即开始状态)。如果是这样,该字符串被接受,如果不是,该字符串是不能接受的。
  6. 否则,取S:= R和去2

现在,还有这个周期的尽可能多的执行,因为有输入符号。我们必须检查的唯一的事情是,步骤3和5需要一定的时间。鉴于S和R的尺寸不大于状态自动机中,这是恒定的数目,并且我们可以存储在一个方式,使得查找时间是恒定的边缘,这如下。 (请注意,我们当然失去多个parsings,但这不是要求无论是。) 我认为这实际上是所谓的正规语言的成员资格问题,但我无法找到一个合适的在线参考。

Now, there are as many executions of this cycle as there are input symbols. The only thing we have to examine is that steps 3 and 5 take constant time. Given that the size of S and R is not greater than the number of states in the automaton, which is constant and that we can store edges in a way such that lookup time is constant, this follows. (Note that we of course lose multiple 'parsings', but that was not a requirement either.) I think this is actually called the membership problem for regular languages, but I couldn't find a proper online reference.

这篇关于从长字符串标记化有效字的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆