匹配令牌序列 [英] Matching token sequences
问题描述
我有一组n个令牌(例如a,b,c)分布在其他令牌中.我想知道我集合中的所有成员是否都出现在给定的位置数(窗口大小)内.在我看来,可能可以编写RegEx来捕获此状态,但是确切的语法使我难以理解.
I have a set of n tokens (e.g., a, b, c) distributed among a bunch of other tokens. I would like to know if all members of my set occur within a given number of positions (window size). It occurred to me that it may be possible to write a RegEx to capture this state, but the exact syntax eludes me.
11111
012345678901234
ab ab bc a cba
在此示例中,给定窗口大小= 5,我想在位置12-14处匹配cba
,在位置3-7处匹配abc
.
In this example, given window size=5, I would like to match cba
at positions 12-14, and abc
in positions 3-7.
是否可以使用RegEx做到这一点,或者可以使用其他语法来捕获这种逻辑?
Is there a way to do this with RegEx, or is there some other kind of grammar that I can use to capture this logic?
我希望用Java来实现.
I am hoping to implement this in Java.
推荐答案
这是一个正则表达式,可匹配包含所有'a','b'和'c'的5个字母的序列:
Here's a regex that matches 5-letter sequences that include all of 'a', 'b' and 'c':
(?=.{0,4}a)(?=.{0,4}b)(?=.{0,4}c).{5}
因此,虽然基本上匹配任何5个字符(使用.{5}
),但必须遵守三个先决条件.它们每个都需要显示一个令牌/字母(最多4个字符,后跟"a"等). (?=X)
匹配"X,且正向宽度为零",零宽度表示匹配时字符位置不会移动.
So, while basically matching any 5 characters (with .{5}
), there are three preconditions the matches have to observe. Each of them requires one of the tokens/letters to be present (up to 4 characters followed by 'a', etc.). (?=X)
matches "X, with a zero-width positive look-ahead", where zero-width means that the character position is not moved while matching.
使用正则表达式执行此操作的速度很慢..这是一个更直接的版本(似乎比使用正则表达式快15倍):
Doing this with regexes is slow, though.. Here's a more direct version (seems about 15x faster than using regular expressions):
public static void find(String haystack, String tokens, int windowLen) {
char[] tokenChars = tokens.toCharArray();
int hayLen = haystack.length();
int pos = 0;
nextPos:
while (pos + windowLen <= hayLen) {
for (char c : tokenChars) {
int i = haystack.indexOf(c, pos);
if (i < 0) return;
if (i - pos >= windowLen) {
pos = i - windowLen + 1;
continue nextPos;
}
}
// match found at pos
System.out.println(pos + ".." + (pos + windowLen - 1) + ": " + haystack.substring(pos, pos + windowLen));
pos++;
}
}
这篇关于匹配令牌序列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!