Match overlapping patterns with capture using a MATLAB regular expression
Question
I'm trying to parse a log file that looks like this:
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
...
This excerpt contains two time periods I'd like to extract: from the first delimiter to the second, and from the second to the third. I'd like to use a regular expression to extract the start and stop times for each of these intervals. This mostly works:
p = '%{4} (?<start>.*?)\n% Starting (?<name>.*?)\n.*?%{4} (?<stop>.*?)\n';
times = regexp(c,p,'names');
This returns:
times =
1x16 struct array with fields:
start
name
stop
The problem is that this only captures every other period, since the second delimiter is consumed as part of the first match.
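A small illustration of why every other period is skipped: regexp reports non-overlapping matches and resumes scanning after the end of the previous match, so text consumed by one match cannot take part in the next. The string below is made up purely for the demonstration.

```matlab
% After matching '1-2', the scan resumes at the second '-',
% so the overlapping candidate '2-3' is never reported.
s = '1-2-3-4';                      % hypothetical demo string
m = regexp(s, '\d-\d', 'match');    % m is {'1-2', '3-4'}
```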
In other languages, you can use lookaround operators (lookahead, lookbehind) to solve this problem. The documentation on regular expressions explains how these work in MATLAB, but I haven't been able to get them to work while still capturing the matches. That is, I need not only to match every delimiter but also to extract part of that match (the timestamp).
Is this possible?
P.S. I realize that, if there's no way to get this to work, I can solve the problem by writing a simple state machine or by matching on the delimiters and post-processing.
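For reference, the "match on the delimiters and post-process" fallback might look roughly like the sketch below. It assumes that every delimiter except the last is followed by a "% Starting" line; the variable names (delims, labels, periods) are illustrative only, and the sample text is rebuilt by hand from the excerpt above.

```matlab
% Rebuild the sample log; sprintf reuses the '%s\n' format for each argument.
c = sprintf('%s\n', ...
    '%%%% 09-May-2009 04:10:29', '% Starting foo', 'this is stuff', 'to ignore', ...
    '%%%% 09-May-2009 04:10:50', '% Starting bar', 'more stuff', 'to ignore', ...
    '%%%% 09-May-2009 04:11:29');

% Match the timestamps and the names separately, so nothing has to overlap.
delims = regexp(c, '%{4} ([^\n]*)', 'tokens');       % one token per delimiter line
labels = regexp(c, '% Starting ([^\n]*)', 'tokens'); % one token per "Starting" line
delims = [delims{:}];                                % flatten to a cell array of char
labels = [labels{:}];

% Pair each delimiter with the next one: stop(k) is start(k+1).
n = numel(delims) - 1;
periods = struct('start', delims(1:n), 'name', labels(1:n), 'stop', delims(2:n+1));
```

Because the start and stop times come from two slices of the same list, no delimiter needs to be matched twice.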
Update: Thanks for the workaround ideas, everyone. I heard from the developer, and there's currently no way to do this with the regular expression engine in MATLAB.
Recommended answer
MATLAB seems unable to capture characters as a token without consuming them from the string (or, I should say, I was unable to do so using MATLAB REGEXP). However, since the stop time for one block of text equals the start time of the next, I was able to capture just the start times and the names using REGEXP, then do some simple processing to get the stop times from the start times. I used the following sample text:
c =
%%%% 09-May-2009 04:10:29
% Starting foo
this is stuff
to ignore
%%%% 09-May-2009 04:10:50
% Starting bar
more stuff
to ignore
%%%% 09-May-2009 04:11:29
some more junk
...and applied the following expression:
p = '%{4} (?<start>[^\n]*)\n% Starting (?<name>[^\n]*)[^%]*|%{4} (?<start>[^\n]*).*';
The processing can then be done with the following code:
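names = regexp(c,p,'names');                 % one entry per delimiter line
[names.stop] = deal(names(2:end).start,[]);  % each stop time is the next entry's start time
names = names(1:end-1);                      % drop the entry for the final delimiter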
...which gives us these results for the above sample text:
>> names(1)
ans =
start: '09-May-2009 04:10:29'
name: 'foo'
stop: '09-May-2009 04:10:50'
>> names(2)
ans =
start: '09-May-2009 04:10:50'
name: 'bar'
stop: '09-May-2009 04:11:29'