Python regex conditional substring extraction
Question
I am opening this question because it seems my original question requires a new direction.
I would like to create a regular expression that can extract the STATIC MESSAGE and the DYNAMIC MESSAGE from the following types of log entries:
/long/file/name/with.dots.and.extension:Jan 01 12:00:00 TYPE Static Message;Dynamic Message
/long/file/name/with.dots.and.extension:Jan 01 12:00:00 MODULE.NAME TYPE THREAD.OR.CONNECTION.INFORMATION Static Message;Dynamic Message
One log entry type has a simple structure:
FILE:DATE TYPE STATIC;DYNAMIC
The other is not so simple to parse with a regex:
FILE:DATE MODULE.NAME TYPE CONNECTION.OR.THREAD STATIC;DYNAMIC
where MODULE.NAME and CONNECTION.OR.THREAD are either both present or both absent.
My regular expression so far, which works on the first type of log entry, is:
(?:.*?):(?:\w{3} \d{1,2} \d{1,2}:\d{1,2}:\d{1,2})(?:\s+?)(?:[\S|\.]*?(?:\s*?))?(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+?(?:\s+?))?(.+){1}(?:;(.+)){1}
But whenever I get to the second type of entry, I also get the CONNECTION.OR.THREAD as part of my first capturing group.
I am hoping for a way to use the lookahead or lookbehind feature so that I can capture STATIC and DYNAMIC, and ignore the CONNECTION.OR.THREAD part whenever there is a MODULE.NAME.
I hope this question is clear; please refer to my original question if anything seems unclear. Thank you.
Edit, for clarification: every line of the log is different from the others. Each line starts with a file path, then a :, then the date in the format MMM DD HH:MM:SS. Then it gets tricky: there is either a MODULE.NAME (which varies), followed by the TYPE (which also varies), followed by CONNECTION.OR.THREAD (which varies), or just the TYPE. After that comes the STATIC MESSAGE, then a ;, then the DYNAMIC MESSAGE. Both the static and the dynamic message vary; I use the term STATIC simply because an error can be, for instance, "unable to connect to server; server1.com", so the static part of the error is "unable to connect to server" and the dynamic part is "server1.com".
Recommended answer
At the moment I have made this massive regex:
(?:(?:.*?):(?:\w{3}(?: \d{1,2}){2}(?::\d{1,2}){2}))(?:\s+?)(?:(?:(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:(.+){1};(.+){1}))|(?:(?:\S+(?:\.\S+)+)(?:\s+?)(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+(?:\.\S+)+)(?:\s+?)(?:(.+){1};(.+){1})))
I will break it down into parts:
FILE / DATE + whitespace:
(?:(?:.*?):(?:\w{3}(?: \d{1,2}){2}(?::\d{1,2}){2}))(?:\s+?)
followed by EITHER:
SIMPLE: (TYPE STATIC;DYNAMIC)
(?:(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:(.+){1};(.+){1}))
OR COMPLEX: (MODULE.NAME TYPE CONNECTION.OR.THREAD STATIC;DYNAMIC)
(?:(?:\S+(?:\.\S+)+)(?:\s+?)(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+(?:\.\S+)+)(?:\s+?)(?:(.+){1};(.+){1}))
It does the trick, but it is huge and I think it can be improved. If anyone can improve it, please do.
There is a problem though: there are now 4 capturing groups, so I cannot know ahead of time whether to look in captured[0:2] or captured[2:4] for my results. Does anyone have a way to do this so that I do not have to check each time whether something is there? Or perhaps a way to eliminate empty capturing groups from the results, or to get only the non-empty results from the list? My brain is fried.
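One possible way to end up with only two groups is Python's conditional construct, (?(name)yes-pattern), which matches CONNECTION.OR.THREAD only when the optional MODULE.NAME group took part in the match. This is an alternative sketch, not the regex this answer settles on. It assumes the placeholder types TYPE1/TYPE2/TYPE3, assumes that MODULE.NAME and CONNECTION.OR.THREAD contain at least one dot and no spaces, and assumes the static message contains no ";". The names LOG_LINE and parse are made up for illustration.

```python
import re

# Hypothetical pattern: names and TYPE placeholders are assumptions.
LOG_LINE = re.compile(
    r"""
    ^.*?:                                  # file path up to the first colon
    \w{3}\ \d{1,2}\ \d{1,2}:\d{2}:\d{2}\s+ # date such as Jan 01 12:00:00
    (?:(?P<module>\S+\.\S+)\s+)?           # optional MODULE.NAME
    (?:TYPE1|TYPE2|TYPE3)\s+               # entry type
    (?(module)\S+\.\S+\s+)                 # CONNECTION.OR.THREAD, only if MODULE.NAME matched
    (?P<static>[^;]+);(?P<dynamic>.*)      # STATIC;DYNAMIC
    """,
    re.VERBOSE,
)


def parse(line):
    """Return (static, dynamic) for a recognised log line, else None."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    return match.group("static"), match.group("dynamic")


simple = ("/long/file/name/with.dots.and.extension:Jan 01 12:00:00 "
          "TYPE1 Static Message;Dynamic Message")
complex_ = ("/long/file/name/with.dots.and.extension:Jan 01 12:00:00 "
            "MODULE.NAME TYPE1 THREAD.OR.CONNECTION.INFORMATION "
            "Static Message;Dynamic Message")

print(parse(simple))
print(parse(complex_))
```

Because the two results are named groups, the caller never has to work out which pair of positional groups was filled.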
As @Martijn Pieters suggested, I removed the extraneous grouping. This is my current regex:
.*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?(?:(?:(?:TYPE1|TYPE2|TYPE3)\s+?(.+){1};(.+){1})|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+\s+?(.+){1};(.+){1}))
It works fine. I am concerned about (?:TYPE1|TYPE2|TYPE3) being misinterpreted as TYPE(1|T)YPE(2|T)YPE3; any insight would be appreciated.
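On the precedence concern: in Python's re syntax, | has the lowest precedence, so inside the group it separates the three whole alternatives rather than single characters. A small check, with variable names chosen here purely for illustration:

```python
import re

types = re.compile(r"(?:TYPE1|TYPE2|TYPE3)")
# The feared reading, written out explicitly for comparison.
feared = re.compile(r"TYPE(1|T)YPE(2|T)YPE3")

# Each whole alternative matches on its own.
whole_alternatives = [bool(types.fullmatch(t)) for t in ("TYPE1", "TYPE2", "TYPE3")]

# A string that only the feared reading would accept.
matches_feared_string = bool(types.fullmatch("TYPE1YPE2YPE3"))

print(whole_alternatives)     # [True, True, True]
print(matches_feared_string)  # False
```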
Also, how best to go about parsing my results, seeing as I will get a list of 4 items in which either the first 2 or the last 2 are empty and the other 2 hold my static/dynamic results?
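One way to handle the four groups, sketched under the assumption that each line is matched individually with re.match: groups belonging to the branch that did not match come back as None from Match.groups(), so they can simply be filtered out. (With re.findall the unused groups come back as empty strings instead, so the filter would test for truthiness.) The pattern below is the four-group regex above, split across lines only for readability. The function name static_and_dynamic and the sample lines are made up for illustration.

```python
import re

PATTERN = re.compile(
    r".*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?"
    r"(?:(?:(?:TYPE1|TYPE2|TYPE3)\s+?(.+){1};(.+){1})"
    r"|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+\s+?(.+){1};(.+){1}))"
)


def static_and_dynamic(line):
    """Return (static, dynamic) from whichever branch matched, else None."""
    match = PATTERN.match(line)
    if match is None:
        return None
    # Exactly two of the four groups take part in any successful match.
    static, dynamic = [g for g in match.groups() if g is not None]
    return static, dynamic


# Hypothetical sample lines: the question's examples with TYPE written as TYPE1.
simple = ("/long/file/name/with.dots.and.extension:Jan 01 12:00:00 "
          "TYPE1 Static Message;Dynamic Message")
complex_ = ("/long/file/name/with.dots.and.extension:Jan 01 12:00:00 "
            "MODULE.NAME TYPE1 THREAD.OR.CONNECTION.INFORMATION "
            "Static Message;Dynamic Message")

print(static_and_dynamic(simple))
print(static_and_dynamic(complex_))
```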
Okay, I have settled on a hybrid solution. I have remade my regular expression:
.*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?(?:(?:(?:TYPE1|TYPE2|TYPE3))|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+))\s+(.*)
I now have only 1 capture group, which is the STATIC;DYNAMIC part. Once I have it, I do what I was doing before (see my previous question):
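for item in captured:
    # The first ";" separates static from dynamic; rejoin the rest so that
    # any ";" inside the dynamic message is preserved.
    parts = item.split(";")
    static = parts[0]
    dynamic = ";".join(parts[1:])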
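For reference, the hybrid approach can be run end to end roughly as follows. This is a sketch, not the exact script used here: the sample lines are the question's examples with TYPE written as TYPE1 (an assumption), and str.partition stands in for the split/join pair, since it also splits at the first ";" only.

```python
import re

# The single-capture pattern above, split across lines only for readability.
PATTERN = re.compile(
    r".*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?"
    r"(?:(?:(?:TYPE1|TYPE2|TYPE3))"
    r"|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+))"
    r"\s+(.*)"
)

# Hypothetical sample lines; the second has an extra ";" in the dynamic part.
lines = [
    "/long/file/name/with.dots.and.extension:Jan 01 12:00:00 "
    "TYPE1 Static Message;Dynamic Message",
    "/long/file/name/with.dots.and.extension:Jan 01 12:00:00 "
    "MODULE.NAME TYPE1 THREAD.OR.CONNECTION.INFORMATION "
    "Static Message;Dynamic;Message",
]

results = []
for line in lines:
    match = PATTERN.match(line)
    if match is None:
        continue  # skip lines that do not look like log entries
    # partition splits at the first ";" only, so later ";" stay in dynamic.
    static, _, dynamic = match.group(1).partition(";")
    results.append((static, dynamic))

print(results)
```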
That is my solution. Thank you, @Martijn Pieters especially, for your help. I hope this can help someone in the future.