Python regex conditional substring extraction

Question

I am opening this question because my original question seems to need a new direction.

I would like to create a regular expression that can extract the STATIC MESSAGE and the DYNAMIC MESSAGE from the following types of log entries:

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 TYPE Static Message;Dynamic Message

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 MODULE.NAME TYPE THREAD.OR.CONNECTION.INFORMATION Static Message;Dynamic Message

One type of log entry has a simple structure:

FILE:DATE TYPE STATIC;DYNAMIC

The other is harder to parse with a regex:

FILE:DATE MODULE.NAME TYPE CONNECTION.OR.THREAD STATIC;DYNAMIC

where MODULE.NAME and CONNECTION.OR.THREAD are either both present or both absent.

My regular expression so far, which works on the first type of log entry, is:

(?:.*?):(?:\w{3} \d{1,2} \d{1,2}:\d{1,2}:\d{1,2})(?:\s+?)(?:[\S|\.]*?(?:\s*?))?(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+?(?:\s+?))?(.+){1}(?:;(.+)){1}

But whenever I reach the second type of entry, CONNECTION.OR.THREAD also ends up in my first capturing group.

I am hoping for a way to use lookahead or lookbehind so that I can capture STATIC and DYNAMIC and skip the CONNECTION.OR.THREAD part whenever a MODULE.NAME is present.

I hope this question is clear; please refer to my original question if it is not. Thank you.

Edit, for clarification: every line of the log is different from the others. Each line starts with a file path, then a colon, then the date in the format MMM DD HH:MM:SS. Then it gets tricky: there is either a MODULE.NAME (which varies), followed by the TYPE (which also varies), followed by CONNECTION.OR.THREAD (which varies), or just the TYPE. After that comes the STATIC MESSAGE, then a semicolon, then the DYNAMIC MESSAGE. Both messages vary. I use the term STATIC only because an error can be, for instance, "unable to connect to server; server1.com", where the static part is "unable to connect to server" and the dynamic part is "server1.com".

Recommended answer

At the moment I have this massive regex:

(?:(?:.*?):(?:\w{3}(?: \d{1,2}){2}(?::\d{1,2}){2}))(?:\s+?)(?:(?:(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:(.+){1};(.+){1}))|(?:(?:\S+(?:\.\S+)+)(?:\s+?)(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+(?:\.\S+)+)(?:\s+?)(?:(.+){1};(.+){1})))

I will break it into parts:

FILE:DATE plus whitespace:

(?:(?:.*?):(?:\w{3}(?: \d{1,2}){2}(?::\d{1,2}){2}))(?:\s+?)

Then EITHER:

SIMPLE: (TYPE STATIC;DYNAMIC)

(?:(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:(.+){1};(.+){1}))

OR COMPLEX: (MODULE.NAME TYPE CONNECTION.OR.THREAD STATIC;DYNAMIC)

(?:(?:\S+(?:\.\S+)+)(?:\s+?)(?:(?:TYPE1)|(?:TYPE2)|(?:TYPE3))(?:\s+?)(?:\S+(?:\.\S+)+)(?:\s+?)(?:(.+){1};(.+){1}))

It does the trick, but it is huge and I think it can be improved. If anyone can improve it, please do.

There is a problem, though: there are now 4 capturing groups, so I cannot know ahead of time whether my results are in the first two groups or in the last two. Does anyone have a way to do this so that I do not have to check each time which pair is populated? Or perhaps a way to eliminate empty capturing groups from the results, or to get only the non-empty items from the list of results?

As @Martijn Pieters suggested, I removed the extraneous grouping. This is my current regex:

.*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?(?:(?:(?:TYPE1|TYPE2|TYPE3)\s+?(.+){1};(.+){1})|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+\s+?(.+){1};(.+){1}))

This works fine. I am concerned about (?:TYPE1|TYPE2|TYPE3) being misinterpreted as TYPE(1|T)YPE(2|T)YPE3; any insight would be appreciated.
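On the alternation concern: in Python's re module, | has the lowest precedence inside a group, so each alternative is a complete sequence rather than a single character. The check below is a minimal sketch using the placeholder names TYPE1, TYPE2 and TYPE3 from the question; the test string TYPETYPETYPE3 is made up purely to illustrate the feared reading.

```python
import re

# TYPE1, TYPE2, TYPE3 are the placeholder type names used in the question.
type_pattern = re.compile(r"(?:TYPE1|TYPE2|TYPE3)")

# Each alternative is matched as a whole token.
matches = [bool(type_pattern.fullmatch(s)) for s in ("TYPE1", "TYPE2", "TYPE3")]

# This string would match only under the reading TYPE(1|T)YPE(2|T)YPE3.
misread = bool(type_pattern.fullmatch("TYPETYPETYPE3"))

print(matches)  # [True, True, True]
print(misread)  # False
```

So the group behaves as "TYPE1 or TYPE2 or TYPE3", which is the intended meaning.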

Also, how should I best parse my results, given that I will get a list of 4 items in which either the first 2 or the last 2 are empty and the other pair holds my static/dynamic results?
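One possible way to handle the four groups, sketched here under assumptions: with re.match, groups in the branch that did not participate are None, so they can simply be filtered out. The helper name extract and the two sample lines (which use TYPE1 as a stand-in type) are hypothetical; the pattern is the four-group regex above, split across lines for readability.

```python
import re

# Four-group pattern from the answer above. TYPE1/TYPE2/TYPE3 are the
# question's placeholder type names, not real log values.
PATTERN = re.compile(
    r".*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?"
    r"(?:(?:(?:TYPE1|TYPE2|TYPE3)\s+?(.+){1};(.+){1})"
    r"|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+\s+?(.+){1};(.+){1}))"
)


def extract(line):
    """Return (static, dynamic) from whichever branch matched, or None."""
    m = PATTERN.match(line)
    if m is None:
        return None
    # Groups belonging to the branch that did not match are None; drop them.
    static, dynamic = [g for g in m.groups() if g is not None]
    return static, dynamic


# Hypothetical sample lines modelled on the two entry types in the question.
simple = ("/long/file/name/with.dots.and.extension:Jan 01 12:00:00 "
          "TYPE1 Static Message;Dynamic Message")
complex_ = ("/long/file/name/with.dots.and.extension:Jan 01 12:00:00 "
            "MODULE.NAME TYPE1 THREAD.OR.CONNECTION.INFORMATION "
            "Static Message;Dynamic Message")

print(extract(simple))    # ('Static Message', 'Dynamic Message')
print(extract(complex_))  # ('Static Message', 'Dynamic Message')
```

With re.findall the unused groups come back as empty strings instead of None, so the filter there would be "if g" rather than "if g is not None".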

Okay, I have settled on a hybrid solution. I have rewritten my regular expression:

.*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?(?:(?:(?:TYPE1|TYPE2|TYPE3))|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+))\s+(.*)

I now have only 1 capture group, which is the STATIC;DYNAMIC part. Once I have it, I do what I was doing before (see my previous question):

for item in captured:
    parts = item.split(";")
    static = parts[0]
    dynamic = ";".join(parts[1:])
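For reference, the hybrid approach can be sketched end to end as follows. This is an illustration under assumptions, not the exact code from the post: the helper name split_message and the sample log line are hypothetical, and str.partition is swapped in for the split/join pair because it also keeps any later semicolons in the dynamic part.

```python
import re

# Single-group hybrid pattern from above, split across lines for readability.
# TYPE1/TYPE2/TYPE3 are the question's placeholder type names.
HYBRID = re.compile(
    r".*?:\w{3}(?: \d{1,2}){2}(?::\d{2}){2}\s+?"
    r"(?:(?:(?:TYPE1|TYPE2|TYPE3))"
    r"|(?:\S+(?:\.\S+)+\s+?(?:TYPE1|TYPE2|TYPE3)\s+?\S+(?:\.\S+)+))"
    r"\s+(.*)"
)


def split_message(line):
    """Return (static, dynamic) for a log line, or None if it does not match."""
    m = HYBRID.match(line)
    if m is None:
        return None
    # partition splits on the first ';' only, so later ';' stay in dynamic.
    static, _, dynamic = m.group(1).partition(";")
    return static, dynamic


# Hypothetical log line; the dynamic part deliberately contains a second ';'.
line = ("/var/log/app.log:Jan 01 12:00:00 TYPE2 "
        "unable to connect to server;server1.com;port 80")

print(split_message(line))
# ('unable to connect to server', 'server1.com;port 80')
```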

That is my solution. Thank you, @Martijn Pieters, especially for your help. I hope this can help someone in the future.
