Very specific substring retrieval and split


Question

I know there are tons of posts about substrings; believe me, I have searched through many of them looking for an answer to this.

I have many strings, lines from a log, and I am trying to categorize and parse them.

They look like this:

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 TYPE Static Message;Dynamic Message

Here the filename is the file where the log is located, the date is the date/time the message was put into the log, and TYPE is the type of message. The message itself is composed of two parts, a static part and a dynamic part. The static part does not change for a given message, while the dynamic part can change. The two are separated by a ; but there can be more ; in the dynamic part.

I want to be able to extract the Static Message and the Dynamic Message.

So far I have been using something like this:

parts = line.split(";")
static = parts[0]
dynamic = ";".join(parts[1:])

Not very pretty. Also, my static part then contains the filename, the date, and the type, which I do not want. So then I thought I would do something like this:

parts = " ".join(":".join(line.split(":")[1:]).split(" ")[4:]).split(";")
static = parts[0]
dynamic = ";".join(parts[1:])

I have tried this, and it works to some extent, except that sometimes the filename might have a space, or the TYPE might have a space, or something else isn't working properly, and I sometimes get the TYPE as part of the static message. Efficiency is an issue, since these are thousands of lines of logs which must be parsed and categorized daily. So I am wondering whether there is a better way to do this than this hack job?

Edit: I thought I would provide more examples of lines in the log. To correct what I said earlier, there are a few types of entries.

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 TYPE Static Message;Dynamic Message

/long/file/name/with.dots.and.extension:Jan 01 12:00:00 MODULE.NAME TYPE THREAD.OR.CONNECTION.INFORMATION Static Message;Dynamic Message

So as you can see, there are two types of log entries: those without modules and those with. Entries with modules can be tied either to connections or to threads. This makes the parsing harder.

Recommended answer

You can limit the split to the first ';' only:

static, dynamic = line.split(';', 1)
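
As an illustration, here is a minimal sketch using the sample line from the question with extra ';' segments appended to the dynamic part (those extra segments are made up for the example). `str.partition` is an alternative that is not part of the answer above; it may be handy if some lines contain no ';' at all, since it does not raise on unpacking.

```python
line = ("/long/file/name/with.dots.and.extension:Jan 01 12:00:00 "
        "TYPE Static Message;Dynamic Message;more;parts")

# maxsplit=1 splits at the first ';' only, so any later ';' stays in the dynamic part.
static, dynamic = line.split(";", 1)

# partition() returns (head, separator, tail); separator and tail are empty
# strings when the line has no ';', so the unpacking never fails.
head, sep, tail = line.partition(";")
```

Note that `static` here still carries the filename, date, and TYPE prefix; only the split between static and dynamic is handled at this point.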

Splitting your static part might take a little more doing, but if you know that the number of spaces in the first part is fixed, perhaps the same trick could work there:

static = static.split(' ', 4)[-1]
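
A short sketch of what this does on the sample line from the question, and of how it goes wrong when the prefix contains an extra space. The second input, with a space in the filename, is a made-up example of the failure the question describes.

```python
static = "/long/file/name/with.dots.and.extension:Jan 01 12:00:00 TYPE Static Message"

# The prefix consists of four space-separated fields:
# "<file>:Jan", "01", "12:00:00", "TYPE".
# Splitting on the first four spaces leaves the message as the last element.
message = static.split(" ", 4)[-1]
print(message)  # Static Message

# An extra space in the filename shifts every field by one,
# so the TYPE ends up inside the extracted message.
shifted = "/long/file name.log:Jan 01 12:00:00 TYPE Static Message".split(" ", 4)[-1]
print(shifted)  # TYPE Static Message
```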

If the first part of the line is more complex (spaces in the TYPE part), I fear that removing everything before the message is going to be more difficult. Your best bet is to figure out the limited set of values TYPE can take and to use a regular expression built from that information to split the static part.
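
One way the regular-expression idea could look is sketched below. Everything specific in it is an assumption rather than something stated in the question: the TYPE values (`INFO`, `WARNING`, `ERROR`) are placeholders, the date is assumed to always look like `Jan 01 12:00:00`, and the thread/connection information is assumed to be a single token without spaces that appears exactly when a module name is present. The names `TYPES`, `LINE_RE`, and `parse` are hypothetical.

```python
import re

# Hypothetical TYPE values; replace with the real, limited set found in the logs.
# Longest first, so a longer value wins over a shorter one that is its prefix.
TYPES = sorted(["INFO", "WARNING", "ERROR"], key=len, reverse=True)

LINE_RE = re.compile(
    r"^(?P<file>.+?):"                                    # filename, may contain spaces
    r"(?P<date>[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2}) "   # e.g. "Jan 01 12:00:00"
    r"(?:(?P<module>\S+) )?"                              # optional MODULE.NAME
    r"(?P<type>" + "|".join(map(re.escape, TYPES)) + r") "
    r"(?(module)(?P<thread>\S+) )"                        # thread/connection token, only with a module
    r"(?P<static>[^;]*);"                                 # static message, up to the first ';'
    r"(?P<dynamic>.*)$"                                   # the rest, may contain more ';'
)

def parse(line):
    """Return a dict of the named fields, or None if the line does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None

plain = parse("/long/file name.ext:Jan 01 12:00:00 INFO Static Message;Dynamic;Message")
with_module = parse("/long/file.ext:Jan 01 12:00:00 MODULE.NAME ERROR THREAD.1 Static Message;Dynamic Message")
```

Because the date pattern anchors the end of the filename and the TYPE alternation anchors the start of the message, spaces in the filename no longer shift the fields. Compiling the pattern once and reusing it for every line keeps the per-line cost low, which matters for thousands of lines a day.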
