How to parse this custom log file in Python

Problem description

I am using Python logging to generate log files during processing, and I am trying to read those log files into a list/dict, which will then be converted to JSON and loaded into a NoSQL database for processing.

The file is generated in the following format:

2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:46:56,645 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:47:46,488 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/
Traceback (most recent call last):
  File "<ipython-input-16-132cda1c011d>", line 10, in <module>
    if numFilesDownloaded == 0:
NameError: name 'numFilesDownloaded' is not defined
2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:49:32,160 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:49:39,329 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:53:30,706 - __main__ - INFO - Starting to Wait for Files

NOTE: There is actually a \n line break before each new date you see, but I can't seem to represent it here.
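For context, a layout like the one above is what Python's logging module emits with a four-field format string. The question does not show the actual configuration, so the format string and the in-memory stream below are assumptions; this is only a sketch of how such lines, including a multi-line traceback, come about.

```python
import io
import logging

# Assumed configuration: the question does not show the real one.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
)

logger = logging.getLogger("__main__")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output away from the root logger

logger.info("Starting to Wait for Files")
try:
    numFilesDownloaded  # undefined on purpose, to trigger a NameError
except NameError:
    # logger.exception logs at ERROR level and appends the traceback,
    # which is where the extra undated lines come from.
    logger.exception("Failed: Waiting for files")

output = stream.getvalue()
print(output)
```

Under this assumed format each record has four fields separated by " - " (timestamp, logger name, level, message), and only the first line of a record starts with a timestamp.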

Basically, I am trying to read in this text file and produce JSON objects that look like this:

{
    'Date': '2015-05-22 16:46:46,985',
    'Type': 'INFO',
    'Message':'Starting to Wait for Files'
}
...

{
    'Date': '2015-05-22 16:48:48,180',
    'Type': 'ERROR',
    'Message':'Failed: Waiting for files the Files from Cloud Storage:  gs://folder/anotherfolder/ Traceback (most recent call last):
               File "<ipython-input-16-132cda1c011d>", line 10, in <module> if numFilesDownloaded == 0: NameError: name 'numFilesDownloaded' is not defined '
}

The problem I am having:

I can add each line to a list or dict, but the ERROR message sometimes spans multiple lines, so I end up splitting it incorrectly.

What I have tried:

I have tried code like the one below to split the lines only on valid dates, but I can't seem to capture the error messages that span multiple lines. I also tried regular expressions and think they are a possible solution, but I can't find the right regex to use. I have no clue how they work, so I tried a bunch of copy-pasted patterns without any success.

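import itertools as it

listNew = []
with open(filename,'r') as f:
    # key is True for runs of consecutive lines that start with '2015'
    for key,group in it.groupby(f,lambda line: line.startswith('2015')):
        if key:
            for line in group:
                listNew.append(line)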

I tried some crazy regex, but had no luck there either:

logList = re.split(r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])', fileData)
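One reason a split like this misbehaves is that re.split includes the text of every capturing group in its result, so each date is broken into pieces rather than kept with its line. A minimal sketch of an alternative, assuming Python 3.7 or later (where re.split can split on zero-width matches) and a shortened sample of the log above: split before each timestamp with a lookahead so nothing is consumed. The names `shredded`, `record_start` and `records` are illustrative only.

```python
import re

fileData = (
    "2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files\n"
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files\n"
    "Traceback (most recent call last):\n"
    "NameError: name 'numFilesDownloaded' is not defined\n"
)

# Capturing groups end up in the result, so the date is broken into pieces.
date_groups = r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'
shredded = re.split(date_groups, fileData)

# Zero-width lookahead: split *before* each timestamp without consuming it.
record_start = re.compile(
    r'^(?=\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d{3} - )', re.MULTILINE
)
# The match at position 0 produces a leading empty string, so drop empties.
records = [chunk for chunk in record_start.split(fileData) if chunk]

for record in records:
    print(repr(record))
```

Each element of `records` then holds one complete log entry, with any traceback lines still attached to the entry they belong to.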

I would appreciate any help. Thanks.

I have posted a solution below for anyone else struggling with the same thing.

Recommended answer

Using @Joran Beasley's answer, I came up with the following solution, and it seems to work:

Key points:

  • My log files ALWAYS follow the same structure: {Date} - {Type} - {Message}, so I used string slicing and splitting to break the items up the way I needed them. For example, the {Date} is always 23 characters and I only want the first 19 characters.
  • Using line.startswith("2015") is crazy, as dates will change eventually, so I created a new function that uses a regex to match the date format I am expecting. Once again, my log dates follow a specific pattern, so I could be specific.
  • The file is read into the first function, "generateDicts()", which then calls the "matchDate()" function to see whether the line being processed matches the {Date} format I am looking for.
  • A new dict is created every time a valid {Date} format is found, and everything is added to it until the next valid {Date} is encountered.
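
A function that checks whether the line being processed starts with a {Date} in the format I am looking for:

import re

def matchDate(line):
    """Return the timestamp at the start of the line, or None if there is none."""
    matched = re.match(r'\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d', line)
    return matched.group() if matched else None

def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if matchDate(line):
            # A new record starts here, so emit the previous one first.
            if currentDict:
                yield currentDict
            # Fields in the sample log: date - logger name - type - message.
            # Splitting on " - " leaves the hyphens inside the date alone.
            date, _name, level, text = line.split(" - ", 3)
            currentDict = {"date": date[:19], "type": level, "text": text}
        elif currentDict:
            # Continuation line (e.g. a traceback): append it to the message.
            currentDict["text"] += line
    if currentDict:
        yield currentDict

with open("/Users/stevenlevey/Documents/out_folder/out_loyaltybox/log_CardsReport_20150522164636.logs") as f:
    listNew = list(generateDicts(f))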
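To get from parsed records to the JSON the question asks for, the same grouping idea can be run over a few sample lines and passed to json.dumps. This is a minimal, self-contained sketch rather than the answer's exact code: `parse_records` and `RECORD_START` are hypothetical names, the four-field layout is taken from the sample log in the question, and the keys follow the desired output shown there.

```python
import io
import json
import re

# Hypothetical names; the layout "date - logger name - type - message"
# is assumed from the sample log in the question.
RECORD_START = re.compile(r"\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d{3} - ")

def parse_records(lines):
    """Yield one dict per log record, folding traceback lines into the message."""
    current = None
    for line in lines:
        if RECORD_START.match(line):
            if current is not None:
                yield current
            date, _name, level, message = line.rstrip("\n").split(" - ", 3)
            current = {"Date": date, "Type": level, "Message": message}
        elif current is not None:
            # Undated line: it belongs to the record that is still open.
            current["Message"] += "\n" + line.rstrip("\n")
    if current is not None:
        yield current

sample = io.StringIO(
    "2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files\n"
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files\n"
    "Traceback (most recent call last):\n"
    "NameError: name 'numFilesDownloaded' is not defined\n"
)
records = list(parse_records(sample))
print(json.dumps(records, indent=4))
```

Keeping the newlines inside "Message" is a design choice: json.dumps escapes them as \n, so the traceback survives a round trip through JSON unchanged.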
