Hive RegexSerDe Multiline Log Matching


Question

I am looking for a regex that can be supplied to Hive in the form:

"input.regex"="the regex goes here"

The logs in the files that the RegexSerDe must read have the following form:

2013-02-12 12:03:22,323 [DEBUG] 2636hd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks. This one does not have a linebreak. It just has spaces on the same line.
2013-02-12 12:03:24,527 [DEBUG] 265y7d3e-432g-dfg3-dwq3-y4dsfq3ew91b Some other message that can contain any special character, including linebreaks. This one does not have one either. It just has spaces on the same line.
2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks.
 This is a special one.
 This has a message that is multi-lined.
 This is line number 4 of the same log.
 Line 5.
2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log
2013-02-12 12:03:25,121 [DEBUG] 263tgd3e-432g-dfg3-dwq3-y4dsfq3ew91b Yet another one line log.

I am using the following CREATE EXTERNAL TABLE statement:

CREATE EXTERNAL TABLE applogs (logdatetime STRING, logtype STRING, requestid STRING, verbosedata STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
(
"input.regex" = "(\\A[[0-9:-] ]{19},[0-9]{3}) (\\[[A-Z]*\\]) ([0-9a-z-]*) (.*)?(?=(?:\\A[[0-9:-] ]{19},[0-9]|\\z))",
"output.format.string" = "%1$s \\[%2$s\\] %3$s %4$s"
)
STORED AS TEXTFILE
LOCATION 'hdfs:///logs-application';

Here's the thing:

It is able to pull the first line of every log entry, but not the remaining lines of entries that span more than one line. I tried all the links, replaced \z with \Z at the end, replaced \A with ^, and replaced \Z or \z with $. Nothing worked. Am I missing something in the %4$s of output.format.string, or am I not using the regex properly?

What the regex does:

It matches the timestamp first, followed by the log type (DEBUG, INFO, or whatever), then the ID (a mix of lowercase letters, digits, and hyphens), followed by anything at all until the next timestamp is found, or until the end of input is reached so that the last log entry is matched too. I also tried adding /m at the end, in which case the generated table has all NULL values.
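Two details in the description above can be checked in plain Java, outside Hive. The sketch below shows that `.` does not cross a line break unless DOTALL is enabled with the inline flag `(?s)`, and that a trailing `/m` is treated as literal text rather than as a flag. The class name `RegexFlagsDemo`, the helper `matchesWhole`, and the shortened sample entry are made up for illustration. The sketch only demonstrates `java.util.regex` behaviour and does not model how Hive hands rows to the SerDe.

```java
import java.util.regex.Pattern;

public class RegexFlagsDemo {
    // Hypothetical helper: true only if the WHOLE input matches the pattern.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Shortened two-line entry, invented for this example.
        String entry = "[ERROR] first line\n second line";

        // By default '.' stops at a line break, so (.*) cannot span lines.
        System.out.println(matchesWhole("\\[[A-Z]*\\] (.*)", entry));       // false

        // The inline flag (?s) (DOTALL) lets '.' match line terminators too.
        System.out.println(matchesWhole("(?s)\\[[A-Z]*\\] (.*)", entry));   // true

        // A trailing "/m" is not a flag in Java: it must appear literally in the input.
        System.out.println(matchesWhole("(?s)\\[[A-Z]*\\] (.*)/m", entry)); // false
    }
}
```

If the same rule applies inside the SerDe, a pattern ending in `/m` would fail on every row, which would be consistent with the all-NULL table described above.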

Recommended answer

The following Java regex may help:

(\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2},\d{1,3})\s+(\[.+?\])\s+(.+?)\s+([\s\S]+?)(?=\d{4}-\d{1,2}-\d{1,2}|\Z)

Breakdown:

  • 1st capturing group (\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{1,2}:\d{1,2},\d{1,3})
  • 2nd capturing group (\[.+?\])
  • 3rd capturing group (.+?)
  • 4th capturing group ([\s\S]+?)

(?=\d{4}-\d{1,2}-\d{1,2}|\Z) Positive lookahead: asserts that the regex below can be matched. 1st alternative: \d{4}-\d{1,2}-\d{1,2}. 2nd alternative: \Z asserts the position at the end of the string.

Reference: http://regex101.com/
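To see the four capturing groups in action, here is a minimal Java sketch that applies the answer's regex (with `[\s\S]` for the fourth group, as listed in the breakdown) to a shortened version of the sample log. The class name `MultilineLogDemo`, the method `parse`, and the trimmed sample text are made up for illustration. It assumes the whole log text is available as a single string; whether Hive's RegexSerDe ever receives more than one physical line per record is a separate question that this sketch does not answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineLogDemo {
    // The answer's regex, with backslashes doubled for a Java string literal.
    static final Pattern ENTRY = Pattern.compile(
        "(\\d{4}-\\d{1,2}-\\d{1,2}\\s+\\d{1,2}:\\d{1,2}:\\d{1,2},\\d{1,3})"
        + "\\s+(\\[.+?\\])\\s+(.+?)\\s+([\\s\\S]+?)"
        + "(?=\\d{4}-\\d{1,2}-\\d{1,2}|\\Z)");

    // Hypothetical helper: returns {timestamp, level, requestId, message} per entry.
    static List<String[]> parse(String text) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            // trim() drops the line break left before the next timestamp.
            rows.add(new String[] {m.group(1), m.group(2), m.group(3), m.group(4).trim()});
        }
        return rows;
    }

    public static void main(String[] args) {
        // Shortened sample: one multi-line entry followed by a one-line entry.
        String text =
            "2013-02-12 12:03:24,946 [ERROR] 261rtd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message.\n"
          + " This is a special one.\n"
          + " Line 5.\n"
          + "2013-02-12 12:03:24,988 [INFO] 2632323e-432g-dfg3-dwq3-y4dsfq3ew91b Another 1-line log\n";
        for (String[] row : parse(text)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

One caveat follows from the lookahead itself: it only checks for a date of the form \d{4}-\d{1,2}-\d{1,2}, so a message that contains such a date in its body would be cut short at that point.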
