What should the grok pattern be for these logs? (ingest pipeline for Filebeat)
Problem description
I'm new to the Elasticsearch community and would like your help with something I'm struggling with. My goal is to send a large quantity of log files to Elasticsearch using Filebeat. To do that, I need to parse the data using an ingest node with the grok processor. Without that, my logs are not usable, because each line ends up whole in the same "message" field. Unfortunately, I have some issues with the grok regex and I can't find the problem, as this is the first time I have worked with it. My logs look like this:
2016-09-01T10:58:41+02:00 INFO (6): 165.225.76.76 entreprise1 email1@gmail.com POST /application/controller/action Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko {"getid":"1"} 86rkt2dqsdze5if1bqldfl1
2016-09-01T10:58:41+02:00 INFO (6): 165.225.76.76 entreprise2 email2@gmail.com POST /application/controller/action Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko {"getid":"2"} 86rkt2rgdgdfgdfgeqldfl1
2016-09-01T10:58:41+02:00 INFO (6): 165.225.76.76 entreprise3 email3@gmail.com POST /application/controller/action Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko {"getid":"2"}
So we have tabs as separators, and these fields: date, ip, company_name, email, method (POST, GET), url, browser, json_request, optional_code
My ingest pipeline JSON looks like this:
PUT _ingest/pipeline/elastic_log_index
{
  "description" : "Convert logs txt files",
  "processors" : [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{TIMESTAMP_ISO8601:timestamp} %{IP:ip} %{WORD:company}% {EMAILADDRESS:email} %{URIPROTO:method} %{URIPATH:page} %{WORD:browser} %{WORD:code}"]
      }
    },
    {
      "date" : {
        "field" : "timestamp",
        "formats" : ["yyyy-MM-ddTHH:mm:ss INFO(6):"]
      }
    }
  ],
  "on_failure" : [
    {
      "set" : {
        "field" : "error",
        "value" : " - Error processing message - "
      }
    }
  ]
}
This does not work.
1) How can I escape characters? For example, "INFO (6):" after the timestamp.
2) Can I just use spaces between fields in my grok pattern? The separators in the log files are tabs.
3) The code at the end of each line is not always present in the logs. Can this be a problem?
4) Do you have any idea why this configuration does not parse my log documents at all in Elasticsearch?
Thanks a lot for your help, and please excuse my English; I'm French.
Recommended answer
Your grok pattern doesn't match everything in your log line, which is why it doesn't work. For instance, %{WORD} will only match Mozilla, not /5.0. You can create a custom pattern to match the entire browser/version, like this: (?<browser>%{WORD}(/%{NUMBER})?).
You don't need to escape INFO (6): at all. Simply match it with .* and it will be skipped and left out of the output.
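If you would rather match INFO (6): literally than skip over it, the parentheses need a backslash in the regex, and each backslash has to be doubled once the pattern sits inside the JSON body of the pipeline. A minimal Python sketch of both points follows; the sample text and variable names are illustrative only, and Python's re module stands in for grok's regex engine here.

```python
import json
import re

# Hypothetical fragment of a log line, used only for illustration.
sample = "2016-09-01T10:58:41+02:00 INFO (6): 165.225.76.76"

# Parentheses are regex metacharacters, so they are escaped with a backslash.
literal = r"INFO \(6\):"
assert re.search(literal, sample) is not None

# Unescaped, "(6)" is a capture group that matches "6", so the pattern
# looks for the text "INFO 6:" and finds nothing in this sample.
assert re.search("INFO (6):", sample) is None

# Inside a JSON string every backslash must itself be escaped.
json_fragment = json.dumps({"patterns": [literal]})
print(json_fragment)  # {"patterns": ["INFO \\(6\\):"]}
```

The same doubling applies to the \( \) \{ \} escapes in any grok pattern you paste into a pipeline definition.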
As for the spaces, match them using the predefined grok pattern %{SPACE}, which matches any run of whitespace, tabs included.
The code at the end can be made optional with a custom pattern, i.e. (?<optional_code>%{WORD}?).
Your entire grok pattern will then become:
%{TIMESTAMP_ISO8601:timestamp}.*%{IP:ip}%{SPACE}%{WORD:company_name}%{SPACE}%{EMAILADDRESS:email}%{SPACE}%{URIPROTO:method}%{SPACE}%{URIPATH:page}%{SPACE}(?<browser>%{WORD}(/%{NUMBER})?)%{SPACE}\(%{GREEDYDATA:content}\).*\{%{GREEDYDATA:json}\}%{SPACE}(?<optional_code>%{WORD}?)
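To see how the pieces of this pattern interact, here is a rough Python sketch that mirrors it with an ordinary regular expression. The sub-expressions are simplified stand-ins for the grok definitions (assumptions, not the patterns that ship with Elasticsearch), the names LINE_RE and parse_line are hypothetical, and the sample lines assume the fields after "INFO (6):" are tab-separated as described in the question.

```python
import re
from typing import Dict, Optional

# Simplified equivalents of the grok patterns, in the same order as above.
LINE_RE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}[+-]\d{2}:\d{2})"
    r".*"  # skips " INFO (6): "
    # The lookarounds stop the greedy .* from eating leading digits of the IP.
    r"(?<![0-9])(?P<ip>\d{1,3}(?:\.\d{1,3}){3})(?![0-9])"
    r"\s*(?P<company_name>\w+)"
    r"\s*(?P<email>[A-Za-z][\w.+=:-]+@[\w.-]+)"
    r"\s*(?P<method>[A-Za-z]+)"
    r"\s*(?P<page>/\S*)"
    r"\s*(?P<browser>\w+(?:/\d+(?:\.\d+)?)?)"
    r"\s*\((?P<content>.*)\)"
    r".*\{(?P<json>.*)\}"
    r"\s*(?P<optional_code>\w*)"  # may be empty when the code is absent
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of one log line, or None if it does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None


PREFIX = "2016-09-01T10:58:41+02:00 INFO (6): "
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"

with_code = PREFIX + "\t".join([
    "165.225.76.76", "entreprise1", "email1@gmail.com", "POST",
    "/application/controller/action", USER_AGENT, '{"getid":"1"}',
    "86rkt2dqsdze5if1bqldfl1",
])
without_code = PREFIX + "\t".join([
    "165.225.76.76", "entreprise3", "email3@gmail.com", "POST",
    "/application/controller/action", USER_AGENT, '{"getid":"2"}',
])

for sample in (with_code, without_code):
    print(parse_line(sample))
```

Both samples match; for the line without a trailing code the optional_code group is simply empty. As in the grok version, only the first token of the user agent lands in browser and the parenthesised part lands in content.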
It will output:
{
"timestamp": [
[
"2016-09-01T10:58:41+02:00"
]
],
"YEAR": [
[
"2016"
]
],
"MONTHNUM": [
[
"09"
]
],
"MONTHDAY": [
[
"01"
]
],
"HOUR": [
[
"10",
"02"
]
],
"MINUTE": [
[
"58",
"00"
]
],
"SECOND": [
[
"41"
]
],
"ISO8601_TIMEZONE": [
[
"+02:00"
]
],
"ip": [
[
"165.225.76.76"
]
],
"IPV6": [
[
null
]
],
"IPV4": [
[
"165.225.76.76"
]
],
"SPACE": [
[
" ",
" ",
" ",
" ",
" ",
" ",
" "
]
],
"company_name": [
[
"entreprise1"
]
],
"email": [
[
"email1@gmail.com"
]
],
"EMAILLOCALPART": [
[
"email1"
]
],
"HOSTNAME": [
[
"gmail.com"
]
],
"method": [
[
"POST"
]
],
"page": [
[
"/application/controller/action"
]
],
"browser": [
[
"Mozilla/5.0"
]
],
"WORD": [
[
"Mozilla",
"86rkt2dqsdze5if1bqldfl1"
]
],
"NUMBER": [
[
"5.0"
]
],
"BASE10NUM": [
[
"5.0"
]
],
"content": [
[
"Windows NT 6.1; Trident/7.0; rv:11.0"
]
],
"json": [
[
"\"getid\":\"1\""
]
],
"optional_code": [
[
"86rkt2dqsdze5if1bqldfl1"
]
]
}
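One way to put the pattern back into the pipeline from the question is to build the request body in Python and let json.dumps take care of the backslash escaping. This is only a sketch: the pipeline layout and the on_failure handler are copied from the question, while the "ISO8601" date format is my assumption (the answer does not discuss the date processor), chosen because the captured timestamp is already an ISO 8601 string.

```python
import json

# The grok pattern from the answer, split over several lines for readability.
GROK_PATTERN = (
    r"%{TIMESTAMP_ISO8601:timestamp}.*%{IP:ip}%{SPACE}%{WORD:company_name}"
    r"%{SPACE}%{EMAILADDRESS:email}%{SPACE}%{URIPROTO:method}%{SPACE}"
    r"%{URIPATH:page}%{SPACE}(?<browser>%{WORD}(/%{NUMBER})?)%{SPACE}"
    r"\(%{GREEDYDATA:content}\).*\{%{GREEDYDATA:json}\}%{SPACE}"
    r"(?<optional_code>%{WORD}?)"
)

pipeline = {
    "description": "Convert logs txt files",
    "processors": [
        {"grok": {"field": "message", "patterns": [GROK_PATTERN]}},
        # Assumption: parse the captured timestamp as ISO 8601.
        {"date": {"field": "timestamp", "formats": ["ISO8601"]}},
    ],
    "on_failure": [
        {"set": {"field": "error", "value": " - Error processing message - "}}
    ],
}

# Body to send with: PUT _ingest/pipeline/elastic_log_index
body = json.dumps(pipeline, indent=2)
print(body)
```

In the printed body the \( \) \{ \} escapes appear with doubled backslashes, which is what the JSON parser on the Elasticsearch side expects.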
When testing online, add custom patterns for email, as they are currently not supported there:
EMAILLOCALPART [a-zA-Z][a-zA-Z0-9_.+-=:]+
EMAILADDRESS %{EMAILLOCALPART}@%{HOSTNAME}