Extracting strings from logs with regex in a Pig script
Question
I have log data and I want to extract each piece of information into its own variable.
The following is a sample log line:
{:id=>306, :name=>"bblite", :cpu=>{:quota=>4, :allocated=>4, :actual=>0}, :memory=>{:quota=>8192, :allocated=>8192, :actual=>8578}, :cluster_stats=>{"wc1104"=>{:cpu=>0, :mem=>8578}}}
I need one variable that holds all the ids, one that holds all the names, one that holds the CPU values, and one that holds all the cluster stats.
The following is a portion of my Pig script. I can extract the ids, but I have no idea how to extract the remaining fields using regex.
. .
matching_messages = FILTER raw_lines BY (LOWER(message) MATCHES '.*cc_altus-plaform.*');
ids = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'id=>\\d*',0);
names = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'name=>\\"\\",',0);
line_with_date = FOREACH matching_messages GENERATE
DateFormatter(timestamp) AS formatted_time: chararray, message;
DUMP names;
Recommended answer
The following snippet contains the regular expressions I wrote, which work:
-- The lookbehind (?<=id=>) keeps only the digits that follow "id=>".
id = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'(?<=id=>)\\d*',0);
-- Matches name=>"..." where the quoted value consists of word characters.
name = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'name=>\\"[\\w]*\\"',0);
-- Lazy .*? stops at the first closing brace; REPLACE then strips the commas.
cpu = FOREACH matching_messages GENERATE REPLACE(REGEX_EXTRACT(message,'cpu=>\\{.*?\\}',0), ',', '');
memory = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'memory=>\\{.*?\\}',0);
cluster = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'cluster_stats=>\\{.*?\\}',0);
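To see what these patterns capture, they can be tried against the sample line from the question outside of Pig. The sketch below uses Python's `re` module purely as a stand-in. Pig evaluates Java regular expressions, but these particular patterns should behave the same way in both engines. The helper `extract` is a hypothetical name, and it assumes the pattern may match anywhere in the message, which is what the patterns above rely on.

```python
import re

# Sample log line copied from the question.
line = ('{:id=>306, :name=>"bblite", :cpu=>{:quota=>4, :allocated=>4, :actual=>0}, '
        ':memory=>{:quota=>8192, :allocated=>8192, :actual=>8578}, '
        ':cluster_stats=>{"wc1104"=>{:cpu=>0, :mem=>8578}}}')


def extract(pattern, text):
    """Hypothetical helper: return the whole match (group 0) or None.

    Assumption: the pattern is allowed to match anywhere in the text.
    """
    m = re.search(pattern, text)
    return m.group(0) if m else None


# Pattern from the question: it only matches an empty name (name=>"",),
# so it finds nothing in the sample line.
name_from_question = extract(r'name=>\"\",', line)

# Patterns from the answer (Pig's doubled backslashes become single ones here).
id_ = extract(r'(?<=id=>)\d*', line)
name = extract(r'name=>\"[\w]*\"', line)
cpu = extract(r'cpu=>\{.*?\}', line).replace(',', '')
memory = extract(r'memory=>\{.*?\}', line)
cluster = extract(r'cluster_stats=>\{.*?\}', line)

for label, value in [('id', id_), ('name', name), ('cpu', cpu),
                     ('memory', memory), ('cluster', cluster)]:
    print(label, '->', value)
```

One thing this makes visible: because `.*?` is lazy, the `cluster_stats` pattern stops at the first `}`, so for the sample line the extracted value ends at the inner closing brace and would only cover the first entry if several clusters were listed. If the full nested block is needed, the pattern would have to be adjusted for the nesting.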