Extracting strings from logs with regex in a Pig script
Question
I have log data and want to extract each piece of information into its own variable.
The following is a sample log line:
{:id=>306, :name=>"bblite", :cpu=>{:quota=>4, :allocated=>4, :actual=>0}, :memory=>{:quota=>8192, :allocated=>8192, :actual=>8578}, :cluster_stats=>{"wc1104"=>{:cpu=>0, :mem=>8578}}}
I need one variable holding all the ids, one holding all the names, one holding the CPU stats, and one holding all the cluster stats.
The following is the relevant portion of my Pig script. I can extract the ids, but I have no idea how to extract the remaining fields using regex.
...
matching_messages = FILTER raw_lines BY (LOWER(message) MATCHES '.*cc_altus-plaform.*');
ids = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'id=>\\d*',0);
names = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'name=>\\"\\",',0);
line_with_date = FOREACH matching_messages GENERATE
DateFormatter(timestamp) AS formatted_time: chararray, message;
DUMP names;
Recommended answer
The following snippet contains the regular expressions I wrote, which work:
id = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'(?<=id=>)\\d*',0);
name = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'name=>\\"[\\w]*\\"',0);
cpu = FOREACH matching_messages GENERATE REPLACE( REGEX_EXTRACT(message, 'cpu=>\\{.*?\\}',0), ',','');
memory = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'memory=>\\{.*?\\}',0);
cluster = FOREACH matching_messages GENERATE REGEX_EXTRACT(message,'cluster_stats=>\\{.*?\\}',0);
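To see what each of these patterns actually returns, here is a minimal sketch that applies them to the sample log line from the question. It is written in Python rather than Pig, on the assumption that Python's `re` module treats these particular patterns the same way as the Java regex engine behind Pig's REGEX_EXTRACT. The `extract` helper and the `LINE`, `PATTERNS`, and `results` names are hypothetical, introduced only for this illustration, and Pig's doubled backslashes are reduced to single ones.

```python
import re

# Sample log line copied from the question.
LINE = (
    '{:id=>306, :name=>"bblite", '
    ':cpu=>{:quota=>4, :allocated=>4, :actual=>0}, '
    ':memory=>{:quota=>8192, :allocated=>8192, :actual=>8578}, '
    ':cluster_stats=>{"wc1104"=>{:cpu=>0, :mem=>8578}}}'
)

# Patterns from the answer, with the Pig string escaping removed.
PATTERNS = {
    "id": r'(?<=id=>)\d*',
    "name": r'name=>"[\w]*"',
    "cpu": r'cpu=>\{.*?\}',
    "memory": r'memory=>\{.*?\}',
    "cluster": r'cluster_stats=>\{.*?\}',
}


def extract(pattern, text, group=0):
    """Return the requested group of the first match, or None if no match."""
    match = re.search(pattern, text)
    return match.group(group) if match else None


results = {key: extract(pattern, LINE) for key, pattern in PATTERNS.items()}

# Mirrors the REPLACE(..., ',', '') call wrapped around the cpu extraction.
results["cpu"] = results["cpu"].replace(",", "")

for key, value in results.items():
    print(f"{key}: {value}")

# Variant with a capturing group, returning only the name itself.
print("name only:", extract(r'name=>"(\w*)"', LINE, group=1))
```

Two details are worth noting when reading the output. The name pattern returns the `name=>` prefix along with the quoted value, so a capturing group (the last line above) may be preferable if only the value is wanted; in Pig that would presumably mean passing group index 1 instead of 0. Also, because `.*?` is lazy, the cluster_stats pattern stops at the first closing brace, so the nested block comes back without its final `}`.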