Apache Pig - URL parsing into a map

Question

I am pretty new to Pig and have a question about log parsing. I currently pull the important tags out of my URL string with REGEX_EXTRACT, but I think I should transform the whole string into a map instead. I am working on a sample data set with Pig 0.10, and I am starting to get really lost. In reality, my URL string has repeated tags, so my map should actually be a map with bags as values. Then I could write any subsequent job using FLATTEN.

Here is my test data. The last entry shows my problem with repeated tags.

`pig -x local`
grunt> cat test.log
test1   user=3553&friend=2042&system=262
test2   user=12523&friend=26546&browser=firfox
test2   user=205&friend=3525&friend=353

I am using TOKENIZE to generate an inner bag.

grunt> A = load 'test.log' as (f:chararray, url:chararray);
grunt> B = foreach A generate f, TOKENIZE(url,'&') as attr;
grunt> describe B;
B: {f: chararray,attr: {tuple_of_tokens: (token: chararray)}}

grunt> dump B;
(test1,{(user=3553),(friend=2042),(system=262)})
(test2,{(user=12523),(friend=26546),(browser=firfox)})
(test2,{(user=205),(friend=3525),(friend=353)})

Next I use nested foreach on these relations, but I think they have some limitations I am not aware of.

grunt> C = foreach B {
>> D = foreach attr generate STRSPLIT($0,'=');
>> generate f, D as taglist;
>> }

grunt> dump C;
(test1,{((user,3553)),((friend,2042)),((system,262))})
(test2,{((user,12523)),((friend,26546)),((browser,firfox))})
(test2,{((user,205)),((friend,3525)),((friend,353))})

grunt> G = foreach C {
>> H = foreach taglist generate TOMAP($0.$0, $0.$1) as tagmap;
>> generate f, H as alltags;
>> }

grunt> describe G;
G: {f: chararray,alltags: {tuple_of_tokens: (tagmap: map[])}}

grunt> dump G;
(test1,{([user#3553]),([friend#2042]),([system#262])})
(test2,{([user#12523]),([friend#26546]),([browser#firfox])})
(test2,{([user#205]),([friend#3525]),([friend#353])})

grunt> MAPTEST = foreach G generate f, flatten(alltags.tagmap);
grunt> describe MAPTEST;
MAPTEST: {f: chararray,null::tagmap: map[]}

grunt> res = foreach MAPTEST generate $1#'user';
grunt> dump res;
(3553)
()
()
(12523)
()
()
(205)
()
()

grunt> res = foreach MAPTEST generate $1#'friend';
grunt> dump res;
()
(2042)
()
()
(26546)
()
()
(3525)
(353)

So that's not terrible. I think it's close, but not perfect. My bigger concern is that I need to group the tags before adding them to the map, since the last line has two "friend" tags.

grunt> dump C;
(test1,{((user,3553)),((friend,2042)),((system,262))})
(test2,{((user,12523)),((friend,26546)),((browser,firfox))})
(test2,{((user,205)),((friend,3525)),((friend,353))})

I tried a nested foreach with a group, but that causes an error.

grunt> G = foreach C {
>> H = foreach taglist generate *;
>> I = group H by $1;
>> generate I;
>> }
2013-01-18 14:56:31,434 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200:   <line 34, column 10>  Syntax error, unexpected symbol at or near 'H'

Does anyone have ideas on how to get closer to turning this URL string into a map of bags? I figured there would be a Pig macro or something similar, since this seems like a common use case. Any ideas are very much appreciated.

Recommended answer

Good news and bad news. The good news is that this is pretty simple to achieve. The bad news is that you will not be able to get what I presume is the ideal (all of the tag/value pairs in a single map) without resorting to a UDF.

First, a couple of tips: FLATTEN the result of STRSPLIT so that you don't have a useless level of nesting in your tuples, and FLATTEN again inside the nested foreach so that you don't need to do it later. Also, STRSPLIT takes an optional third argument giving the maximum number of output strings. Use it to guarantee a schema for its output. Here's a modified version of your script:

A = load 'test.log' as (f:chararray, url:chararray);
B = foreach A generate f, TOKENIZE(url,'&') as attr;
C = foreach B {
    D = foreach attr generate FLATTEN(STRSPLIT($0,'=',2)) AS (key:chararray, val:chararray);
    generate f, FLATTEN(D);
};
E = foreach (group C by (f, key)) generate group.f, TOMAP(group.key, C.val);
dump E;

Output:

(test1,[user#{(3553)}])
(test1,[friend#{(2042)}])
(test1,[system#{(262)}])
(test2,[user#{(12523),(205)}])
(test2,[friend#{(26546),(3525),(353)}])
(test2,[browser#{(firfox)}])
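
On the STRSPLIT limit used in the script above: the following is a minimal plain-Java sketch, assuming STRSPLIT follows the semantics of Java's String.split(regex, limit). The class and method names (SplitLimitDemo, splitTag) and the sample token are made up for illustration. It shows why a limit of 2 always yields a (key, val) pair, even when the value itself contains '=' or is empty.

```java
final class SplitLimitDemo {
    // Splits one "key=value" token into at most two parts.
    // With limit 2 the regex is applied at most once, so any '=' inside
    // the value stays in the second part, and an empty value is kept.
    static String[] splitTag(String token) {
        return token.split("=", 2);
    }
}
```

For example, splitTag("next=/a?b=c") gives two parts, "next" and "/a?b=c", whereas "next=/a?b=c".split("=") without a limit gives three.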

After you have finished splitting out the tags and values, group by the tag as well as the id to get your bag of values, then put that bag into a map. Note that this assumes that if two lines share the same id (test2, here), you want to combine them. If that is not the case, you will need to construct a unique identifier for each line.
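
To check on a small sample what the grouping by (id, tag) should produce, here is a rough in-memory analogue in plain Java. It is only a sketch, not Pig code: the names GroupTagsDemo and group are hypothetical, and it assumes every line has an id column and a query-string column separated by whitespace.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupTagsDemo {
    // Groups "id<whitespace>query" lines by (id, key).
    // Returns id -> key -> list of values, in first-seen order.
    static Map<String, Map<String, List<String>>> group(List<String> lines) {
        Map<String, Map<String, List<String>>> out = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.split("\\s+", 2);      // id, query string
            for (String token : cols[1].split("&")) {   // analogue of TOKENIZE(url, '&')
                String[] kv = token.split("=", 2);      // analogue of STRSPLIT($0, '=', 2)
                String val = kv.length > 1 ? kv[1] : "";
                out.computeIfAbsent(cols[0], k -> new LinkedHashMap<>())
                   .computeIfAbsent(kv[0], k -> new ArrayList<>())
                   .add(val);
            }
        }
        return out;
    }
}
```

Fed the three lines of test.log, the two test2 lines are merged, so test2 ends up with three "friend" values and two "user" values, matching the output above.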

Unfortunately, there is apparently no way to combine maps without resorting to a UDF, but this should be just about the simplest of all possible UDFs. Something like this (untested):

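import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class COMBINE_MAPS extends EvalFunc<Map> {
    // Return type matches the map built below (values are whatever the input maps hold)
    @SuppressWarnings("unchecked")
    public Map<String, Object> exec(Tuple input) throws IOException {
        if (input == null || input.size() != 1) { return null; }

        // Input tuple is a singleton containing the bag of maps
        DataBag b = (DataBag) input.get(0);

        // Create map that we will construct and return
        Map<String, Object> m = new HashMap<String, Object>();

        // Iterate through the bag, adding the elements from each map
        Iterator<Tuple> iter = b.iterator();
        while (iter.hasNext()) {
            Tuple t = iter.next();
            m.putAll((Map<String, Object>) t.get(0));
        }

        return m;
    }
}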

With a UDF like that, you can do:

F = foreach (group E by f) generate COMBINE_MAPS(E.$1);

Note that in this UDF, if any of the input maps have overlapping keys, one will overwrite the other, and there is no way to tell ahead of time which will "win". If this could be a problem, you would need to add some sort of error-checking code to the UDF.
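
One possible shape for that error checking, sketched in plain Java without the Pig types so it can be tried in isolation. The class and method names (StrictMapMerge, merge) are hypothetical, and throwing on a duplicate key is just one policy; appending to the existing value or logging a warning would be equally valid choices.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StrictMapMerge {
    // Merges the given maps into one new map.
    // Fails loudly on a repeated key instead of silently overwriting it.
    static Map<String, Object> merge(List<Map<String, Object>> maps) {
        Map<String, Object> merged = new HashMap<>();
        for (Map<String, Object> m : maps) {
            for (Map.Entry<String, Object> e : m.entrySet()) {
                if (merged.containsKey(e.getKey())) {
                    throw new IllegalStateException("duplicate key: " + e.getKey());
                }
                merged.put(e.getKey(), e.getValue());
            }
        }
        return merged;
    }
}
```

Inside the UDF, the same containsKey check would replace the putAll call in the loop over the bag.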
