Pig UDF to calculate time difference in weblogs

Problem description

Is there a Pig UDF that calculates time difference in the weblogs?

Assuming I have weblogs in the below format:

10.171.100.10 - - [12/Jan/2012:14:39:46 +0530] "GET /amazon/navigator/index.php HTTP/1.1" 200 402 "someurl/page1" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:41:47 +0530] "GET /amazon/header.php HTTP/1.1" 200 4376 "someurl/page2" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:44:15 +0530] "GET /amazon/navigator/navigator.php HTTP/1.1" 200 912 "someurl/page3" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"

The user with IP 10.171.100.10 visited someurl/page1 at 12/Jan/2012:14:39:46 (the first entry in the weblogs). The same user then visited someurl/page2 at 12/Jan/2012:14:41:47, so the user stayed on page1 for 2 min 1 sec. Similarly, the user stayed on page2 for 2 min 28 sec (14:44:15 - 14:41:47). I don't care how long the user stayed on page3, as I have nothing to compare it with. The output can be:

10.171.100.10 someurl/page1 121 sec
10.171.100.10 someurl/page2 148 sec
etc.

The weblogs will have millions of lines, and the IPs will not necessarily be in sorted order. Any suggestions on how to go about this using Pig UDFs or any other technology?

Solution

I don't know of any function that would, by default, use the content of following rows to generate a value. Row order is not guaranteed, so relying on it would be highly unreliable.

You have to write your own UDF. To optimize the calculation (if you have billions of lines), you may want to ORDER your data by IP and date and GROUP it by IP before starting a MapReduce job on each IP (or IP group), to ensure that all the rows corresponding to a particular IP are processed by the same node.
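As a rough illustration of the logic such a UDF would carry out, here is a minimal Python sketch that groups hits by IP, sorts each group by timestamp, and diffs consecutive hits. It is plain Python rather than a registered Pig UDF. The names `LOG_PATTERN`, `parse_line`, and `time_on_pages` are made up for this example, and it assumes the page is the quoted field right after the byte count, as in the question's sample output.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the client IP, the bracketed timestamp, and the first quoted field
# after the status code and byte count (assumed here to be the page).
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "[^"]*" \d+ \S+ "(?P<page>[^"]*)"'
)


def parse_line(line):
    """Return (ip, timestamp, page) for one log line, or None if it does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("ip"), ts, m.group("page")


def time_on_pages(lines):
    """Group hits by IP, sort each group by time, and diff consecutive hits."""
    hits_by_ip = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            ip, ts, page = parsed
            hits_by_ip[ip].append((ts, page))

    results = []
    for ip in sorted(hits_by_ip):
        hits = sorted(hits_by_ip[ip], key=lambda h: h[0])
        # The last hit of each IP has no successor, so it is skipped.
        for (ts, page), (next_ts, _) in zip(hits, hits[1:]):
            results.append((ip, page, int((next_ts - ts).total_seconds())))
    return results
```

In a Pig script, the grouping step would be done by GROUP ... BY ip, and only the per-group sort and diff would live inside the UDF that receives each IP's bag.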

Also, I would advise you to think a bit longer about the rules you want to use to calculate the time spent on a page: when is a user still active and when is a user returning? You may end up with very long time ranges.
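One possible rule, sketched below, is to treat any gap longer than a cutoff as the user having left and to drop that row. The 30-minute value and the names `SESSION_TIMEOUT_SECONDS` and `drop_long_gaps` are assumptions for illustration and do not come from the answer.

```python
SESSION_TIMEOUT_SECONDS = 30 * 60  # assumed cutoff, pick a value that fits your traffic


def drop_long_gaps(durations, timeout=SESSION_TIMEOUT_SECONDS):
    """Keep only (ip, page, seconds) rows whose gap is within the timeout."""
    return [row for row in durations if row[2] <= timeout]
```

Rows above the cutoff could also be kept and flagged rather than dropped, depending on whether returning visits matter for the analysis.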
