Pig UDF to calculate time difference in weblogs

Question

Is there a Pig UDF that calculates time difference in the weblogs?

Assuming I have weblogs in the below format:

10.171.100.10 - - [12/Jan/2012:14:39:46 +0530] "GET /amazon/navigator/index.php HTTP/1.1" 200 402 "someurl/page1" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:41:47 +0530] "GET /amazon/header.php HTTP/1.1" 200 4376 "someurl/page2" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:44:15 +0530] "GET /amazon/navigator/navigator.php HTTP/1.1" 200 912 "someurl/page3" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"

The user with IP 10.171.100.10 visited someurl/page1 at 12/Jan/2012:14:39:46 (the first entry in the weblogs). The same user then visited someurl/page2 at 12/Jan/2012:14:41:47, so the user stayed on page1 for 2 min 1 sec. Similarly, the user stayed on page2 for 2 min 28 sec (14:44:15 - 14:41:47). I don't care how long the user stayed on page3, as I have nothing to compare it with. The output can be:

10.171.100.10 someurl/page1 121 sec 
10.171.100.10 someurl/page2 148 sec etc ..
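
The arithmetic behind those two figures can be checked with a few lines of plain Python. This is only a sketch of the timestamp subtraction, not a Pig UDF; the function name `seconds_between` is made up for illustration, and it assumes the bracketed timestamps always follow the layout shown in the sample lines.

```python
from datetime import datetime

# Layout of the timestamp inside the square brackets of the sample log lines
# (assumed to be the same for every line).
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def seconds_between(earlier: str, later: str) -> int:
    """Return the whole seconds elapsed between two log timestamps."""
    start = datetime.strptime(earlier, LOG_TIME_FORMAT)
    end = datetime.strptime(later, LOG_TIME_FORMAT)
    return int((end - start).total_seconds())


# Time on page1 and page2, using the timestamps from the three sample lines.
page1 = seconds_between("12/Jan/2012:14:39:46 +0530", "12/Jan/2012:14:41:47 +0530")
page2 = seconds_between("12/Jan/2012:14:41:47 +0530", "12/Jan/2012:14:44:15 +0530")
print(page1, page2)
```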

The weblogs will have millions of lines, and the IPs will not necessarily be in sorted order. Any suggestions on how to go about it using Pig UDFs or any other technology?

Recommended answer

I don't know of any function that would, by default, use the content of the following rows to generate a value, as the row sequence is variable and thus highly unreliable.

You have to write your own UDF. To optimize the calculation (if you have billions of lines), you may want to ORDER by IP and date, and to GROUP your data set by IP before starting a MapReduce job on each IP (or IP group), to ensure that all the rows corresponding to a particular IP are processed by the same node.
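
To make the group-then-order idea concrete, here is a minimal sketch in plain Python of the logic such a UDF would need to apply to each IP's rows. It is not Pig code and not a drop-in UDF. The regular expression, the function name `time_on_pages`, and the shortened user-agent strings in the sample are assumptions for illustration; as in the question, the quoted field after the byte count is taken to be the page.

```python
import re
from collections import defaultdict
from datetime import datetime

LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

# Assumed line layout: ip, two ignored fields, [timestamp], "request",
# status, bytes, "page". Anything after that is ignored.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "[^"]*" \d+ \S+ "([^"]*)"')


def time_on_pages(lines):
    """Return (ip, page, seconds) for every hit that has a later hit from the same IP."""
    visits = defaultdict(list)  # GROUP BY ip
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed layout
        ip, stamp, page = match.groups()
        visits[ip].append((datetime.strptime(stamp, LOG_TIME_FORMAT), page))

    results = []
    for ip in sorted(visits):
        hits = sorted(visits[ip], key=lambda hit: hit[0])  # ORDER BY date within the IP
        # Pair each hit with the next one; the last hit has no successor and is dropped.
        for (start, page), (end, _next_page) in zip(hits, hits[1:]):
            results.append((ip, page, int((end - start).total_seconds())))
    return results


# The three sample entries, deliberately out of order, with the user agent shortened.
sample = [
    '10.171.100.10 - - [12/Jan/2012:14:44:15 +0530] "GET /amazon/navigator/navigator.php HTTP/1.1" 200 912 "someurl/page3" "Mozilla/4.0"',
    '10.171.100.10 - - [12/Jan/2012:14:39:46 +0530] "GET /amazon/navigator/index.php HTTP/1.1" 200 402 "someurl/page1" "Mozilla/4.0"',
    '10.171.100.10 - - [12/Jan/2012:14:41:47 +0530] "GET /amazon/header.php HTTP/1.1" 200 4376 "someurl/page2" "Mozilla/4.0"',
]
durations = time_on_pages(sample)
for ip, page, seconds in durations:
    print(ip, page, seconds, "sec")
```

In a real job the grouping and ordering would be done by the framework, and only the pairwise subtraction over one IP's sorted rows would live in the UDF.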

Also, I would advise you to think a bit longer about the rules you want to use to calculate the time spent on a page: when is a user still active and when is a user returning? You may end up with very long time ranges.
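
One possible rule, shown here only as a sketch, is to treat any gap longer than a fixed timeout as the user leaving and coming back, and to report no time for the page that precedes such a gap. The 30-minute threshold and the name `dwell_times` are assumptions, not something the question specifies; timestamps are plain seconds to keep the example short.

```python
# Hypothetical cut-off: a gap longer than this is treated as a new visit.
SESSION_TIMEOUT_SECONDS = 30 * 60


def dwell_times(hits, timeout=SESSION_TIMEOUT_SECONDS):
    """hits: list of (time_in_seconds, page) for one IP, already sorted by time.

    Returns (page, seconds) pairs, skipping pages followed by a gap
    longer than the timeout, since the real time on those pages is unknown.
    """
    result = []
    for (start, page), (end, _next_page) in zip(hits, hits[1:]):
        gap = end - start
        if gap <= timeout:
            result.append((page, gap))
    return result


# The user leaves after page3 and comes back much later.
hits = [(0, "page1"), (121, "page2"), (269, "page3"), (10000, "page1"), (10060, "page2")]
capped = dwell_times(hits)
print(capped)
```

Without the cut-off, page3 in this example would be credited with 9731 seconds, which is the kind of very long range mentioned above.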
