Pig UDF to calculate time difference in weblogs
Question
Is there a Pig UDF that calculates time difference in the weblogs?
Assuming I have weblogs in the below format:
10.171.100.10 - - [12/Jan/2012:14:39:46 +0530] "GET /amazon/navigator/index.php HTTP/1.1" 200 402 "someurl/page1" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:41:47 +0530] "GET /amazon/header.php HTTP/1.1" 200 4376 "someurl/page2" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
10.171.100.10 - - [12/Jan/2012:14:44:15 +0530] "GET /amazon/navigator/navigator.php HTTP/1.1" 200 912 "someurl/page3" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; InfoPath.2; .NET CLR 3.0.4506.2152; MS-RTC LM 8; .NET CLR 3.5.30729; .NET CLR 2.0.50727)"
The user with IP 10.171.100.10 visited someurl/page1 at 12/Jan/2012:14:39:46 (the first entry in the weblogs). The same user next visited someurl/page2 at 12/Jan/2012:14:41:47, so the user stayed on page1 for 2 min 1 sec. Similarly, the user stayed on page2 for 2 min 28 sec (14:44:15 - 14:41:47). I don't care how long the user stayed on page3, as I have nothing to compare it with. The output can be:
10.171.100.10 someurl/page1 121 sec
10.171.100.10 someurl/page2 148 sec etc ..
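The arithmetic behind those two figures can be checked with a short sketch. It is plain Python, not a Pig UDF, and the helper name `seconds_between` is made up for illustration. It assumes the timestamps always follow the bracketed layout shown in the sample lines and that month names are in English.

```python
from datetime import datetime

# Layout of the timestamp inside the square brackets of the sample log lines.
# Note: %b matches English month abbreviations under the default C locale.
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def seconds_between(earlier: str, later: str) -> int:
    """Return the whole seconds elapsed between two log timestamps."""
    start = datetime.strptime(earlier, LOG_TIME_FORMAT)
    end = datetime.strptime(later, LOG_TIME_FORMAT)
    return int((end - start).total_seconds())


page1 = seconds_between("12/Jan/2012:14:39:46 +0530", "12/Jan/2012:14:41:47 +0530")
page2 = seconds_between("12/Jan/2012:14:41:47 +0530", "12/Jan/2012:14:44:15 +0530")
print(page1, page2)  # 121 148
```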
The weblogs will have millions of lines, and the IPs will not necessarily be in sorted order. Any suggestions on how to go about it using Pig UDFs or any other technology?
Recommended answer
I don't know of any built-in function that uses the content of following rows to generate a value, since the row sequence is variable and thus highly unreliable.
You have to write your own UDF. To optimize the calculation (if you have billions of lines), you may want to ORDER by IP and date, and to GROUP your data set by IP before starting a MapReduce job on each IP (or IP group), to ensure that all the rows corresponding to a particular IP are processed by the same node.
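The logic such a UDF would need (group by IP, sort each group by date, subtract consecutive timestamps) can be sketched as follows. This is plain Python to illustrate the steps, not Pig code or the Pig UDF API. The function name `time_on_page` and the `(ip, timestamp, page)` tuple layout are assumptions, with `page` taken from the referrer field as in the question.

```python
from collections import defaultdict
from datetime import datetime

# Timestamp layout from the sample log lines (assumes English month names).
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def time_on_page(records):
    """Compute the seconds spent on each page, per IP.

    records: iterable of (ip, timestamp, page) tuples in any order.
    Returns a list of (ip, page, seconds), one entry for every hit that is
    followed by a later hit from the same IP. The last hit of each IP is
    skipped because there is nothing to compare it with.
    """
    # Equivalent of GROUP BY ip: collect every hit of an IP in one place.
    hits_by_ip = defaultdict(list)
    for ip, stamp, page in records:
        hits_by_ip[ip].append((datetime.strptime(stamp, LOG_TIME_FORMAT), page))

    durations = []
    for ip in sorted(hits_by_ip):
        # Equivalent of ORDER BY date within the group.
        hits = sorted(hits_by_ip[ip], key=lambda hit: hit[0])
        # Pair each hit with the next one and take the time difference.
        for (start, page), (end, _next_page) in zip(hits, hits[1:]):
            durations.append((ip, page, int((end - start).total_seconds())))
    return durations


# The three sample hits from the question, deliberately out of order.
sample = [
    ("10.171.100.10", "12/Jan/2012:14:44:15 +0530", "someurl/page3"),
    ("10.171.100.10", "12/Jan/2012:14:39:46 +0530", "someurl/page1"),
    ("10.171.100.10", "12/Jan/2012:14:41:47 +0530", "someurl/page2"),
]
for row in time_on_page(sample):
    print(*row)
```

Grouping before sorting keeps each sort small, which matches the advice above: once all rows of an IP sit on the same node, only that IP's rows need to be ordered.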
Also, I would advise you to think a bit longer about the rules you want to use to calculate the time spent on a page: when is a user still active and when is a user returning? You may end up with very long time ranges.
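One possible rule, shown here only as a sketch, is to treat any gap longer than a fixed inactivity timeout as a returning visit and drop it from the results. The 30-minute value, the function name `drop_returning_visits`, and the third sample row are all assumptions for illustration. The right cutoff depends on the site.

```python
# Assumed inactivity cutoff: gaps longer than this count as a new visit.
SESSION_TIMEOUT_SECONDS = 30 * 60


def drop_returning_visits(durations, timeout=SESSION_TIMEOUT_SECONDS):
    """Keep only (ip, page, seconds) rows whose gap is within the timeout."""
    return [(ip, page, secs) for ip, page, secs in durations if secs <= timeout]


rows = [
    ("10.171.100.10", "someurl/page1", 121),
    ("10.171.100.10", "someurl/page2", 148),
    # Hypothetical row: the user came back a day later.
    ("10.171.100.10", "someurl/page3", 86400),
]
print(drop_returning_visits(rows))
```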