What's the best way to count unique visitors with Hadoop?


Question

Hey all, just getting started with Hadoop and curious what the best way in MapReduce would be to count unique visitors if your log files looked like this...

DATE       siteID  action   username
05-05-2010 siteA   pageview jim
05-05-2010 siteB   pageview tom
05-05-2010 siteA   pageview jim
05-05-2010 siteB   pageview bob
05-05-2010 siteA   pageview mike

And for each site you wanted to find out the unique visitors?

I was thinking the mapper would emit siteID username, and the reducer would keep a set() of the unique usernames per key and then emit the length of that set. However, that would potentially store millions of usernames in memory, which doesn't seem right. Anyone have a better way?

I'm using Python streaming, by the way.

Thanks

Answer

You could do it as a two-stage operation:

First step: emit (username => siteID), and have the reducer just collapse multiple occurrences of siteID using a set. Since you'd typically have far fewer sites than users, the per-user set stays small and this should be fine.
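A minimal sketch of that first step, written as plain Python functions rather than full Hadoop Streaming scripts (the function names are illustrative, and `reduce_step1` assumes its input arrives sorted by username, as Hadoop guarantees for reducer input):

```python
import itertools

def map_step1(log_lines):
    """Emit (username, siteID) for each pageview line."""
    for line in log_lines:
        fields = line.split()
        if len(fields) != 4 or fields[0] == "DATE":
            continue  # skip the header and malformed lines
        date, site_id, action, username = fields
        if action == "pageview":
            yield username, site_id

def reduce_step1(pairs):
    """Collapse duplicate siteIDs per user.

    `pairs` must be sorted by username, mirroring the sorted stream a
    Hadoop reducer sees; only one user's sites are held in memory at once.
    """
    for user, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        for site in sorted({site for _, site in group}):
            yield site, user  # de-duplicated (siteID, username) pairs
```

With the sample log above, jim's two siteA pageviews collapse into a single (siteA, jim) pair, which is exactly what makes the second stage a plain count.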

Then in the second step, you can emit (siteID => username) and do a simple count, since the duplicates have been removed.
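The second step in the same sketch style (again, illustrative function names; the reducer assumes input sorted by siteID):

```python
import itertools

def map_step2(pairs):
    """Key by siteID; step 1 already emits (siteID, username) pairs."""
    for site, user in pairs:
        yield site, user

def reduce_step2(pairs):
    """Count usernames per siteID; `pairs` must be sorted by siteID.

    Because step 1 removed duplicate (siteID, username) pairs, a plain
    count here equals the number of unique visitors per site -- no set
    of millions of usernames ever needs to fit in memory.
    """
    for site, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield site, sum(1 for _ in group)
```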

