What's the best way to count unique visitors with Hadoop?


Question




Hey all, I'm just getting started with Hadoop and curious what the best way in MapReduce would be to count unique visitors, if your logfiles looked like this...

DATE       siteID  action   username
05-05-2010 siteA   pageview jim
05-05-2010 siteB   pageview tom
05-05-2010 siteA   pageview jim
05-05-2010 siteB   pageview bob
05-05-2010 siteA   pageview mike

How would you find the number of unique visitors for each site?

I was thinking the mapper would emit siteID \t username, and the reducer would keep a set() of the unique usernames per key and then emit the length of that set. However, that could mean storing millions of usernames in memory, which doesn't seem right. Anyone have a better way?

I'm using Python with Hadoop Streaming, by the way.
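For concreteness, here's a minimal sketch of that single-pass idea as two Streaming scripts (the script names, the header/malformed-line handling, and the pageview filter are my own illustration):

#!/usr/bin/env python
# mapper.py -- emit "siteID<TAB>username" for every pageview line
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) != 4:
        continue  # skip the header row and malformed lines
    date, site_id, action, username = fields
    if action == "pageview":
        print("%s\t%s" % (site_id, username))

#!/usr/bin/env python
# reducer.py -- Streaming input arrives sorted by key, so all usernames
# for one site are contiguous
import sys

current_site = None
users = set()

for line in sys.stdin:
    site_id, username = line.rstrip("\n").split("\t", 1)
    if site_id != current_site:
        if current_site is not None:
            print("%s\t%d" % (current_site, len(users)))
        current_site = site_id
        users = set()
    users.add(username)  # this is the set that could grow to millions of names

if current_site is not None:
    print("%s\t%d" % (current_site, len(users)))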

thanks

Solution

You could do it as a 2-stage operation:

In the first step, emit (username => siteID), and have the reducer just collapse multiple occurrences of siteID using a set - since you'd typically have far fewer sites than users, this should be fine.
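A sketch of that first stage in the same Python Streaming style (script names are illustrative):

#!/usr/bin/env python
# stage1_mapper.py -- key on username so each user's rows meet at one reducer
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) != 4:
        continue  # skip the header row and malformed lines
    date, site_id, action, username = fields
    if action == "pageview":
        print("%s\t%s" % (username, site_id))

#!/usr/bin/env python
# stage1_reducer.py -- collapse duplicate siteIDs per user; the set holds at
# most one entry per site, so it stays small
import sys

def flush(user, sites):
    for site_id in sites:
        print("%s\t%s" % (site_id, user))

current_user = None
sites = set()

for line in sys.stdin:
    username, site_id = line.rstrip("\n").split("\t", 1)
    if username != current_user:
        if current_user is not None:
            flush(current_user, sites)
        current_user = username
        sites = set()
    sites.add(site_id)

if current_user is not None:
    flush(current_user, sites)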

Then in the second step, you can emit (siteID => username) and do a simple count, since the duplicates have been removed.
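And a sketch of the second stage; the mapper can be a simple pass-through, since the stage-1 output is already in siteID\tusername form, so only the reducer matters:

#!/usr/bin/env python
# stage2_reducer.py -- every (siteID, username) pair is now unique, so the
# unique-visitor count is just the number of values per key
import sys

current_site = None
count = 0

for line in sys.stdin:
    site_id, _username = line.rstrip("\n").split("\t", 1)
    if site_id != current_site:
        if current_site is not None:
            print("%s\t%d" % (current_site, count))
        current_site = site_id
        count = 0
    count += 1

if current_site is not None:
    print("%s\t%d" % (current_site, count))

You'd run each stage as its own hadoop-streaming job, pointing the second job's input at the first job's output directory.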
