What's the best way to count unique visitors with Hadoop?


Question

Hey all, just getting started with Hadoop and curious what the best way in MapReduce would be to count unique visitors if your log files looked like this...

DATE       siteID  action   username
05-05-2010 siteA   pageview jim
05-05-2010 siteB   pageview tom
05-05-2010 siteA   pageview jim
05-05-2010 siteB   pageview bob
05-05-2010 siteA   pageview mike

And for each site you wanted to find out the unique visitors?

I was thinking the mapper would emit siteID username, and the reducer would keep a set() of the unique usernames per key and then emit the length of that set. However, that would potentially store millions of usernames in memory, which doesn't seem right. Anyone have a better way?

I'm using Python streaming, by the way.

Thanks

Answer

You could do it as a two-stage operation:

First step: emit (username => siteID), and have the reducer just collapse multiple occurrences of siteID using a set. Since you'd typically have far fewer sites than users, the per-user set stays small and this should be fine.
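A minimal sketch of that first step, written as plain Python functions rather than full Hadoop Streaming scripts (the function names are illustrative, and `reduce_step1` assumes its input arrives sorted by username, as Hadoop guarantees for reducer input):

```python
import itertools

def map_step1(log_lines):
    """Emit (username, siteID) for each pageview line."""
    for line in log_lines:
        fields = line.split()
        if len(fields) != 4 or fields[0] == "DATE":
            continue  # skip the header and malformed lines
        date, site_id, action, username = fields
        if action == "pageview":
            yield username, site_id

def reduce_step1(pairs):
    """Collapse duplicate siteIDs per user.

    `pairs` must be sorted by username, mirroring the sorted stream a
    Hadoop reducer sees; only one user's sites are held in memory at once.
    """
    for user, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        for site in sorted({site for _, site in group}):
            yield site, user  # de-duplicated (siteID, username) pairs
```

With the sample log above, jim's two siteA pageviews collapse into a single (siteA, jim) pair, which is exactly what makes the second stage a plain count.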

Then in the second step, you can emit (siteID => username) and do a simple count, since the duplicates have been removed.
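The second step in the same sketch style (again, illustrative function names; the reducer assumes input sorted by siteID):

```python
import itertools

def map_step2(pairs):
    """Key by siteID; step 1 already emits (siteID, username) pairs."""
    for site, user in pairs:
        yield site, user

def reduce_step2(pairs):
    """Count usernames per siteID; `pairs` must be sorted by siteID.

    Because step 1 removed duplicate (siteID, username) pairs, a plain
    count here equals the number of unique visitors per site -- no set
    of millions of usernames ever needs to fit in memory.
    """
    for site, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield site, sum(1 for _ in group)
```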

