Calculating unique elements from huge list in Google App Engine

Question

I have a web widget that gets 15,000,000 hits per month, and I log every session. When I generate a report, I'd like to know how many unique IPs there are. In normal SQL that would be easy, as I'd just do:

SELECT COUNT(*) FROM (SELECT DISTINCT IP FROM SESSIONS)

But as that's not possible with App Engine, I'm now looking into how to do it. It doesn't need to be fast.

One solution I was thinking of is to have an empty Unique-IP table, then run a MapReduce job that goes through all session entities: if an entity's IP is not in the table, I add it and add one to a counter. Then I'd have another MapReduce job that clears the table. Would this be crazy? If so, how would you do it?

Thanks!

Recommended answer

The mapreduce approach you suggest is exactly what you want. Don't forget to use transactions when you update the record in your task queue task; that is what allows you to run it in parallel with many mappers.
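To illustrate why the check-and-insert has to be atomic, here is a minimal sketch in plain Python. It does not use the App Engine APIs: `UniqueIpStore`, `record`, and `count_unique_ips` are hypothetical names, the set stands in for the Unique-IP table, and a `threading.Lock` plays the role that a datastore transaction would play on App Engine. Worker threads stand in for parallel mappers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class UniqueIpStore:
    """In-memory stand-in for the Unique-IP table plus its counter (assumption)."""

    def __init__(self):
        self._seen = set()
        self._count = 0
        # The lock plays the role of the transaction: the membership check,
        # the insert, and the counter update happen as one atomic step.
        self._lock = threading.Lock()

    def record(self, ip):
        """Add ip if it is new and bump the counter. Returns True if it was new."""
        with self._lock:
            if ip in self._seen:
                return False
            self._seen.add(ip)
            self._count += 1
            return True

    @property
    def count(self):
        with self._lock:
            return self._count


def count_unique_ips(session_ips, workers=8):
    """Process every logged IP with several parallel workers ("mappers")."""
    store = UniqueIpStore()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all tasks and surfaces any exception they raised.
        list(pool.map(store.record, session_ips))
    return store.count
```

Without the atomic step, two workers could both see the same IP as missing and both increment the counter, which would overcount. On App Engine the same guarantee would have to come from a datastore transaction rather than a lock.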

In the future, reduce support will make this possible with a single straightforward mapreduce, with no hacking around with your own transactions and models.
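As a rough picture of what a map step followed by a reduce step would look like for this problem, here is a plain-Python sketch. It is not the App Engine MapReduce API: `map_session`, `shuffle`, `reduce_ip`, and `count_unique_ips_mapreduce` are hypothetical names, and each session is assumed to be a dict with an `"ip"` key.

```python
from collections import defaultdict


def map_session(session):
    """Map step: emit the session's IP as the key."""
    yield session["ip"], 1


def shuffle(pairs):
    """Group emitted values by key, as a mapreduce framework would between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_ip(ip, values):
    """Reduce step: each distinct IP contributes exactly one, however often it was seen."""
    return 1


def count_unique_ips_mapreduce(sessions):
    """Run map, shuffle, and reduce in sequence and sum the reduce outputs."""
    pairs = (pair for session in sessions for pair in map_session(session))
    groups = shuffle(pairs)
    return sum(reduce_ip(ip, values) for ip, values in groups.items())
```

Because the grouping by key already collapses duplicates, no separate Unique-IP table, counter, or transaction is needed in this formulation.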
