Designing an access/web statistics counter module for appengine

Question
I need an access statistics module for appengine that tracks a few request-handlers and collects statistics to bigtable. I have not found any ready-made solution on github, and Google's examples are either oversimplified (memcached frontpage counter with cron) or overkill (accurate sharded counter). But most importantly, no appengine-counter solution discussed elsewhere includes the time component (hourly, daily counts) needed for statistics.
Requirements: The system does not need to be 100% accurate and could just ignore memcache loss (if infrequent). This should simplify things considerably. The idea is to just use memcache and accumulate stats in time intervals.
UseCase: Users on your system create content (e.g. pages). You want to track approximately how often a user's pages are viewed per hour or day. Some pages are viewed often, some never. You want to query by user and timeframe. Subpages may have fixed IDs (query for the user with most hits on homepage). You may want to delete old entries (query for entries of year=xxxx).
class StatisticsDB(ndb.Model):
    # key.id() = something like YYYY-MM-DD-HH_groupId_countableId ... contains date
    # timeframeId = ndb.StringProperty()  # YYYY-MM-DD-HH, needed for cleanup if counter uses ancestors
    countableId = ndb.StringProperty(required=True)  # name of counter within group
    groupId = ndb.StringProperty()       # counter group (allows single DB query with timeframe prefix inequality)
    count = ndb.IntegerProperty()        # count per specified timeframe

    @classmethod
    def increment(cls, groupId, countableId):
        # increment memcache
        # save hourly to DB (see below)
        pass
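Purely for illustration, the increment stub could work roughly like this. `FakeMemcache` is a tiny in-process stand-in for `google.appengine.api.memcache` (supporting `incr` with `initial_value`), and `counter_key` is a hypothetical helper building the `YYYY-MM-DD-HH_groupId_countableId` key layout from the comment above; neither is part of any real API.

```python
import datetime

class FakeMemcache:
    """Minimal in-process stand-in for google.appengine.api.memcache."""
    def __init__(self):
        self.store = {}

    def incr(self, key, delta=1, initial_value=None):
        if key not in self.store:
            if initial_value is None:
                return None  # real memcache.incr fails on a missing key
            self.store[key] = initial_value
        self.store[key] += delta
        return self.store[key]

memcache = FakeMemcache()

def counter_key(groupId, countableId, now=None):
    # Hourly timeframe prefix, e.g. "2014-01-31-17_user42_home"
    now = now or datetime.datetime.utcnow()
    return "%s_%s_%s" % (now.strftime("%Y-%m-%d-%H"), groupId, countableId)

def increment(groupId, countableId, now=None):
    # Bump only the in-memory counter; persisting to the DB happens
    # later via cron, a task queue, or one of the flush strategies below.
    return memcache.incr(counter_key(groupId, countableId, now), initial_value=0)
```

Starting each counter at `initial_value=0` means the first request of a timeframe transparently creates the key; with real memcache that first `incr` is also the natural moment to schedule whatever persistence mechanism is chosen.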
Note: the groupId and countableId indexes are necessary to avoid 2 inequalities in queries (query all countables of a groupId/userId, and for the chart/high-count query: which countableId has the highest counts for a given groupId/user). Using ancestors in the DB may not support the chart queries.
The problem is how to best save the memcached counter to DB:
- cron: This approach is mentioned in example docs (example front-page counter), but uses fixed counter ids that are hardcoded in the cron-handler. As there is no prefix-query for existing memcache keys, determining which counter-ids were created in memcache during the last time interval and need to be saved is probably the bottleneck.
- task-queue: if a counter is created schedule a task to collect it and write it to DB. COST: 1 task-queue entry per used counter and one ndb.put per time granularity (e.g. 1 hour) when the queue-handler saves the data. Seems the most promising approach to also capture infrequent events accurately.
- infrequently when increment(id) executes: if a new timeframe starts, save the previous one. This needs at least 2 memcache accesses per increment (get date, incr counter): one for tracking the timeframe and one for the counter. Disadvantage: bursty counters with longer stale periods may lose the cache entry.
- infrequently when increment(id) executes: probabilistic: if random % 100 == 0 then save to DB, but the counter should have uniformly distributed counting events
- infrequently when increment(id) executes: if the counter reaches e.g. 100 then save to DB
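The last option above (flush once the counter reaches a threshold) can be sketched as follows. This is only an illustration of the control flow: `cache` stands in for memcache, and `flush_to_db` is a hypothetical placeholder for the real ndb put, recording its writes in a list.

```python
FLUSH_THRESHOLD = 100
cache = {}       # stand-in for memcache
flushed = []     # stand-in for ndb puts; records (key, delta) writes

def flush_to_db(key, delta):
    # In a real handler this would be a (transactional) ndb update.
    flushed.append((key, delta))

def increment_with_threshold(key):
    cache[key] = cache.get(key, 0) + 1
    if cache[key] >= FLUSH_THRESHOLD:
        # Hand the accumulated delta to the datastore and reset the
        # cache entry. Note the stated weakness: infrequent counters
        # only reach the DB if something else flushes them at the end
        # of the timeframe (or never, if the cache entry is evicted).
        flush_to_db(key, cache.pop(key))
```

The sketch makes the trade-off visible: hot counters cost one DB write per 100 increments, while a counter that never reaches the threshold never persists on its own.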
Did anyone solve this problem, what would be a good way to design this? What are the weaknesses and strengths of each approach? Are there alternate approaches that are missing here?
Assumptions: Counting can be slightly inaccurate (cache loss), the counterID space is large, counterIDs are incremented sparsely (some once per day, some often per day).
Update: 1) I think cron can be used similarly to the task queue. One only has to create the DB model of the counter with memcached=True and run a query in cron for all counters marked that way. COST: 1 put at the 1st increment, a query at cron time, 1 put to update the counter. Without thinking it through fully, this appears slightly more costly/complex than the task approach.
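The cron variant from the update could be sketched like this, again with plain dicts standing in for memcache and the datastore (the `memcached` flag imitates the proposed marker field on the StatisticsDB row; none of this is real App Engine API).

```python
cache = {}   # stand-in for memcache
db = {}      # stand-in for StatisticsDB rows: key -> {"count", "memcached"}

def increment(key):
    # 1 put at the first increment only: create the row with a
    # memcached=True marker so the cron job can find it by query.
    if key not in db:
        db[key] = {"count": 0, "memcached": True}
    cache[key] = cache.get(key, 0) + 1

def cron_flush():
    # cron: one query for all marked counters, one put per counter
    # to fold the cached delta into the row and clear the marker.
    for key, row in db.items():
        if row["memcached"] and key in cache:
            row["count"] += cache.pop(key)
            row["memcached"] = False
```

This matches the cost estimate above: the extra put at first increment buys cron a way to enumerate live counters, which plain memcache cannot do (no prefix query over keys).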
Discussed elsewhere:
- High concurrency non-sharded counters - no count per timeframe
- Open Source GAE Fast Counters - no count per timeframe, nice performance comparison to sharded solution, expected losses due to memcache loss reported
Solution

Yep, your #2 idea (task-queue) seems to best address your requirements.
To implement it you need a task execution with a specified delay.
I used the deferred library for this purpose, using deferred.defer()'s countdown argument. I learned in the meantime that the standard taskqueue library has similar support, by specifying the countdown argument for a Task constructor (I have yet to use this approach, tho).

So whenever you create a memcache counter, also enqueue a delayed execution task (passing the counter's memcache key in its payload) which will:
- get the memcache counter value using the key from the task payload
- add the value to the corresponding db counter
- delete the memcache counter when the db update is successful
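The three steps above can be sketched as a single task body. In a real handler it would be enqueued with something like `deferred.defer(flush_counter, key, _countdown=3600)`; here `cache` and `db` are simple stand-ins for memcache and the datastore so the logic is visible on its own.

```python
cache = {}   # stand-in for memcache
db = {}      # stand-in for the StatisticsDB rows, keyed by counter key

def flush_counter(key):
    # Body of the delayed task enqueued alongside counter creation.
    # 1. get the memcache counter value using the key from the payload
    value = cache.get(key)
    if not value:
        return  # counter evicted or never incremented; nothing to save
    # 2. add the value to the corresponding db counter
    db[key] = db.get(key, 0) + value
    # 3. delete the memcache counter only after the db update succeeded
    del cache[key]
```

Keeping the delete last is what opens the read-to-delete race discussed next: increments landing between steps 1 and 3 are dropped with the key.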
You'll probably lose the increments from concurrent requests that arrive between the moment the memcache counter is read in the task execution and the moment it is deleted. You could reduce such loss by deleting the memcache counter immediately after reading it, but you'd risk losing the entire count if the DB update fails for whatever reason - re-trying the task would no longer find the memcache counter. If neither of these is satisfactory you could further refine the solution:
The delayed task:
- reads the memcache counter value
- enqueues another (transactional) task (with no delay) for adding the value to the db counter
- deletes the memcache counter
The non-delayed task is now idempotent and can be safely re-tried until successful.
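The refinement splits into two task bodies, sketched below with the same dict stand-ins. `queue` is a hypothetical placeholder for the (transactional) task queue; the point is that `apply_delta` carries the value in its payload, so retrying it after a failed DB write no longer depends on the memcache entry still existing.

```python
cache = {}   # stand-in for memcache
db = {}      # stand-in for the db counters
queue = []   # stand-in for the (transactional) task queue

def delayed_task(key):
    # Reads the memcache counter, hands the value off as a task
    # payload, and only then deletes the memcache counter.
    value = cache.get(key)
    if not value:
        return
    queue.append((key, value))
    del cache[key]

def apply_delta(key, value):
    # Non-delayed task: the payload holds the delta, so a retry after
    # a failed DB write can simply run this again without memcache.
    db[key] = db.get(key, 0) + value
```

With real task queues "safe to re-try" leans on transactional enqueueing: a task that fails before commit is re-delivered, and since it never touches memcache there is no second chance to lose the count.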
The risk of loss of increments from concurrent requests still exists, but I guess it's smaller.
Update:
The Task Queues are preferable to the deferred library; the deferred functionality is available using the optional countdown or eta arguments to taskqueue.add():
countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.
eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.