Is BigTable slow or am I dumb?


Problem description

I basically have the classic many to many model. A user, an award, and a "many-to-many" table mapping between users and awards.

Each user has on the order of 400 awards and each award is given to about 1/2 the users.

I want to iterate over all of the user's awards and sum up their points. In SQL it would be a table join between the many-to-many and then walk through each of the rows. On a decent machine with a MySQL instance, 400 rows should not be a big deal at all.
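
Roughly, the setup looks like the following sketch (the model and property names are illustrative only, not the exact code):

    from google.appengine.ext import db

    class User(db.Model):
        name = db.StringProperty()

    class Award(db.Model):
        title = db.StringProperty()
        points = db.IntegerProperty()

    class UserAward(db.Model):
        # One entity per (user, award) pair -- the many-to-many mapping.
        user = db.ReferenceProperty(User, collection_name='user_awards')
        award = db.ReferenceProperty(Award, collection_name='award_users')

    def total_points_naive(user):
        # Walking the mapping rows and dereferencing mapping.award costs one
        # datastore get per row, so ~400 awards means ~400 round trips.
        total = 0
        for mapping in user.user_awards:
            total += mapping.award.points
        return total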

On App Engine I'm seeing around 10 seconds to do the sum, with most of the time spent in Google's datastore. Here are the first few rows of the cProfile output:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      462    6.291    0.014    6.868    0.015 {google3.apphosting.runtime._apphosting_runtime___python__apiproxy.Wait}
      913    0.148    0.000    1.437    0.002 datastore.py:524(_FromPb)
     8212    0.130    0.000    0.502    0.000 datastore_types.py:1345(FromPropertyPb)
      462    0.120    0.000    0.458    0.001 {google3.net.proto._net_proto___parse__python.MergeFromString}

Is my data model wrong? Am I doing the lookups wrong? Is this a shortcoming that I have to deal with by caching and bulk updating (which would be a royal pain in the ass)?

Solution

Could be a bit of both ;-)

If you're doing 400 queries on the Awards table, one for each result returned for a query on the mapping table, then I would expect that to be painful. The 1000-result limit on queries is there because BigTable thinks that returning 1000 results is at the limit of its ability to operate in a reasonable time. Based on the architecture, I'd expect the 400 queries to be way slower than the one query returning 400 results (400 log N vs. (log M) + 400).
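
For example, a single query on the mapping table followed by one batch get keeps it down to two round trips instead of ~400 (a sketch assuming the model names from the sketch in the question; get_value_for_datastore reads the stored key without dereferencing each ReferenceProperty):

    from google.appengine.ext import db

    def total_points_batched(user):
        # One query for the mapping rows...
        mappings = UserAward.all().filter('user =', user).fetch(1000)
        # ...collect the award keys without dereferencing the references...
        award_keys = [UserAward.award.get_value_for_datastore(m)
                      for m in mappings]
        # ...then a single batch get for all the Award entities.
        awards = db.get(award_keys)
        return sum(a.points for a in awards if a is not None)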

The good news is that on GAE, memcaching a single hashtable containing all the awards and their points values is pretty straightforward (well, looked pretty straightforward when I cast an eye over the memcache docs a while back. I've not needed to do it yet).
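
Something along these lines would do it (a sketch, untested; the cache key and expiry time are arbitrary choices):

    from google.appengine.api import memcache
    from google.appengine.ext import db

    AWARD_POINTS_KEY = 'award_points'  # arbitrary cache key

    def get_award_points():
        # Cache a single dict of {award key string: points}; a few hundred
        # awards fits comfortably in one memcache value.
        points = memcache.get(AWARD_POINTS_KEY)
        if points is None:
            points = dict((str(a.key()), a.points)
                          for a in Award.all().fetch(1000))
            memcache.set(AWARD_POINTS_KEY, points, time=3600)  # 1 hour
        return points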

Also, if you didn't already know, for result in query.fetch(1000) is way faster than for result in query, and you're restricted to 1000 results either way. The advantages of the latter are (1) it might be faster if you bail out early, and (2) if Google ever increases the limit beyond 1000, it gets the benefit without a code change.
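
In other words (query stands for any datastore query here, and handle/done are placeholders for whatever your loop does):

    # fetch(1000): one batched call, capped at 1000 results.
    for mapping in query.fetch(1000):
        handle(mapping)

    # Plain iteration: results arrive in smaller batches behind the scenes,
    # so bailing out early avoids fetching the rest, and the loop isn't
    # hard-wired to the 1000 cap.
    for mapping in query:
        if done(mapping):
            break
        handle(mapping)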

You might also have problems when you delete a user (or an award). I found on one test that I could delete 300 objects inside the time limit. Those objects were more complex than your mapping objects, having 3 properties and 5 indices (including the implicit ones), whereas your mapping table probably only has 2 properties and 2 (implicit) indices. [Edit: just realised that I did this test before I knew that db.delete() can take a list, which is probably much faster].
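
With the list form, clearing out a user's mapping rows can be done in a couple of batch calls (again a sketch using the assumed model names; keys_only avoids pulling whole entities back just to delete them):

    from google.appengine.ext import db

    def delete_user(user):
        # Fetch just the keys of the mapping rows, then delete them in one
        # batch call; db.delete() accepts a list of keys or entities.
        mapping_keys = UserAward.all(keys_only=True).filter(
            'user =', user).fetch(1000)
        db.delete(mapping_keys)
        db.delete(user)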

BigTable does not necessarily do the things that relational databases are designed to do well. Instead, it distributes data well across many nodes. But almost all websites run fine with a bottleneck on a single db server, and hence don't strictly need the thing that BigTable does.

One other thing: if you're doing 400 datastore queries on a single HTTP request, then you will find that you hit your datastore fixed quota well before you hit your request fixed quota. Of course if you're well within quotas, or if you're hitting something else first, then this might be irrelevant for your app. But the ratio between the two quotas is something like 8:1, and I take this as a hint about what Google expects my data model to look like.
