Bigtable performance influence column families


Question



        We are currently investigating the influence of using multiple column families on the performance of our Bigtable queries. We found that splitting the columns into multiple column families does not improve performance. Has anyone had similar experiences?

        Some more details about our benchmark setup. At this moment each row in our production table contains around 5 columns, each holding between 0.1 and 1 KB of data. All columns are stored in one column family. When performing a row key range filter (which returns on average 340 rows) and applying a column regex filter (which returns only 1 column for each row), the query takes on average 23.3 ms. We created some test tables where we increased the amount of columns/data per row by a factor of 5. In test table 1, we kept everything in one column family. As expected, this increased the query time of that same query to 40.6 ms. In test table 2 we kept the original data in one column family, but the extra data was put into another column family. When querying the column family containing the original data (thus the same amount of data as in the original table), the query time was on average 44.3 ms. So performance even decreased when using more column families.

        This is exactly the opposite of what we would have expected. For example, this is mentioned in the Bigtable docs (https://cloud.google.com/bigtable/docs/schema-design#column_families):

        Grouping data into column families allows you to retrieve data from a single family, or multiple families, rather than retrieving all of the data in each row. Group data as closely as you can to get just the information that you need, but no more, in your most frequent API calls.

        Anyone with an explanation for our findings?

        benchmark results


        (edit: added some more details)

        The content of a single row:

        Table 1:

        • cf1

          • col1
          • col2
          • ...
          • col25

        Table 2:

        • cf1
          • col1
          • col2
          • ...
          • col5
        • cf2
          • col6
          • col7
          • ...
          • col25

        The benchmark we are executing uses the Go client. The code that calls the API looks basically as follows:

        // Return only the latest cell of the single column matching colPattern,
        // within the requested family, for every row in the key range.
        filter = bigtable.ChainFilters(
            bigtable.FamilyFilter(request.ColumnFamily),
            bigtable.ColumnFilter(colPattern),
            bigtable.LatestNFilter(1))
        tbl := bf.Client.Open(table)
        rr := bigtable.NewRange(request.RowKeyStart, request.RowKeyEnd)
        err = tbl.ReadRows(c, rr, func(row bigtable.Row) bool { return true }, bigtable.RowFilter(filter))
        

        Solution

        If you are retrieving X cells per row, it does not make a major performance difference whether those cells are in X separate column families or in 1 column family with X column qualifiers.

        The performance difference comes in if you only actually need the cells in a row that serve some specific purpose - you can then avoid selecting all cells for the row and instead fetch just one column family (by specifying a filter on the ReadRow call).


        A more important factor is simply picking a schema that accurately describes your data. If you do this, any gain of the type above will come naturally. You will also avoid hitting the recommended limit of 100 column families.

        For example: imagine you are writing leaderboard software, and you want to store the scores a player has achieved in each game, plus some personal details. Your schema might be:

        • Row Key: username
        • Column Family user_info
          • Column Qualifier full_name
          • Column Qualifier password_hash
        • Column Family game_scores
          • Column Qualifier candy_royale
          • Column Qualifier clash_of_tanks

        Having each game stored as a separate column within the game_scores column family allows all scores for a user to be fetched at once without also fetching user_info, keeps the number of column families manageable, allows an independent time series of scores per game, and brings other benefits that come from mirroring the nature of the data.

