What's the most efficient way to partition a large hash into N smaller hashes in Ruby?



The Problem


I'm working on a problem that involves sharding. As part of the problem I need to find the fastest way to partition a large Ruby hash (> 200,000 entries) into two or more pieces.

  • Are there any non-O(n) approaches?

  • Is there a non-Ruby, i.e. C/C++, implementation?

Please don't reply with examples using the trivial approach of converting the hash to an array and rebuilding N distinct hashes.

My concern is that Ruby is too slow to do this kind of work.


The initial approach


This was the first solution I tried. What was appealing about it was:

  • it didn't need to loop slavishly across the hash
  • it didn't need to manage a counter to allocate the members evenly among the shards
  • it's short and neat-looking

Ok, it isn't O(n) but it relies on methods in the standard library which I figured would be faster than writing my own Ruby code.


pivot = s.size / 2

slices = s.each_slice(pivot)

s1 = Hash[*slices.entries[0].flatten]

s2 = Hash[*slices.entries[1].flatten]
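Made self-contained (the snippet above assumes `s` already exists), the initial approach runs as below; the sample hash and its size are placeholders for the real data:

```ruby
# Sample hash standing in for the real >200,000-entry hash.
s = {}
1.upto(10) { |i| s["key#{i}"] = i }

pivot = s.size / 2

# Hash mixes in Enumerable, so each_slice yields arrays of [key, value] pairs.
slices = s.each_slice(pivot)

s1 = Hash[*slices.entries[0].flatten]
s2 = Hash[*slices.entries[1].flatten]

puts s1.size  # => 5
puts s2.size  # => 5
```

One caveat: the `flatten` step would mangle any values that are themselves arrays; on Ruby 1.9+, `Hash[slices.entries[0]]` (passing the pair array directly) avoids that.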


A better solution

Mark and Mike were kind enough to suggest approaches. I have to admit that Mark's approach felt wrong - it did exactly what I didn't want - it looped over all of the members of the hash and evaluated a conditional as it went - but since he'd taken the time to do the evaluation, I figured I should try a similar approach and benchmark it. This is my adapted version of his approach (my keys aren't numbers, so I couldn't take his approach verbatim).

require 'benchmark'

def split_shard(s)
    shard1 = {}
    shard2 = {}


    t = Benchmark.measure do
        n = 0

        pivot = s.size / 2

        s.each_pair do |k,v|
            if n < pivot
                shard1[k] = v
            else
                shard2[k] = v
            end

            n += 1
        end
    end

    $b += t.real
    $e += s.size
    return shard1, shard2
end
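One gap in the snippet as shown: the globals `$b` and `$e` must be initialised before the first call, or `nil + t.real` raises a `NoMethodError`. A standalone driver (repeating the method so this sketch runs on its own) might look like:

```ruby
require 'benchmark'

# split_shard as above: a single pass with a counter-based pivot.
def split_shard(s)
  shard1 = {}
  shard2 = {}
  t = Benchmark.measure do
    n = 0
    pivot = s.size / 2
    s.each_pair do |k, v|
      if n < pivot
        shard1[k] = v
      else
        shard2[k] = v
      end
      n += 1
    end
  end
  $b += t.real   # accumulated wall-clock seconds across calls
  $e += s.size   # accumulated element count across calls
  return shard1, shard2
end

$b = 0.0  # must exist before the first call
$e = 0

s = Hash[(1..100).map { |i| ["k#{i}", i] }]
a, b = split_shard(s)

puts a.size  # => 50
puts b.size  # => 50
```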


The results


In both cases, a large number of hashes are split into shards. The total number of elements across all of the hashes in the test data set was 1,680,324.

My initial solution - which had to be faster because it uses methods in the standard library and minimises the amount of Ruby code (no loop, no conditional) - runs in just over 9s

Mark's approach runs in just over 5s

That's a significant win.


Takeaway


Don't be fooled by 'intuition' - measure the performance of competing algorithms.

Don't worry about Ruby's performance as a language - my initial concern was that doing ten million of these operations might take a significant amount of time in Ruby, but it doesn't really.

Thanks to Mark and Mike who both get points from me for their help.

Thanks!

Solution

This probably isn't fast enough for your needs (which sound like they'll require an extension in C), but perhaps you could use Hash#select?
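The answer names Hash#select without showing it; here is a minimal sketch of partitioning that way, assuming Ruby 1.9+ (where Hash#select returns a Hash and iteration follows insertion order):

```ruby
require 'set'

s = Hash[(1..10).map { |i| [i, i * i] }]

# Membership set for the first half of the keys.
first_half = Set.new(s.keys.first(s.size / 2))

# Each select/reject call walks the entire hash once.
s1 = s.select { |k, _| first_half.include?(k) }
s2 = s.reject { |k, _| first_half.include?(k) }

puts s1.size  # => 5
puts s2.size  # => 5
```

On Ruby 1.8, which this post predates 1.9's ubiquity, Hash#select returned an array of pairs rather than a Hash, so a `Hash[...]` wrapper would have been needed around each result.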

I agree with Mike Woodhouse's idea. Is it possible for you to construct your shards at the same place where the original 200k-item hash is being constructed? If the items are coming out of a database, you could split your query into multiple disjoint queries, based either on some aspect of the key or by repeatedly using something like LIMIT 10000 to grab a chunk at a time.
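The LIMIT-based idea can be sketched without a real database; `fetch_chunk` below is a hypothetical stand-in for a paged query along the lines of `SELECT ... ORDER BY id LIMIT 10 OFFSET n`:

```ruby
# Hypothetical rows standing in for the database table.
ROWS = (1..25).map { |i| [i, "value#{i}"] }

# Stand-in for a paged query; offset/limit mimic SQL OFFSET/LIMIT.
def fetch_chunk(offset, limit)
  ROWS[offset, limit] || []
end

shards = []
offset = 0
loop do
  chunk = fetch_chunk(offset, 10)
  break if chunk.empty?
  shards << Hash[chunk]   # each chunk becomes its own shard
  offset += chunk.size
end

puts shards.map(&:size).inspect  # => [10, 10, 5]
```

Each shard is built directly from its chunk, so the big intermediate hash never exists at all, which is the point of Mike's suggestion.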

Additional

Hi Chris, I just compared your approach to using Hash#select:

require 'benchmark'

s = {}
1.upto(200_000) { |i| s[i] = i}

Benchmark.bm do |x|
  x.report {
    pivot = s.size / 2
    slices = s.each_slice(pivot)
    s1 = Hash[*slices.entries[0].flatten]
    s2 = Hash[*slices.entries[1].flatten]
  }
  x.report {
    s1 = {}
    s2 = {}
    s.each_pair do |k,v|
      if k < 100_001
        s1[k] = v
      else
        s2[k] = v
      end
    end
  }
end

It looks like Hash#select is much faster, even though it goes through the entire large hash for each one of the sub-hashes:

# ruby test.rb 
      user     system      total        real
  0.560000   0.010000   0.570000 (  0.571401)
  0.320000   0.000000   0.320000 (  0.323099)

Hope this helps.
