Working with a large data object between Ruby processes


Question

I have a Ruby hash that reaches approximately 10 megabytes when written to a file using Marshal.dump. After gzip compression it is approximately 500 kilobytes.

Iterating through and altering this hash is very fast in Ruby (fractions of a millisecond). Even copying it is extremely fast.

The problem is that I need to share the data in this hash between Ruby on Rails processes. To do this using the Rails cache (file_store or memcached) I need to Marshal.dump the hash first. However, this incurs a 1000 millisecond delay when serializing it and a 400 millisecond delay when deserializing it.

Ideally I would want to be able to save and load this hash from each process in under 100 milliseconds.

One idea is to spawn a new Ruby process that holds this hash and provides an API for the other processes to modify or process the data within it, but I want to avoid doing this unless I'm certain that there are no other ways to share this object quickly.

Is there a way I can more directly share this hash between processes without needing to serialize or deserialize it?

Here is the code I'm using to generate a hash similar to the one I'm working with:

@a = []
0.upto(500) do |r|
  @a[r] = []
  0.upto(10_000) do |c|
    if rand(10) == 0 
      @a[r][c] = 1 # 10% chance of being 1
    else
      @a[r][c] = 0
    end
  end
end

@c = Marshal.dump(@a) # 1000 milliseconds
Marshal.load(@c) # 400 milliseconds
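
The timings in the comments above can be reproduced with a small benchmark. This is a minimal sketch using the standard `benchmark` library; the variable names are illustrative, and the numbers it prints depend on the machine and the Ruby version, so they may differ considerably from the 1000 ms and 400 ms reported here.

```ruby
require 'benchmark'

# Build a 501 x 10_001 nested array where roughly 10% of the cells are 1,
# mirroring the structure generated in the question.
rows = Array.new(501) { Array.new(10_001) { rand(10) == 0 ? 1 : 0 } }

dumped = nil
dump_seconds = Benchmark.realtime { dumped = Marshal.dump(rows) }

restored = nil
load_seconds = Benchmark.realtime { restored = Marshal.load(dumped) }

puts format('dump: %.1f ms, load: %.1f ms, size: %d bytes',
            dump_seconds * 1000, load_seconds * 1000, dumped.bytesize)
```

Measuring the two steps separately shows whether serialization or deserialization dominates the cost before choosing a different way to share the data.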

Update:

Since my original question did not receive many responses, I'm assuming there's no solution as easy as I had hoped.

Presently I'm considering two options:


  1. Create a Sinatra application that stores this hash and exposes an API to modify/access it.

  2. Create a C application that does the same as #1, but a lot faster.

The scope of my problem has increased such that the hash may be larger than my original example, so #2 may be necessary. But I have no idea where to start in terms of writing a C application that exposes an appropriate API.

A good walkthrough of how best to implement #1 or #2 may receive best answer credit.

Update 2

I ended up implementing this as a separate application written in Ruby 1.9 that has a DRb interface to communicate with application instances. I use the Daemons gem to spawn DRb instances when the web server starts up. On startup the DRb application loads the necessary data from the database, and then it communicates with the client to return results and to stay up to date. It's running quite well in production now. Thanks for the help!
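
The shape of such a service can be sketched with the standard `drb` library alone. Everything named here (`MatrixService`, its methods, the tiny in-memory table standing in for data loaded from the database) is a hypothetical illustration, not the code that was deployed, and the Daemons gem is left out. The client runs in the same process only to keep the sketch self-contained.

```ruby
require 'drb'

# Hypothetical service object: it owns the data and exposes small queries,
# so only arguments and results cross the process boundary.
class MatrixService
  include DRb::DRbUndumped # hand out a reference, never a marshaled copy

  def initialize(rows)
    @rows = rows
    @lock = Mutex.new # DRb may serve requests on several threads
  end

  def get(r, c)
    @lock.synchronize { @rows[r][c] }
  end

  def set(r, c, value)
    @lock.synchronize { @rows[r][c] = value }
  end

  def row_sum(r)
    @lock.synchronize { @rows[r].sum }
  end
end

# Stand-in for "load the necessary data from the database" (assumption).
rows = Array.new(3) { Array.new(5, 0) }

# Port 0 asks the OS for a free port; DRb.uri then reports the real address.
DRb.start_service('druby://localhost:0', MatrixService.new(rows))
service_uri = DRb.uri

# A real client would live in another process and be given service_uri.
service = DRbObject.new_with_uri(service_uri)
service.set(1, 2, 1)
service.set(1, 4, 1)
total = service.row_sum(1)
puts total # prints 2

DRb.stop_service
```

Keeping the large structure inside one process and sending only small requests avoids the Marshal.dump and Marshal.load cost on every access, at the price of one network round trip per call.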

Recommended answer

A Sinatra app will work, but the {un}serializing and the HTML parsing could impact performance compared to a DRb service.

Here's an example, based on your example in the related question. I'm using a hash instead of an array so you can use user ids as indexes. This way there is no need to keep both a table of interests and a table of user ids on the server. Note that the interest table is "transposed" compared to your example, which is the way you want it anyway, so it can be updated in one call.
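
To make the data layout concrete before DRb enters the picture, here is a minimal in-process sketch of the same idea: a hash keyed by user id whose values are arrays of booleans, one per interest. The sample data and the standalone `closest` helper are illustrative assumptions that mirror the scoring logic of the server code.

```ruby
# Hypothetical sample data: user id => one boolean per interest.
interests_by_user = {
  1 => [true,  false, true,  false],
  2 => [true,  true,  false, false],
  3 => [false, false, true,  false]
}

# Score every user by how many of the given user's interests they share,
# returning [match_count, user_id] pairs sorted in ascending order.
def closest(table, user_id)
  mine = table.fetch(user_id)
  selected = mine.each_index.select { |i| mine[i] }
  table.map { |other_id, theirs| [selected.count { |i| theirs[i] }, other_id] }.sort
end

result = closest(interests_by_user, 1)
puts result.inspect # prints [[1, 2], [1, 3], [2, 1]]

# Because the table is keyed by user, adding a user is a single assignment.
interests_by_user[4] = [true, false, true, true]
```

With this layout, one hash assignment per user is all an update needs, which is why a single remote call suffices once the table lives behind DRb.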

# server.rb
require 'drb'

class InterestServer < Hash
  include DRbUndumped # don't send the data over!

  def closest(cur_user_id)
    cur_interests = fetch(cur_user_id)
    selected_interests = cur_interests.each_index.select{|i| cur_interests[i]}

    scores = map do |user_id, interests|
      nb_match = selected_interests.count{|i| interests[i] }
      [nb_match, user_id]
    end
    scores.sort!
  end
end

DRb.start_service nil, InterestServer.new
puts DRb.uri

DRb.thread.join


# client.rb

uri = ARGV.shift
require 'drb'
DRb.start_service
interest_server = DRbObject.new nil, uri


USERS_COUNT = 10_000
INTERESTS_COUNT = 500

# Mock users
users = Array.new(USERS_COUNT) { {:id => rand(100000)+100000} }

# Initial send over user interests
users.each do |user|
  interest_server[user[:id]] = Array.new(INTERESTS_COUNT) { rand(10) == 0 }
end

# query at will
puts interest_server.closest(users.first[:id]).inspect

# update, say there's a new user:
new_user = {:id => 42}
users << new_user
# This guy is interested in everything!
interest_server[new_user[:id]] = Array.new(INTERESTS_COUNT) { true } 

puts interest_server.closest(users.first[:id])[-2,2].inspect
# Will output our first user and this new user which both match perfectly

To run this in a terminal, start the server and pass the URI it prints as the argument to the client:

$ ruby server.rb
druby://mal.lan:51630

$ ruby client.rb druby://mal.lan:51630
[[0, 100035], ...]

[[45, 42], [45, 178902]]
