What's the proper way to parse a very large JSON file in Ruby?


Question

How can we parse a JSON file in Ruby?

require 'json'

JSON.parse File.read('data.json')

What if the file is very large and we don't want to load it into memory all at once? How would we parse it then?

Answer

Since you said you don't want to load it into memory all at once, parsing it in chunks may be more suitable for you. You can use the yajl-ffi gem to achieve this. From its documentation:

For larger documents, we can use an IO object to stream it into the parser. We still need room for the parsed object, but the document itself is never fully read into memory.

require 'yajl/ffi'
stream = File.open('/tmp/test.json')
obj = Yajl::FFI::Parser.parse(stream)

However, when streaming small documents from disk, or over the network, the yajl-ruby gem will give us the best performance.
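
For example, here is a minimal sketch of that path (the filename is a placeholder), relying on the fact that Yajl::Parser.parse also accepts an IO object:

require 'yajl'

# yajl-ruby reads the IO in chunks internally, so the file is not
# slurped into one Ruby string first; the fully parsed result,
# however, still ends up in memory.
obj = File.open('data.json', 'r') do |f|
  Yajl::Parser.parse(f)
end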

Huge documents arriving over the network in small chunks to an EventMachine receive_data loop is where Yajl::FFI is uniquely suited. Inside an EventMachine::Connection subclass we might have:

def post_init
  # Register the streaming callbacks once, when the connection opens.
  @parser = Yajl::FFI::Parser.new
  @parser.start_document { puts "start document" }
  @parser.end_document   { puts "end document" }
  @parser.start_object   { puts "start object" }
  @parser.end_object     { puts "end object" }
  @parser.start_array    { puts "start array" }
  @parser.end_array      { puts "end array" }
  @parser.key            { |k| puts "key: #{k}" }
  @parser.value          { |v| puts "value: #{v}" }
end

def receive_data(data)
  # Feed each network chunk to the parser; parse state carries over
  # between calls, so a token split across chunks is handled correctly.
  begin
    @parser << data
  rescue Yajl::FFI::ParserError => e
    close_connection
  end
end

The parser accepts chunks of the JSON document and parses up to the end of the available buffer. Passing in more data resumes the parse from the prior state. When an interesting state change happens, the parser notifies all registered callback procs of the event.

The event callbacks are where we can do interesting data filtering and pass results on to other processes. The example above simply prints state changes, but the callbacks might look for an array named rows and process sets of these row objects in small batches. Millions of rows, streaming over the network, can be processed in constant memory space this way.
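
As a rough illustration of that idea (this sketch is not part of the original answer): assuming the document is one flat array of row objects, and using a hypothetical process handler and batch size, the callbacks can assemble rows and flush them in batches:

require 'yajl/ffi'

BATCH_SIZE = 500  # hypothetical batch size

def process(batch)
  # Hypothetical handler: insert into a database, enqueue a job, etc.
  puts "flushing #{batch.size} rows"
end

parser = Yajl::FFI::Parser.new
batch  = []
row    = nil
key    = nil

# Assumes a document shaped like [{...}, {...}, ...]; nested objects
# would need explicit depth tracking.
parser.start_document { }
parser.end_document   { process(batch) unless batch.empty? }  # flush the remainder
parser.start_array    { }
parser.end_array      { }
parser.start_object   { row = {} }
parser.key            { |k| key = k }
parser.value          { |v| row[key] = v if row }
parser.end_object do
  batch << row
  row = nil
  if batch.size >= BATCH_SIZE
    process(batch)
    batch.clear
  end
end

# Feed the file to the parser in small chunks; memory use stays
# roughly constant regardless of file size.
File.open('rows.json') do |f|
  parser << f.read(8192) until f.eof?
end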

