Ruby paging over API response dataset causes memory spike

Problem description

I'm experiencing an issue with a large memory spike when I page through a dataset returned by an API. The API is returning ~150k records, I'm requesting 10k records at a time and paging through 15 pages of data. The data is an array of hashes, each hash containing 25 keys with ~50-character string values. This process kills my 512mb Heroku dyno.

I have a method used for paging an API response dataset.

def all_pages value_key = 'values', &block
  response = {}
  values = []
  current_page = 1
  total_pages = 1
  offset = 0

  begin
    response = yield offset

    #The following seems to be the culprit
    values += response[value_key] if response.key? value_key

    offset = response['offset']
    total_pages = (response['totalResults'].to_f / response['limit'].to_f).ceil if response.key? 'totalResults'
  end while (current_page += 1) <= total_pages

  values
end

I call this method like so:

all_pages("items") do |current_page|
  get "#{data_uri}/data", query: {offset: current_page, limit: 10000}
end

I know it's the concatenation of the arrays that is causing the issue, as removing that line allows the process to run with no memory issues. What am I doing wrong? The whole dataset is probably no larger than 20MB - how is that consuming all the dyno memory? What can I do to improve the efficiency here?
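
For illustration (this sketch is not part of the original question), values += page is shorthand for values = values + page, which allocates a brand-new array on every iteration and leaves the previous one behind for the garbage collector:

# Minimal illustrative sketch: += rebinds `values` to a freshly allocated
# array on each pass, which is visible from its changing object_id.
values = []
3.times do |i|
  page = Array.new(10_000) { 'x' * 50 }  # stand-in for one page of API data
  old_id = values.object_id
  values += page                         # really: values = values + page
  puts "pass #{i}: new array allocated? #{old_id != values.object_id}"
end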

Update

The response looks like this:

{"totalResults":208904,"offset":0,"count":1,"hasMore":true,"limit":"10000","items":[...]}
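
With the numbers in this sample response, the total_pages calculation in all_pages works out as follows (illustrative arithmetic only, plugging the sample values into the line from the method above):

# 208904 total results at 10000 per page round up to 21 pages
total_pages = (208904.to_f / '10000'.to_f).ceil   # => 21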

Update 2

Running with report shows the following:

[HTTParty] [2014-08-13 13:11:22 -0700] 200 "GET 29259/data" -
Memory 171072KB
[HTTParty] [2014-08-13 13:11:26 -0700] 200 "GET 29259/data" -
Memory 211960KB
  ... removed for brevity ...
[HTTParty] [2014-08-13 13:12:28 -0700] 200 "GET 29259/data" -
Memory 875760KB
[HTTParty] [2014-08-13 13:12:33 -0700] 200 "GET 29259/data" -
Errno::ENOMEM: Cannot allocate memory - ps ax -o pid,rss | grep -E "^[[:space:]]*23137"
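
For reference, a hypothetical helper (not from the original post) that produces the kind of "Memory ...KB" lines shown above by reading the process RSS between requests; it assumes a Unix-like system where ps is available:

# Hypothetical memory-logging helper; RSS is reported in kilobytes,
# matching the "Memory ...KB" lines above.
def log_rss(label = 'Memory')
  rss_kb = `ps -o rss= -p #{Process.pid}`.strip.to_i
  puts "#{label} #{rss_kb}KB"
end

# Example usage: call it after each page request, e.g. right after
# `response = yield offset` in all_pages.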

Update 3

I can recreate the issue with the basic script below. The script is hard-coded to pull only 100k records and already consumes over 512MB of memory on my local VM.

#! /usr/bin/ruby
require 'uri'
require 'net/http'
require 'json'

uri = URI.parse("https://someapi.com/data")
offset = 0
values = []

begin
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.set_debug_output($stdout)

  request = Net::HTTP::Get.new(uri.request_uri + "?limit=10000&offset=#{offset}")
  request.add_field("Content-Type", "application/json")
  request.add_field("Accept", "application/json")

  response = http.request(request)
  json_response = JSON.parse(response.body)

  values << json_response['items']
  offset += 10000

end while offset < 100_000

values

Update 4

I've made a couple of improvements which seem to help but not completely alleviate the issue.

1) Using symbolize_keys turned out to consume less memory. This is because the keys of each hash are the same, and it's cheaper to symbolize them than to parse them as separate Strings.

2) Switching to ruby-yajl for JSON parsing consumes significantly less memory as well.
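
For illustration (not from the original post), the two parsing variants compared below; note that the stdlib JSON parser spells the option symbolize_names, while yajl-ruby uses symbolize_keys:

require 'json'
require 'yajl'

body = '{"items":[{"name":"a"},{"name":"b"}]}'

# Stdlib parser: symbol keys via :symbolize_names
stdlib = JSON.parse(body, symbolize_names: true)        # => {:items=>[{:name=>"a"}, {:name=>"b"}]}

# yajl-ruby parser: symbol keys via :symbolize_keys
yajl = Yajl::Parser.parse(body, symbolize_keys: true)

# With symbol keys, every record shares the same key objects instead of
# allocating a fresh String for each key of each record.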

Memory consumption of processing 200k records:

JSON.parse(response.body): 861080KB (Before completely running out of memory)

JSON.parse(response.body, symbolize_keys: true): 573580KB

Yajl::Parser.parse(response.body): 357236KB

Yajl::Parser.parse(response.body, symbolize_keys: true): 264576KB

This is still an issue.

  • Why does a dataset that is no larger than 20MB take that much memory to process?
  • What is the "right way" to process a large dataset like this?
  • What do I do when the dataset grows 10x larger? 100x larger than the original?

I will buy a beer for anyone who can thoroughly answer these three questions!

Many thanks.

Answer

You've identified the problem to be using += with your array. So the likely solution is to add the data without creating a new array each time.

values.push response[value_key] if response.key? value_key

Or using <<:

values << response[value_key] if response.key? value_key

You should only use += if you actually want a new array. It doesn't appear that you want a new array here; you just want all the elements in a single array.
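
One caveat worth adding (this note is not part of the original answer): push and << append the page array itself as a single nested element, so values becomes an array of arrays. If the caller expects a flat array of records while still mutating in place, Array#concat appends the page's elements without allocating a new outer array each time:

# Appends the elements of the page in place, without allocating a new
# outer array each iteration and without nesting arrays.
values.concat(response[value_key]) if response.key? value_key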
