BigQuery UDF memory exceeded error on multiple rows but works fine on single row


Problem description

I'm writing a UDF to process Google Analytics data, and getting the "UDF out of memory" error message when I try to process multiple rows. I downloaded the raw data and found the largest record and tried running my UDF query on that, with success. Some of the rows have up to 500 nested hits, and the size of the hit record (by far the largest component of each row of the raw GA data) does seem to have an effect on how many rows I can process before getting the error.
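
A single-row test case like that can also be pulled straight from the table in legacy SQL. The query below is only a sketch, and it assumes totals.hits is the measure of "largest record" being used here:

select fullVisitorId, visitNumber, totals.hits
from [79689075.ga_sessions_20160201]
order by totals.hits desc
limit 1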

For example, the query

select 
    user.ga_user_id, 
    ga_session_id, 
        ...
from 
    temp_ga_processing(
        select 
            fullVisitorId, 
            visitNumber, 
                   ...            
        from [79689075.ga_sessions_20160201] limit 100)

returns the error, but

from [79689075.ga_sessions_20160201] where totals.hits = 500 limit 1) 

does not.

I was under the impression that any memory limitations were per-row? I've tried several techniques, such as setting row = null; before emit(return_dict); (where return_dict is the processed data) but to no avail.

The UDF itself doesn't do anything fancy; I'd paste it here but it's ~45 kB in length. It essentially does a bunch of things along the lines of:

function temp_ga_processing(row, emit) {
  topic_id = -1;
  hit_numbers = [];
  first_page_load_hits = [];
  return_dict = {};
  return_dict["user"] = {};
  return_dict["user"]["ga_user_id"] = row.fullVisitorId;
  return_dict["ga_session_id"] = row.fullVisitorId.concat("-".concat(row.visitNumber));
  for(i=0;i<row.hits.length;i++) {
    hit_dict = {};
    hit_dict["page"] = {};
    hit_dict["time"] = row.hits[i].time;
    hit_dict["type"] = row.hits[i].type;
    hit_dict["page"]["engaged_10s"] = false;
    hit_dict["page"]["engaged_30s"] = false;
    hit_dict["page"]["engaged_60s"] = false;

    add_hit = true;
    for(j=0;j<row.hits[i].customMetrics.length;j++) {
      if(row.hits[i].customDimensions[j] != null) {
        if(row.hits[i].customMetrics[j]["index"] == 3) {
          metrics = {"video_play_time": row.hits[i].customMetrics[j]["value"]};
          hit_dict["metrics"] = metrics;
          metrics = null;
          row.hits[i].customDimensions[j] = null;
        }
      }
    }

    hit_dict["topic"] = {};
    hit_dict["doctor"] = {};
    hit_dict["doctor_location"] = {};
    hit_dict["content"] = {};

    if(row.hits[i].customDimensions != null) {
      for(j=0;j<row.hits[i].customDimensions.length;j++) {
        if(row.hits[i].customDimensions[j] != null) {
          if(row.hits[i].customDimensions[j]["index"] == 1) {
            hit_dict["topic"] = {"name": row.hits[i].customDimensions[j]["value"]};
            row.hits[i].customDimensions[j] = null;
            continue;
          }
          if(row.hits[i].customDimensions[j]["index"] == 3) {
            if(row.hits[i].customDimensions[j]["value"].search("doctor") > -1) {
              return_dict["logged_in_as_doctor"] = true;
            }
          }
          // and so on...
        }
      }
    }
    if(row.hits[i]["eventInfo"]["eventCategory"] == "page load time" && row.hits[i]["eventInfo"]["eventLabel"].search("OUTLIER") == -1) {
      elre = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
      if(elre != null) {
        if(parseInt(elre[0].split(":")[1]) <= 60000) {
          first_page_load_hits.push(parseFloat(row.hits[i].hitNumber));
          if(hit_dict["page"]["page_load"] == null) {
            hit_dict["page"]["page_load"] = {};
          }
          hit_dict["page"]["page_load"]["sample"] = 1;
          page_load_time_re = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
          if(page_load_time_re != null) {
            hit_dict["page"]["page_load"]["page_load_time"] = parseFloat(page_load_time_re[0].split(':')[1])/1000;
          }
        }
        // and so on...  
      }
    }    
  }
  row = null;
  emit(return_dict);
}
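
For context, a legacy-SQL table UDF like this one is hooked up to the temp_ga_processing(...) call in the query through bigquery.defineFunction. The registration below is only a sketch: the input column list and the output schema are guesses based on the fields the function reads and emits, not the actual registration code.

// Sketch of the registration that exposes the function above to legacy SQL.
// Input columns and output schema are illustrative guesses, not the real ones.
bigquery.defineFunction(
  'temp_ga_processing',                      // name used in the FROM clause
  ['fullVisitorId', 'visitNumber', 'hits'],  // input columns passed to each row
  [{name: 'ga_session_id', type: 'string'},
   {name: 'logged_in_as_doctor', type: 'boolean'},
   {name: 'user', type: 'record',
    fields: [{name: 'ga_user_id', type: 'string'}]}],
  temp_ga_processing                         // the function defined above
);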

The job ID is realself-main:bquijob_4c30bd3d_152fbfcd7fd

Recommended answer

Update Aug 2016 : We have pushed out an update that will allow the JavaScript worker to use twice as much RAM. We will continue to monitor jobs that have failed with JS OOM to see if more increases are necessary; in the meantime, please let us know if you have further jobs failing with OOM. Thanks!

Update: this issue was related to limits we had on the size of the UDF code. It looks like V8's optimize+recompile pass of the UDF code generates a data segment that was bigger than our limits, but this was only happening when the UDF runs over a "sufficient" number of rows. I'm meeting with the V8 team this week to dig into the details further.

@Grayson - I was able to run your job over the entire 20160201 table successfully; the query takes 1-2 minutes to execute. Could you please verify that this works on your side?

We've gotten a few reports of similar issues that seem related to # rows processed. I'm sorry for the trouble; I'll be doing some profiling on our JavaScript runtime to try to find if and where memory is being leaked. Stay tuned for the analysis.

In the meantime, if you're able to isolate any specific rows that cause the error, that would also be very helpful.
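
One way to slice the table for that kind of bisection is sketched below; the hash-based predicate and the abbreviated inner column list are assumptions for illustration, and the real query would pass the same columns as the original:

select
    user.ga_user_id,
    ga_session_id
from
    temp_ga_processing(
        select
            fullVisitorId,
            visitNumber
        from [79689075.ga_sessions_20160201]
        where abs(hash(fullVisitorId)) % 10 = 0)

Running the same query with remainders 1 through 9 covers the rest of the table, and whichever slice still fails can be split again with a larger modulus.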
