BigQuery UDF memory exceeded error on multiple rows but works fine on single row


Problem description

I'm writing a UDF to process Google Analytics data, and I get the "UDF out of memory" error message when I try to process multiple rows. I downloaded the raw data, found the largest record, and ran my UDF query on it successfully. Some rows have up to 500 nested hits, and the size of the hit record (by far the largest component of each row of raw GA data) does seem to affect how many rows I can process before hitting the error.

For example, the query

select 
    user.ga_user_id, 
    ga_session_id, 
        ...
from 
    temp_ga_processing(
        select 
            fullVisitorId, 
            visitNumber, 
                   ...            
        from [79689075.ga_sessions_20160201] limit 100)

returns the error, but the same query with the inner select changed to

from [79689075.ga_sessions_20160201] where totals.hits = 500 limit 1) 

does not.

I was under the impression that any memory limitations were per-row? I've tried several techniques, such as setting row = null; before emit(return_dict); (where return_dict is the processed data) but to no avail.

The UDF itself doesn't do anything fancy; I'd paste it here but it's ~45 kB in length. It essentially does a bunch of things along the lines of:

function temp_ga_processing(row, emit) {
  topic_id = -1;
  hit_numbers = [];
  first_page_load_hits = [];
  return_dict = {};
  return_dict["user"] = {};
  return_dict["user"]["ga_user_id"] = row.fullVisitorId;
  return_dict["ga_session_id"] = row.fullVisitorId.concat("-".concat(row.visitNumber));
  for(i=0;i<row.hits.length;i++) {
    hit_dict = {};
    hit_dict["page"] = {};
    hit_dict["time"] = row.hits[i].time;
    hit_dict["type"] = row.hits[i].type;
    hit_dict["page"]["engaged_10s"] = false;
    hit_dict["page"]["engaged_30s"] = false;
    hit_dict["page"]["engaged_60s"] = false;

    add_hit = true;
    for(j=0;j<row.hits[i].customMetrics.length;j++) {
      if(row.hits[i].customDimensions[j] != null) {
        if(row.hits[i].customMetrics[j]["index"] == 3) {
          metrics = {"video_play_time": row.hits[i].customMetrics[j]["value"]};
          hit_dict["metrics"] = metrics;
          metrics = null;
          row.hits[i].customDimensions[j] = null;
        }
      }
    }

    hit_dict["topic"] = {};
    hit_dict["doctor"] = {};
    hit_dict["doctor_location"] = {};
    hit_dict["content"] = {};

    if(row.hits[i].customDimensions != null) {
      for(j=0;j<row.hits[i].customDimensions.length;j++) {
        if(row.hits[i].customDimensions[j] != null) {
          if(row.hits[i].customDimensions[j]["index"] == 1) {
            hit_dict["topic"] = {"name": row.hits[i].customDimensions[j]["value"]};
            row.hits[i].customDimensions[j] = null;
            continue;
          }
          if(row.hits[i].customDimensions[j]["index"] == 3) {
            if(row.hits[i].customDimensions[j]["value"].search("doctor") > -1) {
              return_dict["logged_in_as_doctor"] = true;
            }
          }
          // and so on...
        }
      }
    }
    if(row.hits[i]["eventInfo"]["eventCategory"] == "page load time" && row.hits[i]["eventInfo"]["eventLabel"].search("OUTLIER") == -1) {
      elre = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
      if(elre != null) {
        if(parseInt(elre[0].split(":")[1]) <= 60000) {
          first_page_load_hits.push(parseFloat(row.hits[i].hitNumber));
          if(hit_dict["page"]["page_load"] == null) {
            hit_dict["page"]["page_load"] = {};
          }
          hit_dict["page"]["page_load"]["sample"] = 1;
          page_load_time_re = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
          if(page_load_time_re != null) {
            hit_dict["page"]["page_load"]["page_load_time"] = parseFloat(page_load_time_re[0].split(':')[1])/1000;
          }
        }
        // and so on...  
      }
    }
  }
  row = null;           // drop the reference to the input row before emitting
  emit(return_dict);    // emit one output record per input row
}
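
For readers unfamiliar with legacy (non-standard SQL) BigQuery UDFs: a function like this is not called directly but registered with bigquery.defineFunction, which declares the input columns and output schema and feeds rows to the handler one at a time. The sketch below shows roughly what that registration looks like; the input column list and output schema are illustrative assumptions inferred from the fields the excerpt reads and writes, not the asker's actual registration code (which is part of the ~45 kB that was omitted).

// Hypothetical registration sketch for the UDF above (legacy BigQuery JS UDF API).
// Input columns and output schema are assumptions for illustration, inferred from
// the fields the handler reads (row.fullVisitorId, row.visitNumber, row.hits) and
// the fields it writes into return_dict.
bigquery.defineFunction(
  'temp_ga_processing',                        // name referenced in the outer query
  ['fullVisitorId', 'visitNumber', 'hits'],    // input columns passed to the handler
  [{name: 'user', type: 'record',              // (assumed) output schema
    fields: [{name: 'ga_user_id', type: 'string'}]},
   {name: 'ga_session_id', type: 'string'},
   {name: 'logged_in_as_doctor', type: 'boolean'}],
  temp_ga_processing                           // the row handler defined above
);

With this API the whole JavaScript body ships with the query as an inline UDF resource, which is presumably what the code-size limit mentioned in the answer below refers to.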

The job ID is realself-main:bquijob_4c30bd3d_152fbfcd7fd

Solution

Update Aug 2016: We have pushed out an update that will allow the JavaScript worker to use twice as much RAM. We will continue to monitor jobs that have failed with JS OOM to see if more increases are necessary; in the meantime, please let us know if you have further jobs failing with OOM. Thanks!

Update: this issue was related to limits we had on the size of the UDF code. It looks like V8's optimize+recompile pass of the UDF code generates a data segment that is bigger than our limits, but this only happens when the UDF runs over a "sufficient" number of rows. I'm meeting with the V8 team this week to dig into the details further.

@Grayson - I was able to run your job over the entire 20160201 table successfully; the query takes 1-2 minutes to execute. Could you please verify that this works on your side?


We've gotten a few reports of similar issues that seem related to the number of rows processed. I'm sorry for the trouble; I'll be doing some profiling on our JavaScript runtime to try to find if and where memory is being leaked. Stay tuned for the analysis.

In the meantime, if you're able to isolate any specific rows that cause the error, that would also be very helpful.
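
For example (a sketch, not part of the original answer), one way to narrow down the offending rows is to bisect with the same wrapper query from the question, starting from the largest sessions since hit count seems to matter, and then adjusting the totals.hits threshold and the limit until the failure point is found. The threshold and limit values below are arbitrary starting points, and the inner column list is reduced to the fields the UDF excerpt actually reads.

select
    user.ga_user_id,
    ga_session_id
from
    temp_ga_processing(
        select
            fullVisitorId,
            visitNumber,
            hits
        from [79689075.ga_sessions_20160201]
        where totals.hits >= 400
        limit 50)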
