BigQuery UDF memory exceeded error on multiple rows but works fine on single row


Problem description

I'm writing a UDF to process Google Analytics data, and getting the "UDF out of memory" error message when I try to process multiple rows. I downloaded the raw data and found the largest record and tried running my UDF query on that, with success. Some of the rows have up to 500 nested hits, and the size of the hit record (by far the largest component of each row of the raw GA data) does seem to have an effect on how many rows I can process before getting the error.
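
A single-row test case like that can also be pulled straight from the table in legacy SQL. The query below is only a sketch, and it assumes totals.hits is the measure of "largest record" being used here:

select fullVisitorId, visitNumber, totals.hits
from [79689075.ga_sessions_20160201]
order by totals.hits desc
limit 1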

For example, the query

select 
    user.ga_user_id, 
    ga_session_id, 
        ...
from 
    temp_ga_processing(
        select 
            fullVisitorId, 
            visitNumber, 
                   ...            
        from [79689075.ga_sessions_20160201] limit 100)

returns the error, but

from [79689075.ga_sessions_20160201] where totals.hits = 500 limit 1) 

does not.

I was under the impression that any memory limitations were per-row? I've tried several techniques, such as setting row = null; before emit(return_dict); (where return_dict is the processed data) but to no avail.

The UDF itself doesn't do anything fancy; I'd paste it here but it's ~45 kB in length. It essentially does a bunch of things along the lines of:

function temp_ga_processing(row, emit) {
  topic_id = -1;
  hit_numbers = [];
  first_page_load_hits = [];
  return_dict = {};
  return_dict["user"] = {};
  return_dict["user"]["ga_user_id"] = row.fullVisitorId;
  return_dict["ga_session_id"] = row.fullVisitorId.concat("-".concat(row.visitNumber));
  for(i=0;i<row.hits.length;i++) {
    hit_dict = {};
    hit_dict["page"] = {};
    hit_dict["time"] = row.hits[i].time;
    hit_dict["type"] = row.hits[i].type;
    hit_dict["page"]["engaged_10s"] = false;
    hit_dict["page"]["engaged_30s"] = false;
    hit_dict["page"]["engaged_60s"] = false;

    add_hit = true;
    for(j=0;j<row.hits[i].customMetrics.length;j++) {
      if(row.hits[i].customDimensions[j] != null) {
        if(row.hits[i].customMetrics[j]["index"] == 3) {
          metrics = {"video_play_time": row.hits[i].customMetrics[j]["value"]};
          hit_dict["metrics"] = metrics;
          metrics = null;
          row.hits[i].customDimensions[j] = null;
        }
      }
    }

    hit_dict["topic"] = {};
    hit_dict["doctor"] = {};
    hit_dict["doctor_location"] = {};
    hit_dict["content"] = {};

    if(row.hits[i].customDimensions != null) {
      for(j=0;j<row.hits[i].customDimensions.length;j++) {
        if(row.hits[i].customDimensions[j] != null) {
          if(row.hits[i].customDimensions[j]["index"] == 1) {
            hit_dict["topic"] = {"name": row.hits[i].customDimensions[j]["value"]};
            row.hits[i].customDimensions[j] = null;
            continue;
          }
          if(row.hits[i].customDimensions[j]["index"] == 3) {
            if(row.hits[i].customDimensions[j]["value"].search("doctor") > -1) {
              return_dict["logged_in_as_doctor"] = true;
            }
          }
          // and so on...
        }
      }
    }
    if(row.hits[i]["eventInfo"]["eventCategory"] == "page load time" && row.hits[i]["eventInfo"]["eventLabel"].search("OUTLIER") == -1) {
      elre = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
      if(elre != null) {
        if(parseInt(elre[0].split(":")[1]) <= 60000) {
          first_page_load_hits.push(parseFloat(row.hits[i].hitNumber));
          if(hit_dict["page"]["page_load"] == null) {
            hit_dict["page"]["page_load"] = {};
          }
          hit_dict["page"]["page_load"]["sample"] = 1;
          page_load_time_re = /(?:onLoad|pl|page):(\d+)/.exec(row.hits[i]["eventInfo"]["eventLabel"]);
          if(page_load_time_re != null) {
            hit_dict["page"]["page_load"]["page_load_time"] = parseFloat(page_load_time_re[0].split(':')[1])/1000;
          }
        }
        // and so on...  
      }
    }    
  }
  row = null;
  emit(return_dict);
}
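
For context, a legacy-SQL table UDF like this one is hooked up to the temp_ga_processing(...) call in the query through bigquery.defineFunction. The registration below is only a sketch: the input column list and the output schema are guesses based on the fields the function reads and emits, not the actual registration code.

// Sketch of the registration that exposes the function above to legacy SQL.
// Input columns and output schema are illustrative guesses, not the real ones.
bigquery.defineFunction(
  'temp_ga_processing',                      // name used in the FROM clause
  ['fullVisitorId', 'visitNumber', 'hits'],  // input columns passed to each row
  [{name: 'ga_session_id', type: 'string'},
   {name: 'logged_in_as_doctor', type: 'boolean'},
   {name: 'user', type: 'record',
    fields: [{name: 'ga_user_id', type: 'string'}]}],
  temp_ga_processing                         // the function defined above
);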

The job ID is realself-main:bquijob_4c30bd3d_152fbfcd7fd

Recommended answer

Update Aug 2016 : We have pushed out an update that will allow the JavaScript worker to use twice as much RAM. We will continue to monitor jobs that have failed with JS OOM to see if more increases are necessary; in the meantime, please let us know if you have further jobs failing with OOM. Thanks!

Update: this issue was related to limits we had on the size of the UDF code. It looks like V8's optimize+recompile pass of the UDF code generates a data segment that was bigger than our limits, but this was only happening when the UDF runs over a "sufficient" number of rows. I'm meeting with the V8 team this week to dig into the details further.

@Grayson - I was able to run your job over the entire 20160201 table successfully; the query takes 1-2 minutes to execute. Could you please verify that this works on your side?

We've gotten a few reports of similar issues that seem related to # rows processed. I'm sorry for the trouble; I'll be doing some profiling on our JavaScript runtime to try to find if and where memory is being leaked. Stay tuned for the analysis.

In the meantime, if you're able to isolate any specific rows that cause the error, that would also be very helpful.
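
One way to slice the table for that kind of bisection is sketched below; the hash-based predicate and the abbreviated inner column list are assumptions for illustration, and the real query would pass the same columns as the original:

select
    user.ga_user_id,
    ga_session_id
from
    temp_ga_processing(
        select
            fullVisitorId,
            visitNumber
        from [79689075.ga_sessions_20160201]
        where abs(hash(fullVisitorId)) % 10 = 0)

Running the same query with remainders 1 through 9 covers the rest of the table, and whichever slice still fails can be split again with a larger modulus.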
