How to perform trigram operations in Google BigQuery?


Problem description


I use the pg_trgm module in PostgreSQL to calculate the similarity between two strings using trigrams. In particular, I use:

similarity(text, text)

It returns a number between 0 and 1 that indicates how similar the two arguments are.

How can I perform a similarity function (or an equivalent) in Google BigQuery?

Solution

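Try the query below, at least as a blueprint for further enhancement. Note that it scores each pair with a Levenshtein (edit) distance computed in a legacy SQL JavaScript UDF, not with trigrams.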

SELECT text1, text2, similarity FROM 
JS(
// input table
(
  SELECT * FROM 
  (SELECT 'mikhail' AS text1, 'mikhail' AS text2),
  (SELECT 'mikhail' AS text1, 'mike' AS text2),
  (SELECT 'mikhail' AS text1, 'michael' AS text2),
  (SELECT 'mikhail' AS text1, 'javier' AS text2),
  (SELECT 'mikhail' AS text1, 'thomas' AS text2)
) ,
// input columns
text1, text2,
// output schema
"[{name: 'text1', type:'string'},
  {name: 'text2', type:'string'},
  {name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {

  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  // declare both variables so the_text2 does not leak as an implicit global
  var the_text1, the_text2;

  try {
    the_text1 = decodeURI(r.text1).toLowerCase();
  } catch (ex) {
    the_text1 = r.text1.toLowerCase();
  }

  try {
    the_text2 = decodeURI(r.text2).toLowerCase();
  } catch (ex) {
    the_text2 = r.text2.toLowerCase();
  }

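  // normalise by the longer string so the score stays within [0, 1]
  // (dividing by the_text1.length alone can go negative or divide by zero)
  var maxLen = Math.max(the_text1.length, the_text2.length);
  emit({text1: the_text1, text2: the_text2,
        similarity: maxLen === 0 ? 1 : 1 - Levenshtein.get(the_text1, the_text2) / maxLen});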

  }"
)
ORDER BY similarity DESC

This is a light modification of https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js by @thomaspark.
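
Since the question asks about trigrams and the query above measures edit distance, a trigram-based score could be sketched roughly as follows. This is only an approximation of what pg_trgm does, not its actual implementation. The helper names trigrams and trigramSimilarity are hypothetical, and the sketch assumes that words are runs of ASCII letters and digits, that each word is padded with two leading spaces and one trailing space (the rule described in the pg_trgm documentation), and that the score is the number of shared distinct trigrams divided by the number of distinct trigrams in either string.

```javascript
// Hypothetical helpers: an approximation of pg_trgm-style similarity,
// not PostgreSQL's actual implementation.

// Collect the distinct trigrams of a string as the keys of a plain object.
function trigrams(text) {
  var set = {};
  // Assumption: a word is a run of ASCII letters and digits.
  var words = String(text).toLowerCase().split(/[^a-z0-9]+/);
  for (var w = 0; w < words.length; ++w) {
    if (words[w].length === 0) continue;
    // Pad each word with two leading spaces and one trailing space.
    var padded = '  ' + words[w] + ' ';
    for (var i = 0; i + 3 <= padded.length; ++i) {
      set[padded.substring(i, i + 3)] = true;
    }
  }
  return set;
}

// Shared distinct trigrams divided by all distinct trigrams (0 to 1).
function trigramSimilarity(a, b) {
  var ta = trigrams(a), tb = trigrams(b);
  var shared = 0, total = 0, t;
  for (t in ta) {
    if (ta.hasOwnProperty(t)) {
      ++total;
      if (tb.hasOwnProperty(t)) ++shared;
    }
  }
  for (t in tb) {
    if (tb.hasOwnProperty(t) && !ta.hasOwnProperty(t)) ++total;
  }
  // Two inputs without any trigrams are scored 0 here (a choice made for this sketch).
  return total === 0 ? 0 : shared / total;
}
```

With these definitions, 'mikhail' and 'mike' share 3 of 10 distinct trigrams, so trigramSimilarity('mikhail', 'mike') evaluates to 0.3. To try it in the query, both helpers could be pasted into the UDF body and the similarity field passed to emit computed as trigramSimilarity(the_text1, the_text2). The sketch deliberately avoids double quotes and backslashes so that it can sit inside the double-quoted function string.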
