REGEXP_REPLACE pattern has to be const? Comparing strings in BigQuery

Views: 152 | Published: 2018/5/7 17:32:14 | Tag: google-bigquery

Problem description

I'm trying to measure similarity between strings using Dice's coefficient (aka pair similarity) in BigQuery. For a moment I thought I could do that using only standard functions.

Suppose I need to compare "gana" and "gano". I would "cook" these two strings upfront into 'ga|an|na' and 'ga|an|no' (lists of 2-grams) and then do this:

  REGEXP_REPLACE('ga|an|na', 'ga|an|no', '')

Then, based on the change in length, I can calculate my coefficient.

But once applied to a table, I get:

  REGEXP_REPLACE second argument must be const and non-null

Is there any workaround for that? With plain REPLACE() the second argument can be a field.

Maybe there is a better way to do it? I know I could write a UDF instead, but I wanted to avoid UDFs here. We are running big tasks, and UDFs are generally slower (at least in my experience) and are subject to a different concurrency limit.

Solution

On "REGEXP_REPLACE second argument must be const and non-null. Is there any workaround for that?":

Below is just an idea/direction for the logic you described, that is, "cooking" the two strings into 'ga|an|na' and 'ga|an|no', running REGEXP_REPLACE('ga|an|na', 'ga|an|no', ''), and deriving the coefficient from the change in length. The "workaround" is:

  SELECT a.w AS w1, b.w AS w2, SUM(a.x = b.x) / COUNT(1) AS c
  FROM (
    SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
    FROM (SELECT 'gana' AS w, 'ga|an|na' AS p)
  ) AS a
  JOIN (
    SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
    FROM
      -- in legacy SQL, a comma between subqueries acts as UNION ALL
      (SELECT 'gano' AS w, 'ga|an|no' AS p),
      (SELECT 'gamo' AS w, 'ga|am|mo' AS p),
      (SELECT 'kana' AS w, 'ka|an|na' AS p)
  ) AS b
  ON a.pos = b.pos
  GROUP BY w1, w2
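
To sanity-check the numbers this workaround should return, here is a minimal Python sketch of the same position-by-position comparison. It is not part of the original answer: the function name `positional_match` is hypothetical, and the sketch assumes the row numbers in the query follow the order of the 2-grams within each list.

```python
def positional_match(p1: str, p2: str, sep: str = "|") -> float:
    """Fraction of aligned positions whose pre-built 2-grams are equal."""
    a = p1.split(sep)
    b = p2.split(sep)
    # zip pairs elements by position, like JOIN ... ON a.pos = b.pos;
    # positions present in only one list drop out, as in an inner join.
    pairs = list(zip(a, b))
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


if __name__ == "__main__":
    for other in ("ga|an|no", "ga|am|mo", "ka|an|na"):
        print(other, positional_match("ga|an|na", other))
```

For the sample rows, 'gana' agrees with 'gano' and with 'kana' on 2 of 3 positions and with 'gamo' on 1 of 3. Note that this is a positional match rate, not yet Dice's coefficient.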

On "Maybe there is a better way to do it?": below is a simple example of how pair similarity can be approached here, including building the bigram sets and calculating the coefficient:

  SELECT
    a.word AS word1,
    b.word AS word2,
    -- 2 * shared bigrams / (distinct bigrams in a + distinct bigrams in b)
    2 * SUM(a.bigram = b.bigram) /
      (EXACT_COUNT_DISTINCT(a.bigram) + EXACT_COUNT_DISTINCT(b.bigram)) AS c
  FROM (
    -- distinct bigrams of the first word
    SELECT word, char + next_char AS bigram
    FROM (
      SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
      FROM (
        SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
        FROM (SELECT 'gana' AS word)
      )
    )
    WHERE next_char IS NOT NULL
    GROUP BY 1, 2
  ) a
  CROSS JOIN (
    -- distinct bigrams of the second word
    SELECT word, char + next_char AS bigram
    FROM (
      SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
      FROM (
        SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
        FROM (SELECT 'gano' AS word)
      )
    )
    WHERE next_char IS NOT NULL
    GROUP BY 1, 2
  ) b
  GROUP BY 1, 2
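
As a cross-check for the query above, the same coefficient can be computed outside BigQuery. The following is a minimal Python sketch, not something from the original answer: the helper names `bigrams` and `dice` are hypothetical, and returning 0.0 when neither word has a bigram is an assumption made only for this illustration.

```python
def bigrams(word: str) -> set:
    """Set of distinct adjacent character pairs, e.g. 'gana' -> {'ga', 'an', 'na'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(w1: str, w2: str) -> float:
    """Dice's coefficient on bigram sets: 2 * |A & B| / (|A| + |B|)."""
    a, b = bigrams(w1), bigrams(w2)
    if not a and not b:
        # Assumption for this sketch: words too short to have bigrams score 0.0.
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    for other in ("gano", "gamo", "kana", "gana"):
        print("gana", other, dice("gana", other))
```

Using sets mirrors the GROUP BY 1, 2 step in the query, which keeps each bigram once per word. For 'gana' and 'gano' the sets share two of three bigrams each, so the coefficient is 2 * 2 / (3 + 3).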
