REGEXP_REPLACE pattern has to be const? Comparing strings in BigQuery

Question


I'm trying to measure the similarity between strings in BigQuery using Dice's coefficient (also known as pair similarity). For a moment I thought I could do that with standard functions alone.

Suppose I need to compare "gana" and "gano". Then I would "cook" these two strings upfront into 'ga|an|na' and 'ga|an|no' (lists of 2-grams) and do this:

REGEXP_REPLACE('ga|an|na', 'ga|an|no', '')

Then, based on the change in length, I can calculate the coefficient.
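
To make the length-change idea concrete, here is a minimal Python sketch that runs outside BigQuery. The function name `dice_from_regex` is hypothetical, and the sketch assumes the 2-grams contain only plain letters, so the second string can be used as a regex alternation without escaping.

```python
import re


def dice_from_regex(cooked_a: str, cooked_b: str) -> float:
    """Estimate Dice's coefficient from the length change after a regex replace.

    Both arguments are '|'-separated lists of 2-grams, e.g. 'ga|an|na'.
    cooked_b doubles as the alternation pattern, as in the question.
    Assumption: the 2-grams contain no regex metacharacters.
    """
    stripped = re.sub(cooked_b, "", cooked_a)
    # Each removed 2-gram shortens the string by exactly two characters.
    shared = (len(cooked_a) - len(stripped)) // 2
    total = len(cooked_a.split("|")) + len(cooked_b.split("|"))
    return 2 * shared / total


print(re.sub("ga|an|no", "", "ga|an|na"))        # ||na
print(dice_from_regex("ga|an|na", "ga|an|no"))   # about 0.667
```

Two of the three 2-grams are removed, so the string shrinks from 8 to 4 characters and the coefficient is 2 * 2 / (3 + 3).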

But once I apply this to a table, I get:

REGEXP_REPLACE second argument must be const and non-null

Is there a workaround for that? With plain REPLACE(), the second argument can be a field.

Maybe there is a better way to do it? I know I could use a UDF instead, but I wanted to avoid that here. We are running big jobs, and UDFs are generally slower (at least in my experience) and subject to a different concurrency limit.

Solution

REGEXP_REPLACE second argument must be const and non-null
Is there any workaround for that?

Below is just one idea for addressing the question above, applied to the logic you described:

I would "cook" these two strings upfront into 'ga|an|na' and 'ga|an|no' (lists of 2-grams) and do this: REGEXP_REPLACE('ga|an|na', 'ga|an|no', ''). Then based on change in length I can calculate my coeff.

The "workaround" is:

SELECT a.w AS w1, b.w AS w2, SUM(a.x = b.x) / COUNT(1) AS c
FROM (
  SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
  FROM 
    (SELECT 'gana' AS w, 'ga|an|na' AS p)
) AS a
JOIN (
  SELECT w, SPLIT(p, '|') AS x, ROW_NUMBER() OVER(PARTITION BY w) AS pos
  FROM 
    (SELECT 'gano' AS w, 'ga|an|no' AS p),
    (SELECT 'gamo' AS w, 'ga|am|mo' AS p),
    (SELECT 'kana' AS w, 'ka|an|na' AS p)
) AS b
ON a.pos = b.pos
GROUP BY w1, w2  
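
As a way to sanity-check what the query above returns, here is a small Python sketch of the same position-by-position comparison. The function name `positional_match` is hypothetical. Note that this compares 2-grams at matching positions, which is not the same as a set-based Dice coefficient when the words differ in length or are shifted.

```python
def positional_match(cooked_a: str, cooked_b: str) -> float:
    """Fraction of positions at which two '|'-separated 2-gram lists agree."""
    grams_a = cooked_a.split("|")
    grams_b = cooked_b.split("|")
    # zip keeps only positions present in both lists, like the join on pos.
    pairs = list(zip(grams_a, grams_b))
    return sum(x == y for x, y in pairs) / len(pairs)


for word, cooked in [("gano", "ga|an|no"), ("gamo", "ga|am|mo"), ("kana", "ka|an|na")]:
    print("gana", word, positional_match("ga|an|na", cooked))
```

For the sample rows this gives 2/3 for gano, 1/3 for gamo, and 2/3 for kana.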

Maybe there is a better way to do it?

Below is a simple example of how pair similarity can be computed here (including building the sets of bigrams and calculating the coefficient):

SELECT
  a.word AS word1, b.word AS word2, 
  2 * SUM(a.bigram = b.bigram) / 
    (EXACT_COUNT_DISTINCT(a.bigram) + EXACT_COUNT_DISTINCT(b.bigram) ) AS c
FROM (
  SELECT word, char + next_char AS bigram
  FROM (
    SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
    FROM (
      SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
      FROM
        (SELECT 'gana' AS word)
    )
  )
  WHERE next_char IS NOT NULL
  GROUP BY 1, 2
) a
CROSS JOIN (
  SELECT word, char + next_char AS bigram
  FROM (
    SELECT word, char, LEAD(char, 1) OVER(PARTITION BY word ORDER BY pos) AS next_char
    FROM (
      SELECT word, SPLIT(word, '') AS char, ROW_NUMBER() OVER(PARTITION BY word) AS pos
      FROM
        (SELECT 'gano' AS word)
    )
  )
  WHERE next_char IS NOT NULL
  GROUP BY 1, 2
) b
GROUP BY 1, 2
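
For cross-checking the query's output on a handful of words, the same calculation can be sketched in Python. The helper names `bigrams` and `dice` are hypothetical, and returning 0.0 when neither word has a bigram is an assumption, since the query does not cover that case.

```python
def bigrams(word: str) -> set:
    """Distinct 2-character substrings of word, e.g. 'gana' -> {'ga', 'an', 'na'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(word1: str, word2: str) -> float:
    """Dice's coefficient: 2 * |A & B| / (|A| + |B|) over distinct bigram sets."""
    a, b = bigrams(word1), bigrams(word2)
    if not a and not b:
        return 0.0  # assumption: words shorter than two characters score 0
    return 2 * len(a & b) / (len(a) + len(b))


print(dice("gana", "gano"))  # about 0.667
```

Using sets mirrors the GROUP BY 1, 2 deduplication and the distinct counts in the query: "gana" and "gano" share two of their three bigrams each, giving 2 * 2 / (3 + 3).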
