Is there a way to measure string similarity in Google BigQuery?
Question
I'm wondering if anyone knows of a way to measure string similarity in BigQuery.
It seems like it would be a neat function to have.
In my case, I need to compare the similarity of two URLs, because I want to be fairly sure they refer to the same article.
I can find examples using JavaScript, so maybe a UDF is the way to go, but I've not used UDFs at all (or JavaScript, for that matter :) ).
I'm just wondering whether there is a way to do this with the existing regex functions, or whether anyone could get me started with porting the JavaScript example into a UDF.
Any help is much appreciated, thanks.
Adding some example code:
So if I have a UDF defined as:
// distance function
function levenshteinDistance (row, emit) {
//if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
if (typeof row.inputA === 'undefined') {var myresult = 1};
if (typeof row.inputB === 'undefined') {var myresult = 1};
//if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};
var myresult = Math.min(
levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
) + 1;
emit({outputA: myresult})
}
bigquery.defineFunction(
'levenshteinDistance', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
levenshteinDistance // Reference to JavaScript UDF
);
// make a test function to test individual parts
function test(row, emit) {
if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
emit({outputA: x});
}
bigquery.defineFunction(
'test', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
test // Reference to JavaScript UDF
);
And I try to test it with a query such as:
SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))
I get this error:
Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39 Error Location: User-defined function
It seems that row.inputA may not be a string, or that for some reason the string functions cannot work on it. I'm not sure whether this is a type issue or something odd about which utilities a UDF can use by default.
Again, any help is much appreciated, thanks.
Recommended answer
Ready-to-use shared UDFs - Levenshtein distance:
SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
, fhoffa.x.levenshtein('googgle', 'goggles')
, fhoffa.x.levenshtein('is this the', 'Is This The')
6 2 0
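For readers who want to write their own function rather than call the shared one, here is a minimal sketch of an iterative Levenshtein distance in plain JavaScript. It is not the implementation behind fhoffa.x.levenshtein; the function name and structure are assumptions made for illustration. It also sidesteps the error in the question: there, the recursive calls pass strings where the function expects a row object, so row.inputA is undefined on the second call and substr fails.

```javascript
// Hypothetical helper, not the shared UDF's actual source.
// Computes the Levenshtein distance between two strings with a
// two-row dynamic-programming table instead of recursion.
function levenshtein(a, b) {
  // prev[j] holds the distance between the first i-1 chars of a
  // and the first j chars of b; row 0 is simply 0..b.length.
  let prev = [];
  for (let j = 0; j <= b.length; j++) {
    prev.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // distance from the first i chars of a to ''
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // delete a character from a
        curr[j - 1] + 1,    // insert a character into a
        prev[j - 1] + cost  // substitute (or keep if equal)
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshtein('abc', 'abd'));         // 1
console.log(levenshtein('googgle', 'goggles')); // 2
```

This sketch is case-sensitive, whereas the output of 0 for 'is this the' versus 'Is This The' suggests the shared UDF ignores case; lowercasing both inputs first would give the same behavior. In standard SQL, a body like this can be wrapped in a JavaScript UDF with CREATE TEMP FUNCTION ... LANGUAGE js.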
Soundex:
SELECT fhoffa.x.soundex('felipe')
, fhoffa.x.soundex('googgle')
, fhoffa.x.soundex('guugle')
F410 G240 G240
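To show what a Soundex code encodes, here is a minimal sketch of the classic American Soundex rules in plain JavaScript. It is an illustration only and not necessarily how fhoffa.x.soundex is implemented; the function name is an assumption.

```javascript
// Hypothetical helper illustrating classic American Soundex.
// Keeps the first letter, maps following consonants to digits,
// collapses repeated codes, and pads or truncates to 4 characters.
function soundex(word) {
  const codes = {
    b: 1, f: 1, p: 1, v: 1,
    c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
    d: 3, t: 3,
    l: 4,
    m: 5, n: 5,
    r: 6
  };
  const letters = word.toLowerCase().replace(/[^a-z]/g, '');
  if (letters.length === 0) {
    return '';
  }
  let result = letters[0].toUpperCase();
  let prev = codes[letters[0]] || 0; // code of the previous letter
  for (let i = 1; i < letters.length && result.length < 4; i++) {
    const ch = letters[i];
    if (ch === 'h' || ch === 'w') {
      continue; // h and w do not separate letters with equal codes
    }
    const code = codes[ch] || 0; // vowels and y have no code
    if (code !== 0 && code !== prev) {
      result += code;
    }
    prev = code; // a vowel resets prev, so a repeated code counts again
  }
  return result.padEnd(4, '0');
}

console.log(soundex('felipe'));  // F410
console.log(soundex('googgle')); // G240
console.log(soundex('guugle'));  // G240
```

Because 'googgle' and 'guugle' differ only in vowels, they map to the same code, which is what makes Soundex useful for matching spellings that sound alike.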
Fuzzy choose one:
SELECT fhoffa.x.fuzzy_extract_one('jony'
, (SELECT ARRAY_AGG(name)
FROM `fh-bigquery.popular_names.gender_probabilities`)
#, ['john', 'johnny', 'jonathan', 'jonas']
)
johnny
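The idea of picking the closest candidate can be sketched in plain JavaScript as follows. This is an assumption-laden illustration: it scores each candidate by edit distance normalized by the longer string's length, which may differ from how fhoffa.x.fuzzy_extract_one scores candidates. The names editDistance, similarity, and extractOne are hypothetical.

```javascript
// Hypothetical helpers; the scoring rule is an assumption, not the
// shared UDF's documented behavior.

// Levenshtein distance using a two-row dynamic-programming table.
function editDistance(a, b) {
  let prev = [];
  for (let j = 0; j <= b.length; j++) {
    prev.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]: 1 means identical, 0 means nothing in common.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) {
    return 1; // two empty strings are identical
  }
  return 1 - editDistance(a, b) / longest;
}

// Returns the candidate with the highest similarity to the query,
// comparing case-insensitively; the first best candidate wins ties.
function extractOne(query, choices) {
  let best = null;
  let bestScore = -1;
  for (const choice of choices) {
    const score = similarity(query.toLowerCase(), choice.toLowerCase());
    if (score > bestScore) {
      bestScore = score;
      best = choice;
    }
  }
  return best;
}

console.log(extractOne('jony', ['john', 'johnny', 'jonathan', 'jonas'])); // johnny
```

With this scoring, 'john', 'johnny', and 'jonas' are all two edits away from 'jony', and normalizing by length is what breaks the tie in favor of 'johnny'.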
How-to: