SQL(大查询)文本相似性 [英] SQL (Big Query) Text Similarity
问题描述
我要执行以下操作: 我得到的数据也许是按列的,也许只是一个像这样的方案的字符串:
what I want to do ist the following: I get data maybe in columns maybe just one string with a scheme like that:
> 420-xyz-somefancytext-12.3.2018-etc...
> 4-20-xyz-somefancytext-12.3.2018-etc...
> 4-250-xyz-somefancyothertext-13.3.2018-etc...
> 4-230-xyz-somefancyothertext-14.3.2018-etc...
用例想检测前两行.因为第一个数字与文本和日期当然非常相似.我想到的是诸如Edit或余弦距离之类的东西来衡量这种相似性.
The use case wants to detect the the first two rows. Because the first number is very similar and the text and of course the date. What comes to my mind is something like the edit or cosine distance to measure this similarity.
我还在BigQuery中实现了一个非常简单的UDF:
I also implemented a very easy UDF in BigQuery:
CREATE TEMPORARY FUNCTION similariry(Name1 STRING, Name2 STRING)
RETURNS FLOAT64
LANGUAGE js AS """
var _extend = function(dst) {
var sources = Array.prototype.slice.call(arguments, 1);
for (var i=0; i<sources.length; ++i) {
var src = sources[i];
for (var p in src) {
if (src.hasOwnProperty(p)) dst[p] = src[p];
}
}
return dst;
};
var Levenshtein = {
/**
* Calculate levenshtein distance of the two strings.
*
* @param str1 String the first string.
* @param str2 String the second string.
* @return Integer the levenshtein distance (0 and above).
*/
get: function(str1, str2) {
// base cases
if (str1 === str2) return 0;
if (str1.length === 0) return str2.length;
if (str2.length === 0) return str1.length;
// two rows
var prevRow = new Array(str2.length + 1),
curCol, nextCol, i, j, tmp;
// initialise previous row
for (i=0; i<prevRow.length; ++i) {
prevRow[i] = i;
}
// calculate current row distance from previous row
for (i=0; i<str1.length; ++i) {
nextCol = i + 1;
for (j=0; j<str2.length; ++j) {
curCol = nextCol;
// substution
nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
// insertion
tmp = curCol + 1;
if (nextCol > tmp) {
nextCol = tmp;
}
// deletion
tmp = prevRow[j + 1] + 1;
if (nextCol > tmp) {
nextCol = tmp;
}
// copy current col value into previous (in preparation for next iteration)
prevRow[j] = curCol;
}
// copy last col value into previous (in preparation for next iteration)
prevRow[j] = nextCol;
}
return nextCol;
}
};
var the_Name1;
try {
the_Name1 = decodeURI(Name1).toLowerCase();
} catch (ex) {
the_Name1 = Name1.toLowerCase();
}
try {
the_Name2 = decodeURI(Name2).toLowerCase();
} catch (ex) {
the_Name2 = Name2.toLowerCase();
}
return 1 - Levenshtein.get(the_Name1, the_Name2) / the_Name1.length;
""";
WITH strings AS (
SELECT NULL string1, NULL string2 UNION ALL
SELECT 'test' string1, NULL string2 UNION ALL
SELECT NULL string1, 'test' string2 UNION ALL
SELECT 'CRATE' string1, 'TRACE' string2 UNION ALL
SELECT 'MARTHA' string1, 'MARHTA' string2 UNION ALL
SELECT 'DWAYNE' string1, 'DUANE' string2 UNION ALL
SELECT 'DIXON' string1, 'DICKSONX' string2 UNION ALL
SELECT 'Dunningham' string1, 'Cunningham' string2 UNION ALL
SELECT 'Abroms' string1, 'Abrams' string2 UNION ALL
SELECT 'Lampley' string1, 'Campley' string2 UNION ALL
SELECT 'Jonathon' string1, 'Jonathan' string2 UNION ALL
SELECT 'Jeraldine' string1, 'Gerladine' string2 UNION ALL
SELECT 'test' string1, 'blank' string2 UNION ALL
SELECT 'everybody' string1, 'every' string2 UNION ALL
SELECT 'a' string1, 'aaa' string2 UNION ALL
SELECT 'Géraldine' string1, 'Gerladine' string2 UNION ALL
SELECT 'Jérôme' string1, 'Jerome' string2 UNION ALL
SELECT 'ça' string1, 'ca' string2 UNION ALL
SELECT 'Üwe' string1, 'Uwe' string2
)
SELECT string1, string2, similariry(string1, string2) my_sim
FROM strings
ORDER BY my_sim DESC
它测量两列的相似性.但是我需要的是一种可以测量行相似度的算法.因此,这意味着我必须每隔一行检查一次.我不确定如何做以及如何以最有效的方式做.最后,应生成具有高度相似性的行表.
It measures the similarity of the two columns. But what I need is a algorithm which measures the similarity of the rows. So this means I have to check every row with every other row. I not sure how to do it and how do it the most efficient way. In the end a table of the rows with a high similarity should be generated.
推荐答案
快速概述
第1步-将表中的所有列值连接为一列
Quick outline
Step 1 - concatenate all columns values in the table into one column
例如,在行
SELECT 'Abroms' string1, 'Abrams' string2 UNION ALL
SELECT 'Lampley' string1, 'Campley' string2
应成为:
SELECT 'AbromsAbrams' cols UNION ALL
SELECT 'LampleyCampley'
值连接的逻辑可能与上面不同-但这只是为了演示方法
The logic of value concatenation can differ from above - but this is just to demonstrate approach
第2步-交叉联接表并应用所需的任何相似性函数,因此现在将整个行视为一列,然后将其与其余行进行明显比较
Step 2 - cross join table and apply whatever similarity function you want, so now you treat the whole row as a one column and obviously compare it with the rest of the rows
详细信息:
进行假设(为简单起见):没有重复的字段也没有结构-只是原始数据类型
Details:
Assumptions made (for simplicity sake) : no repeated fields and no structs - just primitive data types
我将在CTE下面使用字符串表
and I will be using below CTE for strings table,
WITH strings AS (
SELECT NULL string1, NULL string2 UNION ALL
SELECT 'test' string1, NULL string2 UNION ALL
SELECT NULL string1, 'test' string2 UNION ALL
SELECT 'CRATE' string1, 'TRACE' string2 UNION ALL
SELECT 'MARTHA' string1, 'MARHTA' string2 UNION ALL
SELECT 'DWAYNE' string1, 'DUANE' string2 UNION ALL
SELECT 'DIXON' string1, 'DICKSONX' string2 UNION ALL
SELECT 'Dunningham' string1, 'Cunningham' string2 UNION ALL
SELECT 'Abroms' string1, 'Abrams' string2 UNION ALL
SELECT 'Lampley' string1, 'Campley' string2 UNION ALL
SELECT 'Jonathon' string1, 'Jonathan' string2 UNION ALL
SELECT 'Jeraldine' string1, 'Gerladine' string2 UNION ALL
SELECT 'test' string1, 'blank' string2 UNION ALL
SELECT 'everybody' string1, 'every' string2 UNION ALL
SELECT 'a' string1, 'aaa' string2 UNION ALL
SELECT 'Géraldine' string1, 'Gerladine' string2 UNION ALL
SELECT 'Jérôme' string1, 'Jerome' string2 UNION ALL
SELECT 'ça' string1, 'ca' string2 UNION ALL
SELECT 'Üwe' string1, 'Uwe' string2
)
因此将在其余代码中将其省略
so will omit it from rest of the code
步骤1A -构建CTE以提取所有列名称并将其连接起来,这样我们就可以使用它们从结果列中清除
Step 1A - build CTE to extract all columns names and concatenate them such that then we can use them to purge from resulted column
#standardSQL
WITH columns AS (
SELECT STRING_AGG(CONCAT('"', col, '":'), '|') cols FROM (
SELECT
REPLACE(SPLIT(pair, '":')[OFFSET(0)], '"', '') col
FROM (
SELECT SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}]', ''), ',"') pairs
FROM strings t
LIMIT 1
), UNNEST(pairs) pair
)
)
SELECT *
FROM columns
结果是
Row cols
1 "string1":|"string2":
我们很快将需要
步骤1B -让我们将原始表转换为只有一列的表,该列由行中的所有值组成
Step 1B - Let's transform original table into table with just one column which consists of all values in the row
#standardSQL
CREATE TEMPORARY FUNCTION concatenate_row(row STRING, columns STRING) AS ((
REGEXP_REPLACE(REGEXP_REPLACE(row, columns, ''), '{"|"}|","', '')
));
WITH columns AS (
SELECT STRING_AGG(CONCAT('"', col, '":'), '|') cols FROM (
SELECT
REPLACE(SPLIT(pair, '":')[OFFSET(0)], '"', '') col
FROM (
SELECT SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}]', ''), ',"') pairs
FROM strings t
LIMIT 1
), UNNEST(pairs) pair
)
), lines AS (
SELECT
TO_JSON_STRING(t) original_row,
concatenate_row(TO_JSON_STRING(t), cols) pure_values
FROM strings t
CROSS JOIN columns
)
SELECT *
FROM lines
具有结果(仅显示几行...)
with result (just showing few rows ... )
Row original_row pure_values
1 {"string1":"Dunningham","string2":"Cunningham"} DunninghamCunningham
2 {"string1":"Jeraldine","string2":"Gerladine"} JeraldineGerladine
3 {"string1":"Géraldine","string2":"Gerladine"} GéraldineGerladine
4 {"string1":"Jonathon","string2":"Jonathan"} JonathonJonathan
5 {"string1":"everybody","string2":"every"} everybodyevery
最后,第2步-交叉加入并计算相似度
Finally, Step 2 - CROSS JOIN and calculating similarity
#standardSQL
SELECT
similarity(s1.pure_values, s2.pure_values) my_sim,
s1.pure_values s1,
s2.pure_values s2
FROM lines s1
CROSS JOIN lines s2
WHERE s1.pure_values < s2.pure_values
ORDER BY my_sim DESC
有结果(只显示几行...)
with result (just show few rows ...)
Row my_sim s1 s2
1 0.8888888888888888 GéraldineGerladine JeraldineGerladine
2 0.5454545454545454 test",null} {null,null}
3 0.5454545454545454 {null,"test {null,null}
4 0.5 aaaa çaca
5 0.36363636363636365 test",null} testblank
6 0.36363636363636365 DWAYNEDUANE ÜweUwe
7 0.33333333333333337 JeraldineGerladine JérômeJerome
. . .
注意:这只是您可以选择的方向,如果选择-仍有很大的改进,抛光之类的空间.
Note: this is just possible direction for yo to go and if chosen - plenty of room for improvements, polishing, etc.
因此,如果将所有内容放在一起-您将获得以下内容:
So, if put all together - below is what you get:
#standardSQL
CREATE TEMPORARY FUNCTION concatenate_row(row STRING, columns STRING) AS (
(
REGEXP_REPLACE(REGEXP_REPLACE(row, columns, ''), '{"|"}|","', '')
)
);
CREATE TEMPORARY FUNCTION similarity(Name1 STRING, Name2 STRING)
RETURNS FLOAT64
LANGUAGE js AS """
var _extend = function(dst) {
var sources = Array.prototype.slice.call(arguments, 1);
for (var i=0; i<sources.length; ++i) {
var src = sources[i];
for (var p in src) {
if (src.hasOwnProperty(p)) dst[p] = src[p];
}
}
return dst;
};
var Levenshtein = {
/**
* Calculate levenshtein distance of the two strings.
*
* @param str1 String the first string.
* @param str2 String the second string.
* @return Integer the levenshtein distance (0 and above).
*/
get: function(str1, str2) {
// base cases
if (str1 === str2) return 0;
if (str1.length === 0) return str2.length;
if (str2.length === 0) return str1.length;
// two rows
var prevRow = new Array(str2.length + 1),
curCol, nextCol, i, j, tmp;
// initialise previous row
for (i=0; i<prevRow.length; ++i) {
prevRow[i] = i;
}
// calculate current row distance from previous row
for (i=0; i<str1.length; ++i) {
nextCol = i + 1;
for (j=0; j<str2.length; ++j) {
curCol = nextCol;
// substution
nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
// insertion
tmp = curCol + 1;
if (nextCol > tmp) {
nextCol = tmp;
}
// deletion
tmp = prevRow[j + 1] + 1;
if (nextCol > tmp) {
nextCol = tmp;
}
// copy current col value into previous (in preparation for next iteration)
prevRow[j] = curCol;
}
// copy last col value into previous (in preparation for next iteration)
prevRow[j] = nextCol;
}
return nextCol;
}
};
var the_Name1;
try {
the_Name1 = decodeURI(Name1).toLowerCase();
} catch (ex) {
the_Name1 = Name1.toLowerCase();
}
try {
the_Name2 = decodeURI(Name2).toLowerCase();
} catch (ex) {
the_Name2 = Name2.toLowerCase();
}
return 1 - Levenshtein.get(the_Name1, the_Name2) / the_Name1.length;
""";
WITH strings AS (
SELECT NULL string1, NULL string2 UNION ALL
SELECT 'test' string1, NULL string2 UNION ALL
SELECT NULL string1, 'test' string2 UNION ALL
SELECT 'CRATE' string1, 'TRACE' string2 UNION ALL
SELECT 'MARTHA' string1, 'MARHTA' string2 UNION ALL
SELECT 'DWAYNE' string1, 'DUANE' string2 UNION ALL
SELECT 'DIXON' string1, 'DICKSONX' string2 UNION ALL
SELECT 'Dunningham' string1, 'Cunningham' string2 UNION ALL
SELECT 'Abroms' string1, 'Abrams' string2 UNION ALL
SELECT 'Lampley' string1, 'Campley' string2 UNION ALL
SELECT 'Jonathon' string1, 'Jonathan' string2 UNION ALL
SELECT 'Jeraldine' string1, 'Gerladine' string2 UNION ALL
SELECT 'test' string1, 'blank' string2 UNION ALL
SELECT 'everybody' string1, 'every' string2 UNION ALL
SELECT 'a' string1, 'aaa' string2 UNION ALL
SELECT 'Géraldine' string1, 'Gerladine' string2 UNION ALL
SELECT 'Jérôme' string1, 'Jerome' string2 UNION ALL
SELECT 'ça' string1, 'ca' string2 UNION ALL
SELECT 'Üwe' string1, 'Uwe' string2
), columns AS (
SELECT STRING_AGG(CONCAT('"', col, '":'), '|') cols FROM (
SELECT
REPLACE(SPLIT(pair, '":')[OFFSET(0)], '"', '') col
FROM (
SELECT SPLIT(REGEXP_REPLACE(TO_JSON_STRING(t), r'[{}]', ''), ',"') pairs
FROM strings t
LIMIT 1
), UNNEST(pairs) pair
)
), lines AS (
SELECT
TO_JSON_STRING(t) original_row,
concatenate_row(TO_JSON_STRING(t), cols) pure_values
FROM strings t
CROSS JOIN columns
)
SELECT
similarity(s1.pure_values, s2.pure_values) my_sim,
s1.pure_values s1,
s2.pure_values s2
FROM lines s1
CROSS JOIN lines s2
WHERE s1.pure_values < s2.pure_values
ORDER BY my_sim DESC
这篇关于SQL(大查询)文本相似性的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!