Matching partial words in two different columns
I am trying to weed out a certain kind of customer from our database. I've noticed a trend where people fill in their first name with a partial copy of what they entered as their company name. An example looks like this:
business_name            first_name
-------------            ----------
locksmith taylorsville   locksmith
locksmith roy            locksmi
locksmith clinton        locks
locksmith farmington     locksmith
These are people I do not want pulled into a query. They are bad eggs. I'm trying to put together a query, presumably with a WHERE clause, that isolates anyone whose first name is at least a partial match to their business name, but I'm stumped and could use some help.
You can use a similarity-based approach.
Try the code at the bottom of this answer.
It produces a result like the one below:
business_name            partial_business_name   first_name   similarity
locksmith taylorsville   locksmith               locksmith    1.0
locksmith farmington     locksmith               locksmith    1.0
locksmith roy            locksmith               locksmi      0.7777777777777778
locksmith clinton        locksmith               locks        0.5555555555555556
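So you can control what to filter out based on the similarity value. It is computed as 1 minus the Levenshtein distance between first_name and the first word of business_name, divided by the length of that first word, so 1.0 means the two are identical.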
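To make the similarity values above easier to check, here is a minimal standalone JavaScript sketch (runnable in Node.js, outside BigQuery) that recomputes them and applies a cutoff. It is not part of the original answer: the helper names `firstWord` and `similarity`, the `THRESHOLD` value of 0.5, the zero-length guard, and the extra `acme plumbing` row are assumptions added purely for illustration.

```javascript
// Two-row Levenshtein distance, the same measure the UDF uses.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 0; i < a.length; i++) {
    const cur = [i + 1];
    for (let j = 0; j < b.length; j++) {
      const cost = a.charAt(i) === b.charAt(j) ? 0 : 1;
      // substitution, insertion, deletion
      cur.push(Math.min(prev[j] + cost, cur[j] + 1, prev[j + 1] + 1));
    }
    prev = cur;
  }
  return prev[b.length];
}

// Mirrors REGEXP_EXTRACT(business_name, r'^(\w+)'): the leading word, or '' if none.
function firstWord(s) {
  const m = s.match(/^\w+/);
  return m ? m[0] : '';
}

// similarity = 1 - distance / length of the first word of business_name.
// Returning 0 for an empty first word is an assumption, to avoid dividing by zero.
function similarity(businessName, firstName) {
  const p = firstWord(businessName).toLowerCase();
  const f = firstName.toLowerCase();
  if (p.length === 0) return 0;
  return 1 - levenshtein(p, f) / p.length;
}

const THRESHOLD = 0.5; // hypothetical cutoff, tune it against real data

const rows = [
  { business_name: 'locksmith taylorsville', first_name: 'locksmith' },
  { business_name: 'locksmith roy', first_name: 'locksmi' },
  { business_name: 'locksmith clinton', first_name: 'locks' },
  { business_name: 'locksmith farmington', first_name: 'locksmith' },
  { business_name: 'acme plumbing', first_name: 'john' }, // made-up ordinary customer
];

// Rows at or above the cutoff are the suspicious ones to exclude.
const flagged = rows.filter(
  (r) => similarity(r.business_name, r.first_name) >= THRESHOLD
);
console.log(flagged.map((r) => r.business_name));
```

With this cutoff the four locksmith rows are flagged and the made-up row is not. In the BigQuery query itself, the equivalent step would be a filter on the similarity column in the outer SELECT.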
**Code**
SELECT business_name, partial_business_name, first_name, similarity FROM
JS( // input table
(
  SELECT business_name, REGEXP_EXTRACT(business_name, r'^(\w+)') AS partial_business_name, first_name AS first_name FROM
    (SELECT 'locksmith taylorsville' AS business_name, 'locksmith' AS first_name),
    (SELECT 'locksmith roy' AS business_name, 'locksmi' AS first_name),
    (SELECT 'locksmith clinton' AS business_name, 'locks' AS first_name),
    (SELECT 'locksmith farmington' AS business_name, 'locksmith' AS first_name)
),
// input columns
business_name, partial_business_name, first_name,
// output schema
"[{name: 'business_name', type:'string'},
  {name: 'partial_business_name', type:'string'},
  {name: 'first_name', type:'string'},
  {name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {
  // generic object-merge helper (not used below)
  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };
  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;
      // two rows
      var prevRow = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;
      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }
      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;
        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;
          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }
        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }
      return nextCol;
    }
  };
  // lower-case both values; fall back to the raw value if decodeURI throws
  var the_partial_business_name, the_first_name;
  try {
    the_partial_business_name = decodeURI(r.partial_business_name).toLowerCase();
  } catch (ex) {
    the_partial_business_name = r.partial_business_name.toLowerCase();
  }
  try {
    the_first_name = decodeURI(r.first_name).toLowerCase();
  } catch (ex) {
    the_first_name = r.first_name.toLowerCase();
  }
  // similarity = 1 - distance / length of the first word of business_name
  emit({business_name: r.business_name, partial_business_name: the_partial_business_name, first_name: the_first_name,
        similarity: 1 - Levenshtein.get(the_partial_business_name, the_first_name) / the_partial_business_name.length});
}"
)
ORDER BY similarity DESC
This approach was used in "How to perform trigram operations in Google BigQuery?" and is based on https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js by @thomaspark, where Levenshtein distance is used to measure similarity.