Matching partial words in two different columns

Question

I am trying to weed out a certain kind of customer from our database. I've noticed a trend where people fill in their first name with part of whatever they entered as their company name. An example looks like this:

business_name               first_name
-------------               ----------
locksmith taylorsville      locksmith
locksmith roy               locksmi
locksmith clinton           locks
locksmith farmington        locksmith

These are people I do not want pulled into a query. They are bad eggs. I'm trying to put together a query with a WHERE clause (presumably) that isolates anyone whose first name is at least a partial match to their business name, but I'm stumped and could use some help.

Solution

You can use a similarity-based approach.
Try the code at the bottom of this answer.
It produces a result like the one below:

business_name           partial_business_name   first_name  similarity   
locksmith taylorsville  locksmith               locksmith   1.0  
locksmith farmington    locksmith               locksmith   1.0  
locksmith roy           locksmith               locksmi     0.7777777777777778   
locksmith clinton       locksmith               locks       0.5555555555555556   

So you can control what to filter out based on the similarity value.
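To see where those similarity values come from and how a cutoff might be applied, here is a minimal standalone JavaScript sketch. It is not the BigQuery query itself. The helper names (`levenshtein`, `similarity`), the extra `acme plumbing` row, and the `0.5` threshold are assumptions made up for illustration; the score follows the same formula as the answer, 1 minus the edit distance divided by the length of the first word of the business name.

```javascript
// Standalone sketch in plain JavaScript; helper names are hypothetical.

// Levenshtein distance, keeping only the previous and current rows.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  var prev = [];
  for (var j = 0; j <= b.length; j++) prev[j] = j;
  for (var i = 0; i < a.length; i++) {
    var cur = [i + 1];
    for (var k = 0; k < b.length; k++) {
      var cost = a.charAt(i) === b.charAt(k) ? 0 : 1;
      // cheapest of substitution, insertion, deletion
      cur[k + 1] = Math.min(prev[k] + cost, cur[k] + 1, prev[k + 1] + 1);
    }
    prev = cur;
  }
  return prev[b.length];
}

// Similarity between the first word of the business name and the first name.
function similarity(businessName, firstName) {
  var match = /^(\w+)/.exec(businessName);
  if (!match) return 0; // no leading word to compare against
  var firstWord = match[1].toLowerCase();
  return 1 - levenshtein(firstWord, firstName.toLowerCase()) / firstWord.length;
}

var rows = [
  { business_name: 'locksmith taylorsville', first_name: 'locksmith' },
  { business_name: 'locksmith roy', first_name: 'locksmi' },
  { business_name: 'locksmith clinton', first_name: 'locks' },
  { business_name: 'locksmith farmington', first_name: 'locksmith' },
  { business_name: 'acme plumbing', first_name: 'john' } // made-up "good" row
];

var THRESHOLD = 0.5; // assumed cutoff; tune it against your own data

// Rows at or above the threshold are treated as suspicious.
var suspicious = rows.filter(function (r) {
  return similarity(r.business_name, r.first_name) >= THRESHOLD;
});
// Everything else is kept.
var kept = rows.filter(function (r) {
  return similarity(r.business_name, r.first_name) < THRESHOLD;
});

console.log(suspicious.length, kept.length);
```

In the query itself, the equivalent step would be a filter on the `similarity` column of the outer SELECT. Which cutoff works best depends on the data.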

**Code**

SELECT business_name, partial_business_name, first_name, similarity FROM 
JS( // input table
(
  SELECT business_name, REGEXP_EXTRACT(business_name, r'^(\w+)') AS partial_business_name, first_name AS first_name FROM 
    (SELECT 'locksmith taylorsville' AS business_name, 'locksmith' AS first_name),
    (SELECT 'locksmith roy' AS business_name, 'locksmi' AS first_name),
    (SELECT 'locksmith clinton' AS business_name, 'locks' AS first_name),
    (SELECT 'locksmith farmington' AS business_name, 'locksmith' AS first_name)
) ,
// input columns
business_name, partial_business_name, first_name,
// output schema
"[{name: 'business_name', type:'string'},
  {name: 'partial_business_name', type:'string'},
  {name: 'first_name', type:'string'},
  {name: 'similarity', type:'float'}]
",
// function
"function(r, emit) {

  var _extend = function(dst) {
    var sources = Array.prototype.slice.call(arguments, 1);
    for (var i=0; i<sources.length; ++i) {
      var src = sources[i];
      for (var p in src) {
        if (src.hasOwnProperty(p)) dst[p] = src[p];
      }
    }
    return dst;
  };

  var Levenshtein = {
    /**
     * Calculate levenshtein distance of the two strings.
     *
     * @param str1 String the first string.
     * @param str2 String the second string.
     * @return Integer the levenshtein distance (0 and above).
     */
    get: function(str1, str2) {
      // base cases
      if (str1 === str2) return 0;
      if (str1.length === 0) return str2.length;
      if (str2.length === 0) return str1.length;

      // two rows
      var prevRow  = new Array(str2.length + 1),
          curCol, nextCol, i, j, tmp;

      // initialise previous row
      for (i=0; i<prevRow.length; ++i) {
        prevRow[i] = i;
      }

      // calculate current row distance from previous row
      for (i=0; i<str1.length; ++i) {
        nextCol = i + 1;

        for (j=0; j<str2.length; ++j) {
          curCol = nextCol;

          // substitution
          nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
          // insertion
          tmp = curCol + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }
          // deletion
          tmp = prevRow[j + 1] + 1;
          if (nextCol > tmp) {
            nextCol = tmp;
          }

          // copy current col value into previous (in preparation for next iteration)
          prevRow[j] = curCol;
        }

        // copy last col value into previous (in preparation for next iteration)
        prevRow[j] = nextCol;
      }

      return nextCol;
    }

  };

  var the_partial_business_name, the_first_name;

  try {
    the_partial_business_name = decodeURI(r.partial_business_name).toLowerCase();
  } catch (ex) {
    the_partial_business_name = r.partial_business_name.toLowerCase();
  }

  try {
    the_first_name = decodeURI(r.first_name).toLowerCase();
  } catch (ex) {
    the_first_name = r.first_name.toLowerCase();
  }

  emit({business_name: r.business_name, partial_business_name: the_partial_business_name, first_name: the_first_name,
        similarity: 1 - Levenshtein.get(the_partial_business_name, the_first_name) / the_partial_business_name.length});

  }"
)
ORDER BY similarity DESC

This was used in "How to perform trigram operations in Google BigQuery?" and is based on https://storage.googleapis.com/thomaspark-sandbox/udf-examples/pataky.js by @thomaspark, where Levenshtein distance is used to measure similarity.
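The two try/catch blocks in the function normalise each value before it is compared: they attempt `decodeURI`, fall back to the raw string if decoding fails, and lower-case the result. A small standalone sketch of that pattern follows. The helper name `normalize` is hypothetical, and it assumes the input is already a string.

```javascript
// Hypothetical helper mirroring the try/catch pattern in the UDF.
// Assumes `text` is a string.
function normalize(text) {
  try {
    // Decode percent-escapes such as %20, then lower-case.
    return decodeURI(text).toLowerCase();
  } catch (ex) {
    // decodeURI throws a URIError on malformed escapes (for example a lone '%'),
    // so fall back to the raw text.
    return text.toLowerCase();
  }
}

console.log(normalize('Lock%20Smith'));
console.log(normalize('100%'));
```

The fallback matters because a single malformed value would otherwise throw inside the function instead of producing a similarity score for that row.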
