如何在MySQL的两个不同表中选择相似的行(可能吗?) [英] How Can I SELECT Similar Rows in Two Different Tables in MySQL (Is It Possible?)

查看:59
本文介绍了如何在MySQL的两个不同表中选择相似的行(可能吗?)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两张桌子,我想抓住所有拼写相似的学校或同一所学校的所有学校.例如:

I've got two tables and I'd like to grab all of the schools where there is a similarly spelled school or the same school. For example:

my_table_a:

my_table_a:

Olde School        
New School    
Other, C.S. School   
Main School
Too Cool for School

my_table_b:

my_table_b:

Old School
New ES    
Other School 
Main School  
Hardknocks School

是否可以编写SELECT查询,以在两个表中找到拼写相似的学校.有没有办法在列上使用LIKE或通配符?

Is it possible to write a SELECT query that will find the similarly spelled schools in the two tables. Is there a way to use LIKE or wildcards on columns?

诸如:

SELECT  my_table_a.school, my_table_b.school
FROM ` my_table_a` ,  my_table_b
WHERE  my_table_a.directory_school_name_09_10 LIKE  my_table_b.school

我用我的真实表尝试了上面的语句,而我只是得到了'='会产生的结果.

I tried the above statement with my real tables and I simply got what '=' would have produced.

基本上,我想在每个表格的列中获取前四所学校. (当然,在现实世界中,我不知道前四所学校是相似的).

Basically, I want to grab the first 4 schools in each table's column. (Of course, in the real world, I won't know that the first 4 schools are similar).

我想做的甚至是可能的吗?

Is what I'm trying to do even possible?

推荐答案

对于 Levenshtein的UDF实现距离算法,您可能需要查看" codejanitor.com:Levenshtein距离作为MySQL存储函数:

For a UDF implementation of the Levenshtein Distance algorithm you may want to check out "codejanitor.com: Levenshtein Distance as a MySQL Stored Function":

CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
  DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
  DECLARE s1_char CHAR;
  DECLARE cv0, cv1 VARBINARY(256);
  SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
  IF s1 = s2 THEN
    RETURN 0;
  ELSEIF s1_len = 0 THEN
    RETURN s2_len;
  ELSEIF s2_len = 0 THEN
    RETURN s1_len;
  ELSE
    WHILE j <= s2_len DO
      SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
    END WHILE;
    WHILE i <= s1_len DO
      SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
      WHILE j <= s2_len DO
        SET c = c + 1;
        IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
      END WHILE;
      SET cv1 = cv0, i = i + 1;
    END WHILE;
  END IF;
  RETURN c;
END

现在,让我们使用问题中提供的数据来构建测试用例:

Now let's build a test case, using the data you provided in your question:

CREATE TABLE table_a (name varchar(20));
CREATE TABLE table_b (name varchar(20));

INSERT INTO table_a VALUES('Olde School');      
INSERT INTO table_a VALUES('New School');
INSERT INTO table_a VALUES('Other, C.S. School');
INSERT INTO table_a VALUES('Main School');
INSERT INTO table_a VALUES('Too Cool for School');

INSERT INTO table_b VALUES('Old School');
INSERT INTO table_b VALUES('New ES');
INSERT INTO table_b VALUES('Other School');
INSERT INTO table_b VALUES('Main School');
INSERT INTO table_b VALUES('Hardknocks School');

然后:

SELECT     *
FROM       table_a a
LEFT JOIN  table_b b ON (a.name = b.name);

显然返回一个与学校名称完全匹配的匹配项:

Obviously returns a match where the school names match exactly:

+---------------------+-------------+
| name                | name        |
+---------------------+-------------+
| Olde School         | NULL        |
| New School          | NULL        |
| Other, C.S. School  | NULL        |
| Main School         | Main School |
| Too Cool for School | NULL        |
+---------------------+-------------+
5 rows in set (0.00 sec)

现在,我们可以尝试使用LEVENSHTEIN函数返回具有编辑距离的学校名称不超过2个字符:

Now we can try to use the LEVENSHTEIN function to return school names that have an edit distance of 2 characters or less:

SELECT     *
FROM       table_a a
LEFT JOIN  table_b b ON (LEVENSHTEIN(a.name, b.name) <= 2);

+---------------------+-------------+
| name                | name        |
+---------------------+-------------+
| Olde School         | Old School  |
| New School          | NULL        |
| Other, C.S. School  | NULL        |
| Main School         | Main School |
| Too Cool for School | NULL        |
+---------------------+-------------+
5 rows in set (0.08 sec)

现在使用<= 3作为编辑距离阈值:

Now using <= 3 as an edit distance threshold:

SELECT     *
FROM       table_a a
LEFT JOIN  table_b b ON (LEVENSHTEIN(a.name, b.name) <= 3);

我们得到以下结果:

+---------------------+--------------+
| name                | name         |
+---------------------+--------------+
| Olde School         | Old School   |
| Olde School         | Other School |
| New School          | Old School   |
| Other, C.S. School  | NULL         |
| Main School         | Main School  |
| Too Cool for School | NULL         |
+---------------------+--------------+
6 rows in set (0.06 sec)

请注意,这次Olde School也与Other School相匹配,而New School也与Old School相匹配.这些可能是假阳性,并且表明定义阈值对于避免不正确的匹配非常重要.

Note how this time Olde School also matched Other School, and New School matched Old School as well. These are probably false positive, and shows that defining the threshold is very important to avoid incorrect matches.

解决此问题的一种常用技术是在应用阈值时考虑字符串的长度.实际上,我在网站上为此实现引用的内容还提供了LEVENSHTEIN_RATIO函数,该函数根据字符串的长度返回编辑差异的比率(百分比).

One common technique to tackle this problem is to take into consideration the length of the strings when applying a threshold. In fact, the site that I cited for this implementation also provides a LEVENSHTEIN_RATIO function which returns the ratio (as a percentage) of the edit difference based on the length of the strings.

这篇关于如何在MySQL的两个不同表中选择相似的行(可能吗?)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆