如何在MySQL的两个不同表中选择相似的行(可能吗?) [英] How Can I SELECT Similar Rows in Two Different Tables in MySQL (Is It Possible?)
问题描述
我有两张桌子,我想抓住所有拼写相似的学校或同一所学校的所有学校.例如:
I've got two tables and I'd like to grab all of the schools where there is a similarly spelled school or the same school. For example:
my_table_a:
my_table_a:
Olde School
New School
Other, C.S. School
Main School
Too Cool for School
my_table_b:
my_table_b:
Old School
New ES
Other School
Main School
Hardknocks School
是否可以编写SELECT查询,以在两个表中找到拼写相似的学校.有没有办法在列上使用LIKE或通配符?
Is it possible to write a SELECT query that will find the similarly spelled schools in the two tables. Is there a way to use LIKE or wildcards on columns?
诸如:
SELECT my_table_a.school, my_table_b.school
FROM ` my_table_a` , my_table_b
WHERE my_table_a.directory_school_name_09_10 LIKE my_table_b.school
我用我的真实表尝试了上面的语句,而我只是得到了'='会产生的结果.
I tried the above statement with my real tables and I simply got what '=' would have produced.
基本上,我想在每个表格的列中获取前四所学校. (当然,在现实世界中,我不知道前四所学校是相似的).
Basically, I want to grab the first 4 schools in each table's column. (Of course, in the real world, I won't know that the first 4 schools are similar).
我想做的甚至是可能的吗?
Is what I'm trying to do even possible?
推荐答案
对于 Levenshtein的UDF实现距离算法,您可能需要查看" codejanitor.com:Levenshtein距离作为MySQL存储函数:
For a UDF implementation of the Levenshtein Distance algorithm you may want to check out "codejanitor.com: Levenshtein Distance as a MySQL Stored Function":
CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN SET c = c_temp; END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END
现在,让我们使用问题中提供的数据来构建测试用例:
Now let's build a test case, using the data you provided in your question:
CREATE TABLE table_a (name varchar(20));
CREATE TABLE table_b (name varchar(20));
INSERT INTO table_a VALUES('Olde School');
INSERT INTO table_a VALUES('New School');
INSERT INTO table_a VALUES('Other, C.S. School');
INSERT INTO table_a VALUES('Main School');
INSERT INTO table_a VALUES('Too Cool for School');
INSERT INTO table_b VALUES('Old School');
INSERT INTO table_b VALUES('New ES');
INSERT INTO table_b VALUES('Other School');
INSERT INTO table_b VALUES('Main School');
INSERT INTO table_b VALUES('Hardknocks School');
然后:
SELECT *
FROM table_a a
LEFT JOIN table_b b ON (a.name = b.name);
显然返回一个与学校名称完全匹配的匹配项:
Obviously returns a match where the school names match exactly:
+---------------------+-------------+
| name | name |
+---------------------+-------------+
| Olde School | NULL |
| New School | NULL |
| Other, C.S. School | NULL |
| Main School | Main School |
| Too Cool for School | NULL |
+---------------------+-------------+
5 rows in set (0.00 sec)
现在,我们可以尝试使用LEVENSHTEIN
函数返回具有编辑距离的学校名称不超过2个字符:
Now we can try to use the LEVENSHTEIN
function to return school names that have an edit distance of 2 characters or less:
SELECT *
FROM table_a a
LEFT JOIN table_b b ON (LEVENSHTEIN(a.name, b.name) <= 2);
+---------------------+-------------+
| name | name |
+---------------------+-------------+
| Olde School | Old School |
| New School | NULL |
| Other, C.S. School | NULL |
| Main School | Main School |
| Too Cool for School | NULL |
+---------------------+-------------+
5 rows in set (0.08 sec)
现在使用<= 3
作为编辑距离阈值:
Now using <= 3
as an edit distance threshold:
SELECT *
FROM table_a a
LEFT JOIN table_b b ON (LEVENSHTEIN(a.name, b.name) <= 3);
我们得到以下结果:
+---------------------+--------------+
| name | name |
+---------------------+--------------+
| Olde School | Old School |
| Olde School | Other School |
| New School | Old School |
| Other, C.S. School | NULL |
| Main School | Main School |
| Too Cool for School | NULL |
+---------------------+--------------+
6 rows in set (0.06 sec)
请注意,这次Olde School
也与Other School
相匹配,而New School
也与Old School
相匹配.这些可能是假阳性,并且表明定义阈值对于避免不正确的匹配非常重要.
Note how this time Olde School
also matched Other School
, and New School
matched Old School
as well. These are probably false positive, and shows that defining the threshold is very important to avoid incorrect matches.
解决此问题的一种常用技术是在应用阈值时考虑字符串的长度.实际上,我在网站上为此实现引用的内容还提供了LEVENSHTEIN_RATIO
函数,该函数根据字符串的长度返回编辑差异的比率(百分比).
One common technique to tackle this problem is to take into consideration the length of the strings when applying a threshold. In fact, the site that I cited for this implementation also provides a LEVENSHTEIN_RATIO
function which returns the ratio (as a percentage) of the edit difference based on the length of the strings.
这篇关于如何在MySQL的两个不同表中选择相似的行(可能吗?)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!