MySQL statement to find nearly duplicate data with trailing slash
Question
I have a table named 'LINK_INFO' with URLs in a field called 'URL'. The problem is that many duplicate URLs exist, because some were entered with a trailing / to get around the unique field requirement.
Example:
http://www.yahoo.com and http://www.yahoo.com/
What statement can I use to select these near duplicates, so I can delete one of each pair? Many thanks if you can help.
Recommended answer
You can just use TRIM to find all unique values:
SELECT DISTINCT TRIM(TRAILING '/' FROM url) url
FROM link_info
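The DISTINCT query above returns the cleaned-up values but does not show which URLs actually collide. A minimal sketch for listing only the near-duplicate groups follows; it assumes the table has an `id` column (as the delete statements in this answer do), and the aliases `normalized_url`, `copies` and `ids` are made up for illustration.

```sql
-- List each normalized URL that occurs more than once,
-- together with how many rows share it and their ids.
SELECT TRIM(TRAILING '/' FROM url) AS normalized_url,
       COUNT(*)                    AS copies,
       GROUP_CONCAT(id ORDER BY id) AS ids
FROM link_info
GROUP BY TRIM(TRAILING '/' FROM url)
HAVING COUNT(*) > 1;
```

Each result row is one group of near duplicates, which makes it easy to inspect them before deleting anything.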
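To delete the duplicates right away, just do a delete join. Within each group of near duplicates, it removes every row for which a row with a higher id exists, so only the row with the highest id is kept: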
DELETE li1
FROM link_info li1
JOIN link_info li2
WHERE TRIM(TRAILING '/' FROM li1.url) =
TRIM(TRAILING '/' FROM li2.url)
AND li1.id<li2.id
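To see what that delete join would remove before running it, the same join condition can be used in a plain SELECT. This is only a sketch under the same assumption of an `id` column; the aliases `doomed_id`, `doomed_url`, `newer_id` and `newer_url` are hypothetical names chosen for readability.

```sql
-- Preview: every li1 row listed here has a near duplicate with a higher id,
-- so it is a row the delete join would remove.
SELECT li1.id  AS doomed_id,
       li1.url AS doomed_url,
       li2.id  AS newer_id,
       li2.url AS newer_url
FROM link_info li1
JOIN link_info li2
  ON TRIM(TRAILING '/' FROM li1.url) = TRIM(TRAILING '/' FROM li2.url)
 AND li1.id < li2.id;
```

A row can appear more than once in this preview when a group contains three or more near duplicates; the delete still removes it only once.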
Always back up your tables before running arbitrary SQL found on the net, even mine :)
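One simple way to take such a backup inside MySQL itself is to copy the table. The sketch below assumes the name `link_info_backup` is free to use; it is a hypothetical name, not something from the question.

```sql
-- Create an empty copy with the same columns and indexes,
-- then copy all rows into it.
CREATE TABLE link_info_backup LIKE link_info;
INSERT INTO link_info_backup SELECT * FROM link_info;
```

A dump taken with your usual backup tooling works just as well; the point is only to have a copy to restore from if the delete removes more than intended.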
If your database machine is limited, you may want to do it using indexes and avoid loading more into memory than necessary:
-- remove all trailing slashes
UPDATE link_info
SET url=TRIM(TRAILING '/' FROM url);
-- create an index on the resulting strings (if there isn't already one)
CREATE INDEX url_index ON link_info(url);
-- delete all duplicates
DELETE li1
FROM link_info li1
JOIN link_info li2
WHERE li1.url = li2.url
AND li1.id<li2.id;
-- drop the index if not needed anymore
DROP INDEX url_index ON link_info;