What is the fastest procedure to remove duplicates from a big table in MySQL


Question

I have a table in MySQL (50 million rows), and new data is inserted periodically.

The table has the following structure:

CREATE TABLE `values` (
    id double NOT NULL AUTO_INCREMENT,
    channel_id int(11) NOT NULL,
    val text NOT NULL,
    date_time datetime NOT NULL,
    PRIMARY KEY (id),
    KEY channel_date_index (channel_id,date_time)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8;

No two rows may share the same channel_id and date_time, but if such an insert does occur, it is important to keep the newest value.

Is there a way to check for duplicates in real time before the insert, or should I keep inserting all data and run periodic duplicate checks in a separate cycle?

Real-time speed is important here, because 100 inserts occur per second.

Recommended answer

To prevent future duplicates:

  1. Change KEY channel_date_index (channel_id,date_time) to UNIQUE (channel_id,date_time).
  2. Change the INSERT to INSERT ... ON DUPLICATE KEY UPDATE ... so that the existing row is updated when that pair already exists.
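
The two steps above might look roughly like the following. This is a minimal sketch, not a tested migration: the index name channel_date_unique and the sample values are hypothetical, and it assumes any existing duplicates have already been removed, since adding a unique index fails while duplicate pairs are still present.

```sql
-- Step 1: swap the non-unique index for a unique one.
-- Fails with a duplicate-entry error if duplicate (channel_id, date_time)
-- pairs still exist in the table.
ALTER TABLE `values`
    DROP INDEX channel_date_index,
    ADD UNIQUE KEY channel_date_unique (channel_id, date_time);  -- hypothetical name

-- Step 2: upsert. If the (channel_id, date_time) pair already exists,
-- the stored val is overwritten with the incoming one, so the newest value wins.
INSERT INTO `values` (channel_id, val, date_time)
VALUES (42, '17.5', '2024-01-01 12:00:00')   -- sample data, for illustration only
ON DUPLICATE KEY UPDATE val = VALUES(val);
```

Because date_time is part of the unique key, the column that actually changes on a collision is val. With this in place the duplicate check happens inside the insert itself, so no separate periodic cleanup pass is needed for new data.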

To fix the existing table, you could do ALTER IGNORE TABLE ... ADD UNIQUE(...). However, that would not give you the latest timestamps. Note also that the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so this only works on older versions.
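
If the newest row of each duplicate pair has to survive, one alternative is to delete the older rows explicitly before adding the unique index. The sketch below is not from the original answer; it assumes that a larger id means a later insert, and on a 50-million-row table it may run for a long time and hold locks, so it would likely need to be tried on a copy or run in smaller batches.

```sql
-- Inspect which (channel_id, date_time) pairs are duplicated.
SELECT channel_id, date_time, COUNT(*) AS copies
FROM `values`
GROUP BY channel_id, date_time
HAVING COUNT(*) > 1;

-- Delete every row for which a newer row (larger id) with the same
-- (channel_id, date_time) exists, leaving only the most recently inserted one.
-- Assumption: id order reflects insertion order.
DELETE older
FROM `values` AS older
JOIN `values` AS newer
  ON  newer.channel_id = older.channel_id
  AND newer.date_time  = older.date_time
  AND newer.id         > older.id;
```

The existing channel_date_index on (channel_id, date_time) should help both statements, since the grouping and the join use exactly those columns.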

For minimum downtime (not maximum speed), use pt-online-schema-change.
