Improving MySQLdb load data infile performance


Question

I have a table that's roughly defined as follows in InnoDB:

create table `my_table` (
  `time` int(10) unsigned not null,
  `key1` int(10) unsigned not null,
  `key3` char(3) not null,
  `key2` char(2) not null,
  `value1` float default null,
  `value2` float default null,
  primary key (`key1`, `key2`, `key3`, `time`),
  key (`key3`, `key2`, `key1`, `time`)
) engine=InnoDB default character set ascii
partition by range(time) (
  partition start        values less than (0),
  partition from20180101 values less than (unix_timestamp('2018-02-01')),
  partition from20180201 values less than (unix_timestamp('2018-03-01')),
  ...,
  partition future       values less than maxvalue
)

Yes, the column order doesn't match the key order.

In Python I'm populating a DataFrame with 500,000 rows (this is probably not the most efficient way to do this, but serves as a sample for what the data may look like):

import random
import pandas as pd
key2_values = ["aaa", "bbb", ..., "ttt"]  # 20 distinct values
key3_values = ["aa", "ab", "ac", ..., "az", "bb", "bc", ..., "by"]  # 50 distinct values
df = pd.DataFrame([], columns=["key1", "key2", "key3", "value2", "value1"])
idx = 0
for x in range(0, 500):
    for y in range(0, 20):
        for z in range(0, 50):
            df.loc[idx] = [x, key2_values[y], key3_values[z], random.random(), random.random()]
            idx += 1
df.set_index(["key1", "key2", "key3"], inplace=True)

(In reality this DataFrame is populated from several API calls and a lot of math, but the end result is the same: a huge DataFrame with ~500,000 rows and keys matching the InnoDB table)
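
As an aside, building the sample frame row by row with df.loc is itself slow for 500,000 rows. Purely as an illustration (the real data comes from API calls), here is a sketch of a faster way to build the same frame, reusing the key2_values and key3_values lists from above:

import itertools
import random
import pandas as pd

# Generate every (key1, key2, key3) combination in one pass instead of
# appending rows one at a time with df.loc.
rows = [
    (x, k2, k3, random.random(), random.random())
    for x, k2, k3 in itertools.product(range(500), key2_values, key3_values)
]
df = pd.DataFrame(rows, columns=["key1", "key2", "key3", "value2", "value1"])
df.set_index(["key1", "key2", "key3"], inplace=True)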

To import this DataFrame into the table, I'm currently doing the following:

import time
import MySQLdb
conn = MySQLdb.connect(local_infile=1, **connection_params)
cur = conn.cursor()
# Disable data integrity checks -- I know the data is good
cur.execute("SET foreign_key_checks=0;")
cur.execute("SET unique_checks=0;")
# Append current time to the DataFrame
df["time"] = time.time()
df.set_index(["time"], append=True, inplace=True)
# Sort data in primary key order
df.sort_index(inplace=True)
# Dump the data to a CSV
with open("dump.csv", "w") as csv:
    df.to_csv(csv)
# Load the data
cur.execute(
    """
        load data local infile 'dump.csv'
        into table `my_table`
        fields terminated by ','
        enclosed by '"'
        lines terminated by '\n'
        ignore 1 lines
        (`key1`, `key2`, `key3`, `time`, `value2`, `value1`)
    """
)
# Clean up
cur.execute("SET foreign_key_checks=1;")
cur.execute("SET unique_checks=1;")
conn.commit()

All in all, the performance isn't too bad. I can import 500,000 rows in about 2 minutes. If possible, I'd like to make this faster.

Are there any tricks I'm missing or any changes I could make to get this down to 30-45 seconds?

Some notes:

  • I don't know if reordering the columns in the DataFrame will affect performance. Currently the order of columns in the DataFrame does not match the database
  • I don't know if changing the order of the columns in the database to match the order of the primary key will affect performance (currently "time" comes first, even though it's the fourth key of the index)
  • Altering the database config could be difficult, as I don't have direct access to the database server. I'm stuck with whatever hardware and configuration options are already present. Any performance improvements must come from my Python code
  • I can change the table definition (including changing the partitioning) however I would like to avoid this if possible as there is already a large amount of historic data and copying it to another table would take a long time. Losing this data is an option, but one I'd rather avoid
  • I cannot use set sql_log_bin=0; because I do not have the SUPER privilege on the database

Answer

I made three changes, and I didn't stop to measure performance between each one, so I can't be 100% certain of the exact impact of each change. I am, however, reasonably sure which change had the biggest impact.

Change 1 (almost certainly the biggest impact) -- Modified the primary key

Looking at how my script operates, you can see that all 500k rows I'm bulk inserting have the exact same value for time:

# Append current time to the DataFrame
df["time"] = time.time()

Making time the left-most column of the primary key meant that all of the rows I was inserting would be clustered together, rather than scattered across the table.

Of course the problem with this is that it makes the index useless for my most common query: returning all "times" for a given key1, key2, and key3 combination (e.g.: SELECT * FROM my_table WHERE key1 = ... AND key2 = ... AND key3 = ...)

To fix this, I had to add another key:

PRIMARY KEY (`time`, `key1`, `key2`, `key3`),
KEY (`key1`, `key2`, `key3`)
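
For reference, here is a rough sketch of how that key change might be applied from the same MySQLdb connection. The index name is an assumption (MySQL auto-names an unnamed index after its first column, so the original KEY (key3, key2, key1, time) is most likely called key3), and on a table with this much historic data the ALTER rewrites the table, which is exactly the copy cost the question wanted to avoid:

# Sketch only -- `key3` as the index name is a guess based on MySQL's
# auto-naming of unnamed indexes; DDL like this also causes an implicit commit.
cur.execute(
    """
        alter table `my_table`
        drop primary key,
        add primary key (`time`, `key1`, `key2`, `key3`),
        drop key `key3`,
        add key (`key1`, `key2`, `key3`)
    """
)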

Change 2 (may have had an impact) -- Modified column order

I adjusted the table so that the order of the columns matched the order of the primary key (time, key1, key2, key3)

I don't know if this had an effect, but it might have
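
A sketch of how that column reorder could be done; this is an assumption about the exact statement used, and like the key change it rewrites the table. With the original layout (time, key1, key3, key2, value1, value2), moving key2 ahead of key3 is enough to match the new primary key order:

# Sketch only -- moves `key2` so the physical column order becomes
# (time, key1, key2, key3, value1, value2), matching the new primary key.
cur.execute(
    """
        alter table `my_table`
        modify column `key2` char(2) not null after `key1`
    """
)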

Change 3 (may have had an impact) -- Reordered the columns in the DataFrame

I ran the following on my DataFrame:

df = df.reindex(columns=["value1", "value2"])

This sorted the columns to match the order they appear in the database. Between this and change 2, the rows could be imported exactly as they were, without needing to swap the order of columns. I don't know whether that has any impact on import performance.
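
Putting the three changes together, the dump-and-load part of the script ends up looking roughly like the sketch below. This is a reconstruction rather than the exact final code, and it assumes the table columns are now ordered (time, key1, key2, key3, value1, value2) as described in change 2:

import time
import MySQLdb

conn = MySQLdb.connect(local_infile=1, **connection_params)
cur = conn.cursor()
cur.execute("SET foreign_key_checks=0;")
cur.execute("SET unique_checks=0;")

# One identical timestamp for the whole batch, now the left-most PK column
df["time"] = time.time()
df.set_index(["time"], append=True, inplace=True)
# Match the table's index and column order: (time, key1, key2, key3, value1, value2)
df = df.reorder_levels(["time", "key1", "key2", "key3"])
df = df.reindex(columns=["value1", "value2"])
df.sort_index(inplace=True)
df.to_csv("dump.csv")

cur.execute(
    """
        load data local infile 'dump.csv'
        into table `my_table`
        fields terminated by ','
        enclosed by '"'
        lines terminated by '\n'
        ignore 1 lines
        (`time`, `key1`, `key2`, `key3`, `value1`, `value2`)
    """
)
cur.execute("SET foreign_key_checks=1;")
cur.execute("SET unique_checks=1;")
conn.commit()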

With these three changes, my import time is down from 2 minutes to 9 seconds! That's absolutely incredible.

I was worried about adding the extra key to the table, since additional indexes mean longer write times and more disk space, but the effect was almost negligible -- especially compared to the massive savings from clustering the primary key correctly.
