Improving MySQLdb load data infile performance


Question

I have a table that's roughly defined as follows in InnoDB:

create table `my_table` (
  `time` int(10) unsigned not null,
  `key1` int(10) unsigned not null,
  `key3` char(3) not null,
  `key2` char(2) not null,
  `value1` float default null,
  `value2` float default null,
  primary key (`key1`, `key2`, `key3`, `time`),
  key (`key3`, `key2`, `key1`, `time`)
) engine=InnoDB default character set ascii
partition by range(time) (
  partition start        values less than (0),
  partition from20180101 values less than (unix_timestamp('2018-02-01')),
  partition from20180201 values less than (unix_timestamp('2018-03-01')),
  ...,
  partition future       values less than maxvalue
)

Yes, the column order doesn't match the key order.

In Python I'm populating a DataFrame with 500,000 rows (this is probably not the most efficient way to do this, but serves as a sample for what the data may look like):

import random
import pandas as pd
key2_values = ["aaa", "bbb", ..., "ttt"]  # 20 distinct values
key3_values = ["aa", "ab", "ac", ..., "az", "bb", "bc", ..., "by"]  # 50 distinct values
df = pd.DataFrame([], columns=["key1", "key2", "key3", "value2", "value1"])
idx = 0
for x in range(0, 500):
    for y in range(0, 20):
        for z in range(0, 50):
            df.loc[idx] = [x, key2_values[y], key3_values[z], random.random(), random.random()]
            idx += 1
df.set_index(["key1", "key2", "key3"], inplace=True)

(In reality this DataFrame is populated from several API calls and a lot of math, but the end result is the same: a huge DataFrame with ~500,000 rows and keys matching the InnoDB table)
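
As an aside, building the sample frame row by row with df.loc is itself slow for 500,000 rows. Purely as an illustration (the real data comes from API calls), here is a sketch of a faster way to build the same frame, reusing the key2_values and key3_values lists from above:

import itertools
import random
import pandas as pd

# Generate every (key1, key2, key3) combination in one pass instead of
# appending rows one at a time with df.loc.
rows = [
    (x, k2, k3, random.random(), random.random())
    for x, k2, k3 in itertools.product(range(500), key2_values, key3_values)
]
df = pd.DataFrame(rows, columns=["key1", "key2", "key3", "value2", "value1"])
df.set_index(["key1", "key2", "key3"], inplace=True)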

To import this DataFrame into the table, I'm currently doing the following:

import time
import MySQLdb
conn = MySQLdb.connect(local_infile=1, **connection_params)
cur = conn.cursor()
# Disable data integrity checks -- I know the data is good
cur.execute("SET foreign_key_checks=0;")
cur.execute("SET unique_checks=0;")
# Append current time to the DataFrame
df["time"] = time.time()
df.set_index(["time"], append=True, inplace=True)
# Sort data in primary key order
df.sort_index(inplace=True)
# Dump the data to a CSV
with open("dump.csv", "w") as csv:
    df.to_csv(csv)
# Load the data
cur.execute(
    """
        load data local infile 'dump.csv'
        into table `my_table`
        fields terminated by ','
        enclosed by '"'
        lines terminated by '\n'
        ignore 1 lines
        (`key1`, `key2`, `key3`, `time`, `value2`, `value1`)
    """
)
# Clean up
cur.execute("SET foreign_key_checks=1;")
cur.execute("SET unique_checks=1;")
conn.commit()

All in all, the performance isn't too bad. I can import 500,000 rows in about 2 minutes. If possible, I'd like to make this faster.

Are there any tricks I'm missing or any changes I could make to get this down to 30-45 seconds?

Some notes:

  • I don't know if reordering the columns in the DataFrame will affect performance. Currently the order of columns in the DataFrame does not match the database
  • I don't know if changing the order of the columns in the database to match the order of the primary key will affect performance (currently "time" comes first, even though it's the fourth key of the index)
  • Altering the database config could be difficult, as I don't have direct access to the database server. I'm stuck with whatever hardware and configuration options are already present. Any performance improvements must come from my Python code
  • I can change the table definition (including changing the partitioning) however I would like to avoid this if possible as there is already a large amount of historic data and copying it to another table would take a long time. Losing this data is an option, but one I'd rather avoid
  • I cannot use set sql_log_bin=0; because I do not have the SUPER privilege on the database

Answer

I made three changes, and I didn't stop to measure performance between each one, so I can't be 100% certain of the exact impact of each change. I am, however, reasonably sure which change had the biggest impact.

Change 1 (almost certainly the biggest impact) -- Modified the primary key

Looking at how my script operates, you can see that all 500k rows I'm bulk inserting have the exact same value for time:

# Append current time to the DataFrame
df["time"] = time.time()

Making time the left-most column of the primary key meant that all of the rows I was inserting would be clustered together, rather than scattered across the table.

Of course the problem with this is that it makes the index useless for my most common query: returning all "times" for a given key1, key2, and key3 combination (e.g.: SELECT * FROM my_table WHERE key1 = ... AND key2 = ... AND key3 = ...)

To fix this, I had to add another key:

PRIMARY KEY (`time`, `key1`, `key2`, `key3`),
KEY (`key1`, `key2`, `key3`)
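
For reference, here is a rough sketch of how that key change might be applied from the same MySQLdb connection. The index name is an assumption (MySQL auto-names an unnamed index after its first column, so the original KEY (key3, key2, key1, time) is most likely called key3), and on a table with this much historic data the ALTER rewrites the table, which is exactly the copy cost the question wanted to avoid:

# Sketch only -- `key3` as the index name is a guess based on MySQL's
# auto-naming of unnamed indexes; DDL like this also causes an implicit commit.
cur.execute(
    """
        alter table `my_table`
        drop primary key,
        add primary key (`time`, `key1`, `key2`, `key3`),
        drop key `key3`,
        add key (`key1`, `key2`, `key3`)
    """
)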

Change 2 (may have had an impact) -- Modified column order

I adjusted the table so that the order of the columns matched the order of the primary key (time, key1, key2, key3)

I don't know if this had an effect, but it might have
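
A sketch of how that column reorder could be done; this is an assumption about the exact statement used, and like the key change it rewrites the table. With the original layout (time, key1, key3, key2, value1, value2), moving key2 ahead of key3 is enough to match the new primary key order:

# Sketch only -- moves `key2` so the physical column order becomes
# (time, key1, key2, key3, value1, value2), matching the new primary key.
cur.execute(
    """
        alter table `my_table`
        modify column `key2` char(2) not null after `key1`
    """
)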

Change 3 (may have had an impact) -- Reordered the columns in the DataFrame

I ran the following on my DataFrame:

df = df.reindex(columns=["value1", "value2"])

This sorted the columns to match the order they appear in the database. Between this and change 2, the rows could be imported exactly as they were, without needing to swap the order of columns. I don't know whether that has any impact on import performance.
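
Putting the three changes together, the dump-and-load part of the script ends up looking roughly like the sketch below. This is a reconstruction rather than the exact final code, and it assumes the table columns are now ordered (time, key1, key2, key3, value1, value2) as described in change 2:

import time
import MySQLdb

conn = MySQLdb.connect(local_infile=1, **connection_params)
cur = conn.cursor()
cur.execute("SET foreign_key_checks=0;")
cur.execute("SET unique_checks=0;")

# One identical timestamp for the whole batch, now the left-most PK column
df["time"] = time.time()
df.set_index(["time"], append=True, inplace=True)
# Match the table's index and column order: (time, key1, key2, key3, value1, value2)
df = df.reorder_levels(["time", "key1", "key2", "key3"])
df = df.reindex(columns=["value1", "value2"])
df.sort_index(inplace=True)
df.to_csv("dump.csv")

cur.execute(
    """
        load data local infile 'dump.csv'
        into table `my_table`
        fields terminated by ','
        enclosed by '"'
        lines terminated by '\n'
        ignore 1 lines
        (`time`, `key1`, `key2`, `key3`, `value1`, `value2`)
    """
)
cur.execute("SET foreign_key_checks=1;")
cur.execute("SET unique_checks=1;")
conn.commit()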

With these three changes, my import time is down from 2 minutes to 9 seconds! That's absolutely incredible.

I was worried about adding the extra key to the table, since additional indexes mean longer write times and more disk space, but the effect was almost negligible -- especially compared to the massive savings from clustering the primary key correctly.
