MySQL huge tables JOIN makes database collapse

Problem description

Following my recent question, "Select information from last item and join to the total amount", I am having memory problems while generating tables.

I have two tables, sales1 and sales2, that look like this:

id | dates | customer | sale

With this table definition:

CREATE TABLE sales (
    id int auto_increment primary key, 
    dates date,
    customer int,
    sale int
);

sales1 and sales2 have the same definition, but sales2 has sale = -1 in every row. A customer can be in neither table, in one, or in both. Both tables have around 300,000 records and many more fields than shown here (around 50 fields). Both are InnoDB.

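For each customer, I want to select:

  • the number of purchases
  • the last purchase value
  • the total amount of purchases (those with a positive value)

The query I am using is: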

SELECT a.customer, COUNT(a.sale), max_sale
FROM sales a
INNER JOIN (
    SELECT customer, sale AS max_sale
    FROM sales x
    WHERE dates = (SELECT MAX(dates)
                   FROM sales y
                   WHERE x.customer = y.customer
                     AND y.sale > 0)
) b ON a.customer = b.customer
GROUP BY a.customer, max_sale;

The problem is:

I need these results for certain calculations, separated by date: information for 2012, information for 2013, and also information for all years together.

Whenever I do just one year, it takes about 2-3 minutes to store all the information.

But when I try to gather information from all the years, the database crashes and I get messages like:

InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction')

It seems that joining such huge tables is too much for the database. When I explain the query, almost all of the time goes to creating tmp table.

I thought of splitting the data gathering into quarters: get the results for every three months, then join and sort them. But I guess this final join and sort will again be too much for the database.

So, what would you experts recommend to optimize these queries, given that I cannot change the table structure?

Recommended answer

300k rows is not a huge table. We frequently see 300 million row tables.

The biggest problem with your query is that you're using a correlated subquery, so it has to re-execute the subquery for each row in the outer query.

It's often the case that you don't need to do all your work in one SQL statement. There are advantages to breaking it up into several simpler SQL statements:

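  • Easier to code.
  • Easier to optimize.
  • Easier to debug.
  • Easier to read.
  • Easier to maintain if/when you have to implement new requirements.

Number of purchases
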
SELECT customer, COUNT(sale) AS number_of_purchases
FROM sales 
GROUP BY customer;

An index on sales(customer,sale) would be best for this query.

Last purchase value

This is the greatest-n-per-group problem that comes up frequently.

SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
 ON a.customer=b.customer AND a.dates < b.dates
WHERE b.customer IS NULL;

In other words, try to match row a to a hypothetical row b that has the same customer and a greater date. If no such row is found, then a must have the greatest date for that customer.

An index on sales(customer,dates,sale) would be best for this query.

If you might have more than one sale for a customer on that greatest date, this query will return more than one row per customer. You'd need to find another column to break the tie. If you use an auto-increment primary key, it's suitable as a tie breaker because it's guaranteed to be unique and it tends to increase chronologically.

SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
 ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL;
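
To see the tie-breaker at work, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for MySQL, which is an assumption made only to keep the example self-contained, and the sample rows are invented. Customer 1 has two sales on the same latest date, so the row with the larger id is treated as the last one.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " dates DATE, customer INT, sale INT)"
)
conn.executemany(
    "INSERT INTO sales (dates, customer, sale) VALUES (?, ?, ?)",
    [
        ("2013-01-10", 1, 50),
        ("2013-03-05", 1, 70),  # same latest date as the next row
        ("2013-03-05", 1, 90),  # larger id, so this row counts as "last"
        ("2012-12-01", 2, 30),
    ],
)

# Anti-join: keep row a only if no row b for the same customer is "later",
# where later means a greater date, or the same date with a greater id.
LAST_SALE = """
    SELECT a.customer, a.sale AS max_sale
    FROM sales a
    LEFT OUTER JOIN sales b
      ON a.customer = b.customer
     AND (a.dates < b.dates OR (a.dates = b.dates AND a.id < b.id))
    WHERE b.customer IS NULL
    ORDER BY a.customer
"""
last_sale = conn.execute(LAST_SALE).fetchall()
print(last_sale)  # [(1, 90), (2, 30)]
```

Without the id comparison, both of customer 1's rows dated 2013-03-05 would survive the anti-join.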

Total amount of purchases, with a positive value

SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE sale > 0
GROUP BY customer;

An index on sales(customer,sale) would be best for this query.
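
One way to use several simple statements instead of one large one is to run each query on its own and merge the rows by customer in application code. The following is only a sketch under assumptions: it uses Python's sqlite3 module in place of MySQL, the sample data is invented, and the dictionary keys such as last_sale are hypothetical names.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " dates DATE, customer INT, sale INT)"
)
conn.executemany(
    "INSERT INTO sales (dates, customer, sale) VALUES (?, ?, ?)",
    [
        ("2012-05-01", 1, 40),
        ("2013-02-01", 1, 60),
        ("2012-03-01", 2, -1),  # placeholder row, like those in sales2
        ("2012-07-15", 2, 25),
    ],
)

# One simple statement per figure; the keys are hypothetical column names.
QUERIES = {
    "number_of_purchases":
        "SELECT customer, COUNT(sale) FROM sales GROUP BY customer",
    "last_sale":
        "SELECT a.customer, a.sale FROM sales a"
        " LEFT OUTER JOIN sales b ON a.customer = b.customer"
        " AND (a.dates < b.dates OR (a.dates = b.dates AND a.id < b.id))"
        " WHERE b.customer IS NULL",
    "total_purchases":
        "SELECT customer, SUM(sale) FROM sales WHERE sale > 0 GROUP BY customer",
}

# Merge the three result sets into one dict keyed by customer.
report = {}
for column, sql in QUERIES.items():
    for customer, value in conn.execute(sql):
        report.setdefault(customer, {})[column] = value

for customer in sorted(report):
    print(customer, report[customer])
```

A customer whose only rows have sale = -1 would get no total_purchases entry, so downstream code should read that key with a default. Note also that COUNT(sale) counts the -1 placeholder row for customer 2.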

You should consider using NULL to signify a missing sale value instead of -1. Aggregate functions like SUM() and COUNT() ignore NULLs, so you don't have to use a WHERE clause to exclude rows with sale < 0.
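
The effect of NULL on aggregates can be checked with a small sketch. It assumes Python's sqlite3 module as a stand-in for MySQL and uses invented rows; COUNT(sale) and SUM(sale) skip NULLs, while COUNT(*) still counts every row.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, customer INT, sale INT)"
)
conn.executemany(
    "INSERT INTO sales (customer, sale) VALUES (?, ?)",
    [(1, 40), (1, None), (1, 60), (2, None)],  # None is stored as NULL
)

# No WHERE clause is needed: the aggregates ignore NULL sale values.
rows = conn.execute(
    """
    SELECT customer, COUNT(sale), COUNT(*), SUM(sale)
    FROM sales
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()
print(rows)  # [(1, 2, 3, 100), (2, 0, 1, None)]
```

Customer 2 has no non-NULL sale, so SUM(sale) is NULL (None in Python) rather than 0; wrap it in COALESCE if a zero is wanted.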

Re: your comment

What I have now is a table with fields year, quarter, total_sale (for the pair (year, quarter)) and sale. What I want to gather is information for a certain period: this quarter, several quarters, year 2011... The info has to be split into top customers, ones with bigger sales, etc. Would it be possible to get the last purchase value from customers with total_purchases bigger than 5?

Top five customers for Q4 2012

SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE (year, quarter) = (2012, 4) AND sale > 0
GROUP BY customer
ORDER BY total_purchases DESC
LIMIT 5;

I'd want to test it against real data, but I believe an index on sales(year, quarter, customer, sale) would be best for this query.

Last purchase value for customers with total purchases > 5

SELECT a.customer, a.sale as max_sale
FROM sales a
INNER JOIN sales c ON a.customer=c.customer
LEFT OUTER JOIN sales b
 ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL
GROUP BY a.id
HAVING COUNT(*) > 5;

As in the other greatest-n-per-group query above, an index on sales(customer,dates,sale) would be best for this query. It probably can't optimize both the join and the group by, so this will incur a temporary table. But at least it will only do one temporary table instead of many.
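
The query above can be tried on a small data set. This is a sketch under assumptions: Python's sqlite3 module stands in for MySQL and the rows are invented. Customer 1 has six sales and customer 2 has two, and the threshold is applied to the number of rows per customer, as in HAVING COUNT(*) > 5.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " dates DATE, customer INT, sale INT)"
)
# Customer 1: six sales (10, 20, ..., 60) on consecutive days.
sample = [(f"2013-01-{day:02d}", 1, day * 10) for day in range(1, 7)]
# Customer 2: only two sales, so it should be filtered out.
sample += [("2013-01-03", 2, 15), ("2013-01-04", 2, 35)]
conn.executemany(
    "INSERT INTO sales (dates, customer, sale) VALUES (?, ?, ?)", sample
)

# The join to c repeats each customer's last row once per purchase,
# so COUNT(*) per a.id equals that customer's number of purchases.
QUERY = """
    SELECT a.customer, a.sale AS max_sale
    FROM sales a
    INNER JOIN sales c ON a.customer = c.customer
    LEFT OUTER JOIN sales b
      ON a.customer = b.customer
     AND (a.dates < b.dates OR (a.dates = b.dates AND a.id < b.id))
    WHERE b.customer IS NULL
    GROUP BY a.id
    HAVING COUNT(*) > 5
"""
frequent_buyers = conn.execute(QUERY).fetchall()
print(frequent_buyers)  # [(1, 60)]
```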

These queries are complex enough. You shouldn't try to write a single SQL query that can give all of these results. Remember the classic quote from Brian Kernighan:

Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
