Excluding records where subquery returns results that need to be mutually exclusive


Problem description

The title doesn't explain it well enough...

I run a query hourly to prepare MEMORY tables, which then serve some very heavy traffic for the next hour. Right now the query looks like this:

INSERT INTO tmp_table
SELECT DISTINCT SQL_NO_CACHE
    B.*, G.node
FROM books B
RIGHT JOIN book_genres G ON G.asin=B.asin
WHERE EXISTS
(
    SELECT 1 FROM genres K WHERE K.node=G.node
)
AND...[nothing special here]

The books table just holds book data, with a PRIMARY KEY on asin. book_genres contains asin and node fields and holds many nodes that are used in various parts of the site. However, tmp_table only needs the subset of nodes that appear in the genres table, hence the subquery. Hopefully this is enough without posting the full schema.

We now want to make certain genres mutually exclusive. That would mean running this after building tmp_table:

# Delete records that should have been excluded
DELETE T FROM tmp_table T INNER JOIN
(
    # D: non-fiction rows of books that also have a fiction genre
    SELECT N.asin, N.node FROM tmp_table N INNER JOIN
    (
        # F: every asin with at least one fiction genre
        SELECT DISTINCT asin FROM tmp_table
        INNER JOIN genres ON genres.node=tmp_table.node
        WHERE isFiction=1
    ) F
    ON F.asin=N.asin
    INNER JOIN genres ON genres.node=N.node
    WHERE genres.isNonFiction=1
) D
USING (asin, node)

So if an asin is in a genre with isFiction=1, all of its rows for genres with isNonFiction=1 are removed.

But this feels ugly: adding data with one query and then removing it again. Is there any way to combine this into one query and avoid the double pass? Or am I just making work for myself, given that (after some tweaks and edits) the above does run acceptably quickly?

Any ideas would be appreciated, thanks.

Adding an example:

books table:
asin        title       price     etc...
B111111111  Book1       $0.99     ....
B222222222  Book2       $0.99     ....
B333333333  Book3       $0.99     ....

book_genres table:
asin        node
B111111111  1111
B111111111  2222
B111111111  3333
B111111111  5555
B222222222  1111
B222222222  3333
B222222222  4444
B333333333  1111
B333333333  2222

genres table:
node    name         isFiction    isNonFiction
1111    Bestsellers  0            0
2222    Romance      1            0
3333    Biographies  0            1
4444    History      0            1

So after the INSERT INTO has run, tmp_table will look like this:

asin        title       node
B111111111  Book1       1111
B111111111  Book1       2222
B111111111  Book1       3333
B222222222  Book2       1111
B222222222  Book2       3333
B222222222  Book2       4444
B333333333  Book3       1111
B333333333  Book3       2222

Book1 has no record for node 5555 because that node is not in the genres table, so we don't need it in tmp_table. Everything else is pretty much a de-normalisation of the data: the WHERE clause picks out a subset of a few thousand books from the millions in the DB, which is much quicker to work with.
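To make the example above reproducible, here is a minimal sketch that loads the sample data and runs the first step. It rests on several assumptions: SQLite (via Python's sqlite3 module) stands in for MySQL, the column types are guesses, only asin and title are kept from books, and the RIGHT JOIN is written as a plain JOIN on the assumption that every book_genres row has a matching book. SQL_NO_CACHE and the elided extra conditions are left out.

```python
import sqlite3

# SQLite is only a stand-in for MySQL here; the column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (asin TEXT PRIMARY KEY, title TEXT);
CREATE TABLE book_genres (asin TEXT, node INTEGER);
CREATE TABLE genres (node INTEGER PRIMARY KEY, name TEXT,
                     isFiction INTEGER, isNonFiction INTEGER);
CREATE TABLE tmp_table (asin TEXT, title TEXT, node INTEGER);

INSERT INTO books VALUES
  ('B111111111', 'Book1'), ('B222222222', 'Book2'), ('B333333333', 'Book3');
INSERT INTO book_genres VALUES
  ('B111111111', 1111), ('B111111111', 2222), ('B111111111', 3333),
  ('B111111111', 5555), ('B222222222', 1111), ('B222222222', 3333),
  ('B222222222', 4444), ('B333333333', 1111), ('B333333333', 2222);
INSERT INTO genres VALUES
  (1111, 'Bestsellers', 0, 0), (2222, 'Romance', 1, 0),
  (3333, 'Biographies', 0, 1), (4444, 'History', 0, 1);
""")

# First step: keep only the book/node pairs whose node exists in genres.
conn.execute("""
INSERT INTO tmp_table
SELECT DISTINCT B.asin, B.title, G.node
FROM book_genres G
JOIN books B ON B.asin = G.asin
WHERE EXISTS (SELECT 1 FROM genres K WHERE K.node = G.node)
""")

rows = conn.execute(
    "SELECT asin, title, node FROM tmp_table ORDER BY asin, node"
).fetchall()
for row in rows:
    print(row)
```

With the sample data this yields the eight rows listed above; node 5555 is dropped because it has no entry in genres.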

The next step ensures that if a book has an isFiction node, all isNonFiction nodes for that book are deleted.

After the DELETE has run, the end result in tmp_table is:

asin        title       node
B111111111  Book1       1111
B111111111  Book1       2222
B222222222  Book2       1111
B222222222  Book2       3333
B222222222  Book2       4444
B333333333  Book3       1111
B333333333  Book3       2222

The only difference is that node 3333 has been removed for Book1, because Book1 is in node 2222, which has isFiction=1, and node 3333 has isNonFiction=1. Book2 is unchanged because it has no isFiction nodes. Similarly, Book3 is unchanged because it has no isNonFiction nodes.

At this stage it works with this implementation, although the run time has doubled from about 20-odd seconds to about 40. No surprise really, as the DELETE is a bit convoluted. It might be a decent enough solution, but I'd be happy if someone has an idea that would make the whole thing simpler or quicker.
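One possible way to avoid the double pass is to fold the exclusion into the INSERT itself: join to genres and skip a non-fiction row whenever the same asin also has a fiction genre. The sketch below is untested against the real schema and makes the same assumptions as the sample data (SQLite standing in for MySQL, guessed column types, a plain JOIN). It also checks for fiction genres in book_genres rather than in tmp_table, which only matches the DELETE version if the elided extra conditions do not filter out fiction rows.

```python
import sqlite3

# SQLite is only a stand-in for MySQL here; the column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (asin TEXT PRIMARY KEY, title TEXT);
CREATE TABLE book_genres (asin TEXT, node INTEGER);
CREATE TABLE genres (node INTEGER PRIMARY KEY, name TEXT,
                     isFiction INTEGER, isNonFiction INTEGER);
CREATE TABLE tmp_table (asin TEXT, title TEXT, node INTEGER);

INSERT INTO books VALUES
  ('B111111111', 'Book1'), ('B222222222', 'Book2'), ('B333333333', 'Book3');
INSERT INTO book_genres VALUES
  ('B111111111', 1111), ('B111111111', 2222), ('B111111111', 3333),
  ('B111111111', 5555), ('B222222222', 1111), ('B222222222', 3333),
  ('B222222222', 4444), ('B333333333', 1111), ('B333333333', 2222);
INSERT INTO genres VALUES
  (1111, 'Bestsellers', 0, 0), (2222, 'Romance', 1, 0),
  (3333, 'Biographies', 0, 1), (4444, 'History', 0, 1);
""")

# Single pass: the join to genres replaces the EXISTS filter, and the
# NOT (...) clause skips non-fiction rows for books that have a fiction genre.
conn.execute("""
INSERT INTO tmp_table
SELECT DISTINCT B.asin, B.title, G.node
FROM book_genres G
JOIN books B ON B.asin = G.asin
JOIN genres K ON K.node = G.node
WHERE NOT (
    K.isNonFiction = 1
    AND EXISTS (
        SELECT 1
        FROM book_genres G2
        JOIN genres K2 ON K2.node = G2.node
        WHERE G2.asin = G.asin AND K2.isFiction = 1
    )
)
""")

rows = conn.execute(
    "SELECT asin, title, node FROM tmp_table ORDER BY asin, node"
).fetchall()
for row in rows:
    print(row)
```

On the sample data this leaves the same seven rows as the INSERT followed by the DELETE. Whether it is faster on millions of rows depends on indexes on book_genres(asin) and genres(node), so it would need measuring.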


Recommended answer

And it was ugly. It works fine until there is any other significant load on the DB, and then everything becomes very slow. This is mostly down to the server's IO limitations, but the simpler approach was to get isFiction and isNonFiction into the MEMORY tables, so that the DELETE statement can look like this:

    DELETE tmp_table FROM tmp_table
         INNER JOIN
         (
            # Books that have both a fiction and a non-fiction genre
            SELECT asin
            FROM tmp_table
            GROUP BY asin
            HAVING MAX(isFiction) = 1
            AND MAX(isNonFiction) = 1
         ) D
         ON D.asin = tmp_table.asin
         WHERE tmp_table.isNonFiction = 1

In testing, this brings the entire process down from about 90 seconds to about 10 seconds.
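The effect of this flag-based DELETE can be checked on the sample data with a small sketch. As assumptions: SQLite stands in for MySQL, tmp_table is filled by hand with the two flags copied from genres, and the multi-table DELETE is swapped for an IN subquery because SQLite has no DELETE ... JOIN syntax. MySQL would reject this IN form for reading the target table in a subquery, which is one reason to keep the derived-table join there.

```python
import sqlite3

# SQLite is only a stand-in for MySQL here; tmp_table already carries the
# isFiction / isNonFiction flags, as the answer suggests.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tmp_table (asin TEXT, title TEXT, node INTEGER,
                        isFiction INTEGER, isNonFiction INTEGER);
INSERT INTO tmp_table VALUES
  ('B111111111', 'Book1', 1111, 0, 0),
  ('B111111111', 'Book1', 2222, 1, 0),
  ('B111111111', 'Book1', 3333, 0, 1),
  ('B222222222', 'Book2', 1111, 0, 0),
  ('B222222222', 'Book2', 3333, 0, 1),
  ('B222222222', 'Book2', 4444, 0, 1),
  ('B333333333', 'Book3', 1111, 0, 0),
  ('B333333333', 'Book3', 2222, 1, 0);
""")

# Remove non-fiction rows for every book that has both kinds of genre.
# The GROUP BY / HAVING MAX(...) part mirrors the derived table D above.
deleted = conn.execute("""
DELETE FROM tmp_table
WHERE isNonFiction = 1
  AND asin IN (
      SELECT asin
      FROM tmp_table
      GROUP BY asin
      HAVING MAX(isFiction) = 1 AND MAX(isNonFiction) = 1
  )
""").rowcount

rows = conn.execute(
    "SELECT asin, node FROM tmp_table ORDER BY asin, node"
).fetchall()
print(deleted, "row(s) deleted")
for row in rows:
    print(row)
```

Only Book1 has both a fiction and a non-fiction genre, so only its node 3333 row is removed. Because the flags live in tmp_table, the DELETE no longer needs to join back to genres.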
