How to retrieve large sets of data across multiple tables and avoid looping queries


Problem Description


First, sorry if the question has already been answered; I searched both here and on Google and couldn't find my answer. This question has surely been asked before, but it is hidden pretty deep under all the "Just use LEFT JOIN" and "store it in an array" answers.


I need to load a lot of data spread across multiple tables (it then gets inserted into another database engine, but that's not important here; I need to optimize my SELECTs).

The table layout is as follows:


  • Table A with an a_id field
  • Table B with a_id and b_id fields
  • Table C with b_id and c_id fields
  • ... (goes on like this for another 3-4 levels)


I currently access the data this way (pseudo code):

query1 = SELECT ... FROM TableA WHERE something=$something

foreach query1 as result1:

    query2 = SELECT ... FROM TableB WHERE a_id=result1.a_id

    foreach query2 as result2:

        query3 = SELECT ... FROM TableC WHERE b_id=result2.b_id

        foreach query3 as result3:
            // Another few levels of this, see the millions of SELECTs coming?
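
For reference, the usual way to collapse one level of this kind of N+1 loop is to fetch the children of a whole set of parents in a single query with an IN() list, then regroup them by parent id on the client side. A minimal PDO sketch of that idea ($pdo, $something and the exact column names are placeholders based on the simplified layout above, not the real schema):

<?php
// Minimal sketch only: assumes a PDO connection in $pdo and a filter value
// in $something; table and column names follow the simplified layout above.
$stmt = $pdo->prepare('SELECT * FROM TableA WHERE something = ?');
$stmt->execute([$something]);
$aRows = $stmt->fetchAll(PDO::FETCH_ASSOC);

if ($aRows) {
    // One query for the whole next level instead of one query per TableA row.
    $aIds = array_column($aRows, 'a_id');
    $in   = implode(',', array_fill(0, count($aIds), '?'));
    $stmt = $pdo->prepare("SELECT * FROM TableB WHERE a_id IN ($in)");
    $stmt->execute($aIds);

    // Regroup the children by parent id so they can be reattached to their
    // TableA objects without any further queries.
    $bByParent = [];
    foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $b) {
        $bByParent[$b['a_id']][] = $b;
    }
    // The same pattern repeats with b_id for TableC, and so on down the levels.
}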


The only solutions I have found so far are:

  1. Use the slow way and send multiple queries (the current solution, which already takes forever on my small test set).
  2. Use a ton of LEFT JOINs to get all the data in one single query. This means transferring a lot of data thousands of times over, plus some fancy logic on the client side to split it back into the appropriate tables, since each row will contain the contents of its parent tables. (I use OOP; each table maps to an object, and each object contains another object.)
  3. Store each object from Table A in an array, then load all of Table B into an array, and continue like this with Table C. Works for small sets, but mine is a few GB and won't fit in RAM at all.


Is there a way to avoid doing 10k queries per second in such a loop?


(I'm using PHP and converting from MySQL to MongoDB, which handles nested objects like these much better, if that helps.)


There seems to be some confusion about what I'm trying to do and why, so I will try to explain better: I need to do a batch conversion to a new structure. The new structure works very well; don't even bother looking at that. I'm remaking a very old website from scratch and chose MongoDB as my storage engine, because we have loads of nested data like this and it works very well for me. Switching back to MySQL is not even an option; the new structure and code are already well established, and I've been working on this for about a year now. I am not looking for a way to optimize the current schema; I can't. The data is the way it is, and I need to read the whole database. Once. Then I'm done with it.


All I need to do is import the data from the old website, process it and convert it so I can insert it into our new website. This is where MySQL comes in: the old site was a very ordinary PHP/MySQL site. We have a lot of tables (about 70, actually). We don't have many users, but each user has a ton of data spread across 7 tables.


What I currently do is loop over each user (1 query). For each of these users (70k), I load Table A, which contains 10-80 rows per user. I then query Table B on every iteration over A (so 10-80 times 70k queries), and it contains 1-16 rows for each A. Then comes Table C, which holds 1-4 rows for each B; we are now at roughly 4*80*70k queries. Then I have D with 1-32 rows for each C, E with 1-16 rows for each D, and F with 1-16 rows for each E. Table F has a couple of million rows.

Problems


  • I end up doing thousands, if not millions, of queries against the MySQL server, and the production database is not even on my local machine but 5-10 ms away. Even at 0.01 ms per query, I lose hours to network latency alone. I created a local replica so my restricted test set runs quite a bit faster, but it's still going to take a long while to download a few GB of data like this.


  • I could keep the members table in RAM, and maybe Table A, so I can download each of those tables in one shot instead of doing thousands of queries, but from Table B onwards it would be a real mess to track all of this in memory, especially since I use PHP (on the command line, at least), which uses a bit more memory than a C++ program would, where I could have tight control over RAM. So this solution doesn't work either.


  • I could JOIN all the tables together, but while that works for 2-3 tables, doing it for 7 tables would waste a huge amount of bandwidth, transferring the same parent data from the server millions of times for nothing (while also making the code that splits the rows back into the appropriate objects really complicated).


The question is: is there a way to not query the database so often? For example, could I tell the MySQL server, via a procedure or something, that I will need all these datasets in this order, so that I don't have to re-run a query for every row and the database just continually spits out data for me? The current problem is simply that I do so many queries that both the script AND the database are almost idle, because one is always waiting for the other. The queries themselves are actually very fast: basic prepared SELECT queries on indexed int fields.
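
For what it's worth, one building block that goes in this direction is an unbuffered query: with pdo_mysql, turning off PDO::MYSQL_ATTR_USE_BUFFERED_QUERY makes the client fetch rows as the server produces them instead of pulling the whole result set into memory first. A minimal sketch, assuming $dsn, $user and $pass point at the old database (placeholders, not part of the original setup):

<?php
// Hypothetical connection to the old MySQL database.
$pdo = new PDO($dsn, $user, $pass);

// Stream rows instead of buffering the whole result set in PHP memory.
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$stmt = $pdo->query('SELECT * FROM TableF');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // Process one row at a time; memory stays flat even for millions of rows.
    // Caveat: no other query can run on this connection until the statement
    // has been fully read, which is why the answer below mentions using
    // multiple connections.
}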


This is a problem I have always got myself into with MySQL in the past, and it never really caused me trouble until now. In its current state the script takes several hours, if not days, to complete. That's not THAT bad, but if there's a way to do better I'd appreciate knowing it. If not, then okay, I'll just wait for it to finish; at least it will run at most 3-4 times (2-3 test runs, have users check that their data is converted correctly, fix bugs, try again, then the final run with the last bug fixes).

Thanks in advance!

Recommended Answer


Thanks everyone for the answers. I came to the conclusion that I can't actually do it any other way.


My own solution is to set up a replica database (or just a copy, if a snapshot is enough) on localhost. That cuts out the network latency, lets both the script and the database reach 100% CPU usage, and seems to be the fastest I can get without reorganizing my script entirely.


Of course, this only works for one-time scripts. The correct way to handle this would be a mix of the two answers I got so far: use multiple unbuffered connections in threads, and process by batch (load 50 rows from Table A, store them in RAM, load all the related rows from Table B, store them in RAM, then process all of that and continue from Table A).
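
A rough sketch of the batching half of that idea, on a single connection and without threads ($pdo, the chunk size and the table names are assumptions for illustration, not the actual conversion code):

<?php
// Pull Table A in fixed-size chunks, then fetch every related Table B row for
// the whole chunk at once, as described above. Sketch only.
$batchSize = 50;
$offset    = 0;

do {
    $sql   = sprintf('SELECT * FROM TableA ORDER BY a_id LIMIT %d OFFSET %d',
                     $batchSize, $offset);
    $aRows = $pdo->query($sql)->fetchAll(PDO::FETCH_ASSOC);

    if ($aRows) {
        $aIds  = array_column($aRows, 'a_id');
        $in    = implode(',', array_fill(0, count($aIds), '?'));
        $bStmt = $pdo->prepare("SELECT * FROM TableB WHERE a_id IN ($in)");
        $bStmt->execute($aIds);
        $bRows = $bStmt->fetchAll(PDO::FETCH_ASSOC);

        // ... repeat for Tables C through F with the ids collected at each
        // level, then convert the whole batch and insert it into MongoDB.
    }

    $offset += $batchSize;
} while (count($aRows) === $batchSize);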

Thanks again for all the answers!
