最快的方式子集 - data.table与MySQL [英] Fastest way to subset - data.table vs. MySQL

查看:156
本文介绍了最快的方式子集 - data.table与MySQL的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是一个R用户,我经常发现我需要编写需要子集化大型数据集(百万行)的函数。当我对大量的观察应用这样的函数时,如果我不小心我如何实现它,可能会非常耗时。

I'm an R user, and I frequently find that I need to write functions that require subsetting large datasets (10s of millions of rows). When I apply such functions over a large number of observations, it can get very time consuming if I'm not careful about how I implement it.

为了做到这一点,我有时使用data.table包,并且这提供比使用数据帧的子集化快得多的速度。最近,我开始尝试像RMySQL这样的包,将一些表推送到mysql,并使用该包运行sql查询并返回结果。

To do this, I have sometimes used the data.table package, and this provides much faster speeds than subsetting using data frames. Recently, I've started experimenting with packages like RMySQL, pushing some tables to mysql, and using the package to run sql queries and return results.

我发现混合性能改进。对于较小的数据集(百万),似乎将数据加载到data.table中并设置正确的键可以加快子集。对于较大的数据集(10到100万),它发出一个查询到mysql移动速度更快。

I have found mixed performance improvements. For smaller datasets (millions), it seems that loading up the data into a data.table and setting the right keys makes for faster subsetting. For larger datasets (10s to 100s of millions), it appears the sending out a query to mysql moves faster.

想知道是否有人深入了解哪种技术应该更快地返回简单的子集化或聚合查询,以及这是否应该取决于数据的大小?我理解在data.table中设置键有点类似于创建一个索引,但我没有更多的直觉。除了这个。

Was wondering if anyone has any insight into which technique should return simple subsetting or aggregation queries faster, and whether or not this should depend on the size of the data? I understand that setting keys in data.table is somewhat analogous to creating an index, but I don't have much more intuition beyond that.

推荐答案

如果数据适合RAM,data.table更快。如果你提供一个例子,它很可能会变得很明显,很快,你使用data.table不好。您是否阅读过 data.table wiki 上的不要做的事情?

If the data fits in RAM, data.table is faster. If you provide an example it will probably become evident, quickly, that you're using data.table badly. Have you read the "do's and don'ts" on the data.table wiki?

SQL具有下限,因为它是一个行存储。如果数据适合RAM(64位是相当一点),那么data.table不仅仅是因为它在RAM中,而是因为列在内存中是连续的(最小化从RAM到页面操作的L2的页面)。正确地使用data.table,它应该比SQL的下限快。这在FAQ 3.1中解释。如果你看到与data.table缓慢,那么你使用data.table的错误率很高(或者有一个性能错误,我们需要修复)。所以,请阅读data.table wiki后发布一些测试。

SQL has a lower bound because it is a row store. If the data fits in RAM (and 64bit is quite a bit) then data.table is faster not just because it is in RAM but because columns are contiguous in memory (minimising page fetches from RAM to L2 for column operations). Use data.table correctly and it should be faster than SQL's lower bound. This is explained in FAQ 3.1. If you're seeing slower with data.table, then chances are very high that you're using data.table incorrectly (or there's a performance bug that we need to fix). So, please post some tests, after reading the data.table wiki.

这篇关于最快的方式子集 - data.table与MySQL的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆