Efficient way to maintain an h2o data frame


Question

Let's say I have a function getData() that returns data (think of it as a data stream). I need to build an h2o data frame from this data, inserting each incoming record as a new row only if it is not already present in the data frame.

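One obvious approach is:

  1. Keep a global h2o data frame.
  2. Create a one-row h2o data frame from each record as it arrives. (I am using as.h2o().)
  3. Check whether that row already exists in the global data frame (using h2o.which() or some other function).
  4. If it does not exist, append it to the global data frame (using h2o.rbind()).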

This solution is too slow. Creating an h2o data frame every time a record arrives (step 2) takes too much time. (I have only tested it on a small dataset.)

I was also thinking of storing the records in an R data frame and then calling h2o.rbind() at intervals.

What is the best way to do this, with speed as the priority?

Recommended answer

You definitely want to minimize calls to as.h2o(), since that function actually writes the data from R memory to disk and then reads it from disk into the H2O cluster. It's meant to be used sparingly. However, one way to speed up as.h2o() is to use data.table on the backend. If you have data.table installed, you can add the following lines to the top of your code, and as.h2o() will use data.table::fwrite() internally instead of utils::write.csv().

library(data.table)
options("h2o.use.data.table" = TRUE)

Since you want to minimize calls to as.h2o(), it will probably be faster to accumulate a few hundred or a few thousand rows in an R data.frame and periodically convert that data.frame to an H2OFrame with as.h2o() (using the data.table backend). You can then scan the rows of the new H2OFrame to see which ones are new and append those to your "global" H2OFrame with h2o.rbind().
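The batching idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's code: it checks for duplicates on the R side with a key lookup before buffering, rather than scanning the H2OFrame, and the names state, flush_size, row_key(), add_row() and flush_buffer() are all hypothetical. It also assumes that every record arrives as a one-row data.frame with the same columns and that an H2O cluster has been started with h2o.init() before the first flush.

```r
# Sketch only: buffer de-duplicated rows in R, then push them to H2O in batches.
# Assumes a running cluster (h2o::h2o.init()) before flush_buffer() is called.
options("h2o.use.data.table" = TRUE)  # as in the answer; needs data.table installed

state <- new.env()
state$seen      <- new.env(hash = TRUE)  # keys of every row accepted so far
state$buffer    <- list()                # new rows not yet sent to H2O
state$global_hf <- NULL                  # the "global" H2OFrame
flush_size <- 1000                       # hypothetical batch size; tune by testing

# Build a string key from a one-row data.frame (hypothetical helper).
row_key <- function(row) {
  paste0("k:", paste(vapply(row, as.character, character(1)), collapse = "\x1f"))
}

# Buffer a row only if an identical row has not been seen before.
add_row <- function(row) {
  key <- row_key(row)
  if (exists(key, envir = state$seen, inherits = FALSE)) return(invisible(FALSE))
  assign(key, TRUE, envir = state$seen)
  state$buffer[[length(state$buffer) + 1L]] <- row
  if (length(state$buffer) >= flush_size) flush_buffer()
  invisible(TRUE)
}

# Convert the whole buffer with a single as.h2o() call and append it.
flush_buffer <- function() {
  if (length(state$buffer) == 0L) return(invisible(NULL))
  batch <- h2o::as.h2o(do.call(rbind, state$buffer))
  if (is.null(state$global_hf)) {
    state$global_hf <- batch
  } else {
    state$global_hf <- h2o::h2o.rbind(state$global_hf, batch)
  }
  state$buffer <- list()
  invisible(NULL)
}
```

A caller would run add_row(getData()) for each incoming record and call flush_buffer() once at the end to push any remaining rows. Keeping the duplicate check in R avoids a round trip to the cluster per record, at the cost of holding every key in R memory; whether that trade-off pays off depends on your data.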

The only way to know for sure which method is faster is to test both on your own data and your own machine.
