How to get data into h2o fast


Problem Description



What my question isn't:

Hardware/Space:

• 32 Xeon threads w/ ~256 GB RAM
• ~65 GB of data to upload (about 5.6 billion cells)

Problem:
It is taking hours to upload my data into h2o. This isn't any special processing, only as.h2o(...).

It takes less than a minute using fread to get the text into memory, and then I make a few row/column transformations (diffs, lags) and try to import.

The total R memory is ~56 GB before trying any sort of as.h2o, so the 128 GB allocated shouldn't be too crazy, should it?
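That allocation would be requested when the cluster starts, roughly like this sketch (assuming a fresh local single-node cluster; "128g" is the figure from the question, not a recommendation):

    library(h2o)
    # start a local H2O cluster with a 128 GB JVM heap, using all cores
    h2o.init(max_mem_size = "128g", nthreads = -1)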

Question:
What can I do to make this take less than an hour to load into h2o? It should take a minute to a few minutes, no longer.

What I have tried:

• bumping RAM up to 128 GB in h2o.init
• using slam, data.table, and options( ...
• converting to as.data.frame before as.h2o
• writing to a csv file (R's write.csv chokes and takes forever; it is writing a lot of GB though, so I understand)
• writing to sqlite3: too many columns for a table, which is weird
• checking drive cache/swap to make sure there are enough GB there; perhaps Java is using the cache (still working)

Update:
So it looks like my only option is to make a giant text file and then use h2o.importFile(...) on it. I'm up to 15 GB written.

Update2:
It is a hideous csv file, at ~22 GB (~2.4M rows, ~2300 cols). For what it's worth, it took from 12:53pm until 2:44pm to write the csv file. Importing it was substantially faster, after it was written.
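The workaround the updates describe looks roughly like this (a sketch; dt and the path are hypothetical, and fwrite is the fast multi-threaded writer the answer below points to):

    library(data.table)
    library(h2o)
    h2o.init()
    # write the frame to disk once, then let the cluster parse it in parallel
    fwrite(dt, "/tmp/big.csv")
    hf <- h2o.importFile("/tmp/big.csv")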

Solution

Think of as.h2o() as a convenience function that does these steps:

1. converts your R data to a data.frame, if it isn't one already
2. saves that data.frame to a temp file on local disk (it will use data.table::fwrite() if available (*), otherwise write.csv())
3. calls h2o.uploadFile() on that temp file
4. deletes the temp file
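So, assuming df is already a data.frame, as.h2o(df) behaves roughly like this sketch (variable names are illustrative):

    tmp <- tempfile(fileext = ".csv")
    data.table::fwrite(df, tmp)   # step 2; falls back to write.csv() without data.table
    hf <- h2o.uploadFile(tmp)     # step 3: pushed through the R client
    unlink(tmp)                   # step 4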

As your updates say, writing huge data files to disk can take a while. But the other pain point here is using h2o.uploadFile() instead of the quicker h2o.importFile(). The deciding factor between the two is visibility:

• With h2o.uploadFile(), your client has to be able to see the file.
• With h2o.importFile(), your cluster has to be able to see the file.

When your client is running on the same machine as one of your cluster nodes, your data file is visible to both client and cluster, so always prefer h2o.importFile(). (It does a multi-threaded import.)
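Assuming the client sits on a cluster node and the file is at a path both can see (the path is hypothetical):

    # preferred: the cluster parses the file itself, multi-threaded
    hf <- h2o.importFile("/data/big.csv")
    # slower alternative: the file is streamed through the R client
    # hf <- h2o.uploadFile("/data/big.csv")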

A couple of other tips: only bring into the R session the data you actually need there. And remember that both R and H2O are column-oriented, so cbind can be quick. If you only need to process 100 of your 2300 columns in R, keep them in one csv file and the other 2200 columns in another csv file. Then h2o.cbind() the two frames after loading each into H2O.
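A sketch of that split, with hypothetical file names and a 2300-column data.table dt (the 100/2200 split mirrors the example above):

    library(data.table)
    fwrite(dt[, 1:100],    "cols_for_r.csv")   # the columns R actually works on
    fwrite(dt[, 101:2300], "cols_rest.csv")    # the rest go straight to H2O
    a  <- h2o.importFile("cols_for_r.csv")
    b  <- h2o.importFile("cols_rest.csv")
    hf <- h2o.cbind(a, b)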

*: Use h2o:::as.h2o.data.frame (without parentheses) to see the actual code. For data.table writing you need to first do options(h2o.use.data.table = TRUE); you can also switch it on and off with the h2o.fwrite option.
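For example, to inspect the writer and switch the data.table path on:

    h2o:::as.h2o.data.frame             # print the function body (no parentheses)
    options(h2o.use.data.table = TRUE)  # make as.h2o() use data.table::fwrite()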
