H2O-R: Apply custom library function on each row of H2OFrame

Question

After importing a relatively big table from MySQL into H2O on my machine, I tried to run a hashing algorithm (murmurhash from the R digest package) on one of its columns and save it back to H2O. As I found out, using as.data.frame on an H2OFrame object is not always advised: originally my H2OFrame is ~43k rows large, but the coerced DataFrame usually contains only ~30k rows for some reason (the same goes for using base::apply/base::sapply/etc. on the H2OFrame).
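A quick check along these lines (just a sketch, using the same frame name data) makes the discrepancy visible:

nrow(data)                  # H2OFrame row count, ~43k in my case
nrow(as.data.frame(data))   # coerced data.frame, usually only ~30k rows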

I found out there is an apply function for H2OFrames as well, but as far as I can tell, it can only be used with built-in R functions.

For example, my code looks like this:

data[, "subject"] <- h2o::apply(data[, "subject"], 2, function(x) 
                                digest(x, algo = "murmur32"))

I get the following error:

Error in .process.stmnt(stmnt, formalz, envs) : 
  Don't know what to do with statement: digest
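For comparison, an apply call that stays within operations H2O can translate on the backend does run; a minimal sketch, assuming a hypothetical numeric column "score" existed in data:

h2o::apply(data[, "score"], 2, function(x) x * 2)   # simple arithmetic is translated; external packages like digest are not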

I understand the fact that only the predefined functions from the Java backend can be used to manipulate H2O data, but is there perhaps another way to use the digest package from the client side without converting the data to a DataFrame? I was thinking that in the worst case, I will have to use the R-MySQL driver to load the data first, manipulate it as a DataFrame, and then upload it to the H2O cloud (a rough sketch of that fallback follows below). Thanks in advance for any help.
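Such a fallback might look roughly like this; the connection details, table name, and column name are placeholders:

library(DBI)
library(RMySQL)
library(digest)
library(h2o)

con <- dbConnect(RMySQL::MySQL(), dbname = "mydb", host = "localhost",
                 user = "user", password = "secret")
df  <- dbGetQuery(con, "SELECT * FROM mytable")
dbDisconnect(con)

df$subject <- sapply(df$subject, digest, algo = "murmur32")  # hash in plain R
data <- as.h2o(df)                                           # then push it to the H2O cluster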

Answer

Due to the way H2O works, it cannot support arbitrary user-defined functions applied to H2OFrames the way that you can apply any function to a regular R data.frame. We already use the Murmur hash function in the H2O backend, so I have added a JIRA ticket to expose it to the H2O R and Python APIs. What I would recommend in the meantime is to copy just the single column of interest from the H2O cluster into R, apply the digest function and then update the H2OFrame with the result.

The following code will pull the "subject" column into R as a 1-column data.frame. You can then use the base R apply function to apply the murmur hash to every row, and lastly you can copy the resulting 1-column data.frame back into the "subject" column in your original H2OFrame, called data.

library(digest)  # provides digest() with the murmur32 algorithm

sub <- as.data.frame(data[, "subject"])              # pull the column into R
subhash <- apply(sub, 1, digest, algo = "murmur32")  # hash each row
data[, "subject"] <- as.h2o(subhash)                 # copy the result back into the H2OFrame

Since you only have 43k rows, I would expect that you'd still be able to do this with no issues on even a mediocre laptop since you are only copying a single column from the H2O cluster to R memory (rather than the entire data frame).
