H2O not running in parallel

Question

I have created a data frame and want to convert it to an H2O Frame.

To do that, I run:

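library(h2o)
library(data.table)  # provides data.table(), which the original snippet used without loading
h2o.init(nthreads = -1)  # -1 asks H2O to use all available cores
df <- data.table(matrix(0, ncol = 46, nrow = 30000))
df <- as.h2o(df)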

When I run htop on the command line, I see that only one of the 4 available processors is working. Is there a way to make it use all of them?

Thanks!

Recommended answer

There are two factors at work here.

1) The first is that you are using as.h2o(), which is the not-very-efficient "push" method of ingesting data (the client pushes the data to the server).

It is meant for small data and for convenience, which is fine in this case: the dataset you created has 30,000 rows, which is small data.

If you want H2O to ingest data efficiently, you need to use the "pull" method, where H2O pulls the data from the data store into H2O's memory. In R, this is h2o.importFile().

2) The second factor is that H2O uses chunking of data (contiguous rows in the dataset) to get data parallelism. The number of chunks per column directly affects the number of threads that can work in parallel. Once a dataset is read in, if it has only 1 chunk per column, then it will only be able to use 1 thread (and hence 1 core). You can see the number of chunks per column by looking at how the data was parsed in the H2O Flow Web UI.

I ran your program above; the Frame Distribution Summary for the resulting H2O Frame shows that the number of chunks per column is 1.

Running the same program again with 3,000,000 rows gives 66 chunks per column.

This is much better, because once you start working with the data in H2O (for example, training a model), you will get up to 66 threads running in parallel on a distributed cluster.

[ Note that for the bigger case, the data ingestion itself took a few minutes on my laptop and was still slow and single-threaded, because it uses the inefficient as.h2o() "push" approach. If you wrote the dataset out to a CSV file and had H2O parse it with the h2o.importFile() "pull" approach, it would be much faster. ]
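
The "pull" approach described in the note can be sketched roughly as follows. This is a minimal illustration, not a tuned recipe: it assumes a local H2O cluster started by h2o.init(), so that the server can read files written by the R session, and the file location csv_path is a hypothetical choice made only for this example.

```r
library(h2o)

# Start (or connect to) a local H2O cluster using all available cores
h2o.init(nthreads = -1)

# Same 30,000 x 46 table of zeros as in the question, as a plain data frame
df <- as.data.frame(matrix(0, ncol = 46, nrow = 30000))

# Write the data to disk so the H2O server can read it directly.
# csv_path is a hypothetical location chosen for this sketch.
csv_path <- file.path(tempdir(), "df.csv")
write.csv(df, csv_path, row.names = FALSE)

# "Pull" ingestion: the H2O server parses the file itself
hf <- h2o.importFile(path = csv_path)

# Check that the frame arrived with the expected shape
dim(hf)
```

With a dataset this small, the frame may still end up with very few chunks per column, so the benefit of this approach should mainly show up on larger files.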
