Using keras model in pyspark lambda map function


Question

I want to use a trained Keras model to predict scores inside a map lambda function in PySpark.

import numpy as np
from pyspark.sql import Row
from keras.models import load_model

def inference(user_embed, item_embed):
    feats = user_embed + item_embed
    # note: load_model() reloads the model from disk on every call
    dnn_model = load_model("best_model.h5")
    infer = dnn_model.predict(np.array([feats]), verbose=0, steps=1)
    return infer

iu_score = iu.map(lambda x: Row(userid=x.userid, entryid=x.entryid,
                                score=inference(x.user_embed, x.item_embed)))

It runs extremely slowly and gets stuck at the final stage soon after the code starts running.

[Stage 119:==================================================>(4048 + 2) / 4050]

In the htop monitor, only 2 of 80 cores are under full load; the other cores seem idle. So what should I do to make the model predict in parallel? iu has 300 million rows, so efficiency is important to me. Thanks.

I have set verbose=1 and the prediction log appears, but it seems the predictions run one by one instead of in parallel.

Answer

While writing this answer I did some research and found the question interesting. First, if efficiency is really important, invest a little time in recoding the whole thing without Keras. You can still use TensorFlow's high-level API (Models), and with a little effort extract the parameters and assign them to the new model. Beyond the fact that it is unclear what all the massive implementations in the wrapper framework add (is TensorFlow not a rich enough framework?), you will most likely run into backward-compatibility problems when upgrading. Really not recommended for production.

Having said that, can you pin down what exactly the problem is? For instance: are you using GPUs? Maybe they are overloaded? Can you wrap the whole thing so it does not exceed some capacity, and use a prioritizing system? If there are no priorities, you can use a simple queue. You can also check whether you really terminate TensorFlow's sessions, or whether many models running on the same machine interfere with one another. Many issues could cause this behavior; it would be great to have more details.

Regarding the parallel computation: you didn't implement anything that actually opens a thread or a process for this model, so I suspect PySpark just can't handle the whole thing on its own. Possibly the implementation (honestly, I didn't read the whole PySpark documentation) assumes the dispatched function runs fast enough, and the work doesn't get distributed as it should. PySpark is essentially a sophisticated implementation of map-reduce principles; the dispatched function plays the role of a mapping function in a single step, which can be problematic in your case. Although it is passed as a lambda expression, you should inspect more carefully which instances are slow, and on which machines they are running.
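One pattern worth noting here (a common Spark idiom, not something spelled out in this answer) is loading the model once per partition with mapPartitions instead of once per record with map. The sketch below uses a DummyModel that counts loads as a stand-in for load_model("best_model.h5"), so the pattern can run without Spark or Keras; in real code, per_partition would be passed to iu.mapPartitions.

```python
# Stand-in for the Keras model: counts how many times it was "loaded".
class DummyModel:
    loads = 0

    def __init__(self):
        DummyModel.loads += 1

    def predict(self, feats):
        return sum(feats)  # placeholder for dnn_model.predict(...)

def per_record(rows):
    # What iu.map(lambda x: inference(...)) effectively does:
    # the model is loaded again for every single record.
    return [DummyModel().predict(r) for r in rows]

def per_partition(rows):
    # What iu.mapPartitions(per_partition) would do:
    # the model is loaded once and reused for the whole partition.
    model = DummyModel()
    return [model.predict(r) for r in rows]

partition = [[1, 2], [3, 4], [5, 6]]

DummyModel.loads = 0
print(per_record(partition), DummyModel.loads)     # [3, 7, 11] 3

DummyModel.loads = 0
print(per_partition(partition), DummyModel.loads)  # [3, 7, 11] 1
```

With 300 million rows, avoiding a per-record model load is usually the first thing to try before reaching for a serving infrastructure.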

I strongly recommend you do the following:
Go to the official TensorFlow Serving docs and read how to really deploy a model. There is a protocol for communicating with the deployed models, called RPC, and also a RESTful API. Then, from PySpark, you can wrap the calls and connect to the served model. You can create a pool of as many models as you want, manage it in PySpark, distribute the computation over a network, and from there the sky and the CPUs/GPUs/TPUs are the limit (I'm still skeptical about the sky).

It would be great to get an update from you about the results :) You made me curious.

I wish you the best with this issue. Great question.

