PySpark 2: KMeans "The input data is not directly cached"


Problem description

I don't know why I receive this message:

WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.

when I try to use Spark KMeans:

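from pyspark.ml.clustering import KMeans

df_Part = assembler.transform(df_Part)
df_Part.cache()
# k, max_cluster, wssse and seuilStop are assumed to be initialized earlier
while k <= max_cluster and wssse > seuilStop:
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)  # within-set sum of squared errors
    k += 1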

It says that my input (a DataFrame) is not cached!

I printed df_Part.is_cached and got True, which means that my DataFrame is cached. So why does Spark still warn me about this?

Recommended answer

This message is generated by o.a.s.mllib.clustering.KMeans, and there is nothing you can really do about it without patching the Spark code.

Internally, o.a.s.ml.clustering.KMeans:

  • Converts the DataFrame to RDD[o.a.s.mllib.linalg.Vector].
  • Executes o.a.s.mllib.clustering.KMeans.

While you cache the DataFrame, the RDD that is used internally is not cached. This is why you see the warning. While it is annoying, I wouldn't worry too much about it.
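
The explanation above can be illustrated with a toy model in plain Python. This is only a sketch: ToyDataFrame, ToyRDD and toy_kmeans_fit are hypothetical stand-ins, not Spark APIs, and the check inside toy_kmeans_fit is an assumption about how such a warning could be triggered. It shows the one point that matters here: the cache flag lives on the DataFrame object, and the RDD derived from it starts out uncached.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD; tracks only its own cache flag."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.is_cached = False

    def cache(self):
        self.is_cached = True
        return self

    def map(self, fn):
        # A transformation yields a new object that is not cached.
        return ToyRDD(fn(row) for row in self.rows)


class ToyDataFrame:
    """Hypothetical stand-in for a DataFrame with its own cache flag."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.is_cached = False

    def cache(self):
        self.is_cached = True
        return self

    @property
    def rdd(self):
        # The conversion builds a fresh object; the DataFrame flag is not copied.
        return ToyRDD(self.rows)


def toy_kmeans_fit(df, warnings):
    # Step 1: convert the DataFrame to an RDD of vectors.
    vectors = df.rdd.map(lambda row: tuple(row))
    # Step 2: the inner algorithm only sees the RDD, so it checks that flag.
    if not vectors.is_cached:
        warnings.append("The input data is not directly cached")
    return vectors


df = ToyDataFrame([[0.0, 1.0], [2.0, 3.0]]).cache()
warnings = []
toy_kmeans_fit(df, warnings)
print(df.is_cached, warnings)
```

In this model the DataFrame reports that it is cached, yet the warning is still recorded, which mirrors the behaviour described in the question.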
