PySpark 2: KMeans: The input data is not directly cached


Problem description

I don't know why I receive the message

WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.

when I try to use Spark KMeans:

from pyspark.ml.clustering import KMeans

# assembler, k, max_cluster, wssse and seuilStop are defined earlier
df_Part = assembler.transform(df_Part)
df_Part.cache()

while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)
    k = k + 1

It says that my input (DataFrame) is not cached!

I tried to print df_Part.is_cached and I received True, which means that my DataFrame is cached, so why does Spark still warn me about this?
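
For reference, this is a minimal sketch of how the DataFrame's cache state can be inspected from PySpark; it reuses the df_Part variable from the snippet above, and DataFrame.storageLevel is available from Spark 2.1 onwards:

# Check the cache state of the DataFrame itself
print(df_Part.is_cached)      # True once df_Part.cache() has been called
print(df_Part.storageLevel)   # usually MEMORY_AND_DISK for DataFrame.cache()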

Recommended answer

This message is generated by o.a.s.mllib.clustering.KMeans, and there is nothing you can really do about it without patching the Spark code.

Internally, o.a.s.ml.clustering.KMeans does roughly the following (a sketch follows the list below):

  • Converts the DataFrame to an RDD[o.a.s.mllib.linalg.Vector].
  • Executes o.a.s.mllib.clustering.KMeans.
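
A rough, PySpark-flavoured sketch of that conversion (the real code is Scala inside o.a.s.ml.clustering.KMeans, so the variable names here are illustrative only, and it assumes the assembler writes to a "features" column):

from pyspark.mllib.linalg import Vectors as OldVectors

# Illustrative only: the ml wrapper builds a *new* RDD of old-style mllib vectors
# from the features column before handing it to o.a.s.mllib.clustering.KMeans
instances = df_Part.select("features").rdd.map(lambda row: OldVectors.fromML(row[0]))
# `instances` is a freshly derived RDD: caching df_Part does not mark it as cached,
# which is what triggers the "input data is not directly cached" warning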

While you cache the DataFrame, the RDD which is used internally is not cached. This is why you see the warning. While it is annoying, I wouldn't worry too much about it.
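
To illustrate the same point (my own addition, not part of the original answer): any RDD derived from the cached DataFrame is a fresh, unpersisted RDD, which you can verify from PySpark:

# The DataFrame is cached, but an RDD derived from it carries no persistence level
print(df_Part.is_cached)               # True
print(df_Part.rdd.getStorageLevel())   # typically StorageLevel(False, False, False, False, 1), i.e. not persisted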
