PySpark 2: KMeans "The input data is not directly cached"
Question
I don't know why I receive the message
WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
when I try to use Spark KMeans:
from pyspark.ml.clustering import KMeans

df_Part = assembler.transform(df_Part)
df_Part.cache()

wssse = float("inf")  # enter the loop on the first iteration
while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)  # within-set sum of squared errors
    k = k + 1
It says that my input (DataFrame) is not cached!
I tried printing df_Part.is_cached and it returned True, which means my DataFrame is cached, so why does Spark still warn me about this?
Answer
This message is generated by o.a.s.mllib.clustering.KMeans, and there is nothing you can really do about it without patching the Spark code.
Internally, o.a.s.ml.clustering.KMeans:

- converts the DataFrame to an RDD[o.a.s.mllib.linalg.Vector], and
- executes o.a.s.mllib.clustering.KMeans.
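The first step creates a brand-new RDD each time, so the cache status of the parent DataFrame does not carry over to it. A toy pure-Python analogy of that behavior (hypothetical classes, not Spark's actual API):

```python
# Toy analogy: caching a wrapper object does not cache objects derived from it.
class Frame:
    def __init__(self, data):
        self.data = data
        self.cached = False

    def cache(self):
        self.cached = True
        return self

    def to_rdd(self):
        # Returns a *new* derived object on every call,
        # with its own (unset) cache flag.
        return Frame(list(self.data))


df = Frame([1, 2, 3]).cache()
rdd = df.to_rdd()

assert df.cached is True    # like df_Part.is_cached == True
assert rdd.cached is False  # ...but the internally derived RDD is not cached
```

This mirrors what happens inside fit(): the conversion produces a fresh, uncached RDD regardless of whether the source DataFrame was cached.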
When you cache the DataFrame, the RDD used internally is not cached. This is why you see the warning. While it is annoying, I wouldn't worry too much about it.