PySpark 2: KMeans "The input data is not directly cached"
Question
I don't know why I receive this message:
WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
when I try to use Spark KMeans:
df_Part = assembler.transform(df_Part)
df_Part.cache()
while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)
    k = k + 1
It says that my input (a DataFrame) is not cached!
I printed df_Part.is_cached and got True, which means my DataFrame is cached. So why does Spark still warn me about this?
Recommended answer
This message is generated by o.a.s.mllib.clustering.KMeans, and there is nothing you can really do about it without patching the Spark code.
Internally, o.a.s.ml.clustering.KMeans:
- Converts the DataFrame to RDD[o.a.s.mllib.linalg.Vector].
- Executes o.a.s.mllib.clustering.KMeans.
While you cache the DataFrame, the RDD that is used internally is not cached. This is why you see the warning. While it is annoying, I wouldn't worry too much about it.