Why does df.limit keep changing in Pyspark?


Question

I'm creating a data sample from some dataframe df with

rdd = df.limit(10000).rdd

This operation takes quite some time (why actually? can it not short-cut after 10000 rows?), so I assume I have a new RDD now.

However, when I now work on rdd, it returns different rows every time I access it, as if it resamples over and over again. Caching the RDD helps a bit, but surely that's not safe?
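
For reference, "caching" here means something along these lines (a minimal sketch, not a guaranteed fix):

rdd = df.limit(10000).rdd
rdd.cache()   # mark the RDD to be kept in memory
rdd.count()   # force materialization; later actions reuse the cached rows while they stay in memory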

What is the reason behind this?

Update: Here is a reproduction on Spark 1.5.2:

from operator import add
from pyspark.sql import Row

# 1,000,000 rows across 100 partitions; take 1000 of them, then sum that
# column three times; each pass can produce a different total.
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 100)
rdd1 = rdd.toDF().limit(1000).rdd
for _ in range(3):
    print(rdd1.map(lambda row: row.i).reduce(add))

The output is:

499500
19955500
49651500

I'm surprised that .rdd doesn't fix the data.

To show that it gets more tricky than the re-execution issue, here is a single action which produces incorrect results on Spark 2.0.0.2.5.0:

from pyspark.sql import Row

# Self-join the limited RDD inside a single action and count the pairs.
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 200)
rdd1 = rdd.toDF().limit(12345).rdd
rdd2 = rdd1.map(lambda x: (x, x))
rdd2.join(rdd2).count()
# result is 10240 despite doing a self-join

Basically, whenever you use limit, your results might be wrong. I don't mean "just one of many samples", but really incorrect (since in this case the result should always be 12345).

Answer

Because Spark is distributed, in general it's not safe to assume deterministic results. Your example is taking the "first" 10,000 rows of a DataFrame. Here, there's ambiguity (and hence non-determinism) in what "first" means. That will depend on the internals of Spark. For example, it could be the first partition that responds to the driver. That partition could change with networking, data locality, etc.
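
One way to remove that ambiguity (a sketch that goes beyond the original answer, assuming a column such as i that gives a unique ordering) is to sort before the limit, so that "first" is well defined:

sample = df.orderBy("i").limit(10000)   # "first" is now fixed by the sort order
rdd = sample.rdd

At the cost of a sort, the same rows come back on every execution.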

Even once you cache the data, I still wouldn't rely on getting the same data back every time, though I certainly would expect it to be more consistent than reading from disk.
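
If the same sample has to be reused across many actions, a more robust option (again just a sketch, not part of the original answer) is to materialize it once on the driver and re-parallelize it, assuming the 10,000 rows fit in driver memory:

sample_rows = df.limit(10000).collect()   # one arbitrary pick, executed exactly once
fixed_rdd = sc.parallelize(sample_rows)   # every later action sees exactly these rows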

