Spark inconsistency when running count command
Question
A question about inconsistency of Spark calculations. Does this exist? For example, I am running EXACTLY the same command twice, e.g.:
imp_sample.where(col("location").isNotNull()).count()
And I am getting slightly different results every time I run it (141,830, then 142,314)! Or this:
imp_sample.where(col("location").isNull()).count()
and getting 2,587,013, and then 2,586,943. How is it even possible? Thank you!
Answer
As per your comment, you are using sampleBy in your pipeline. sampleBy doesn't guarantee you'll get the exact fraction of rows: it takes a sample in which each record is included with probability equal to its fraction, so the result can vary from run to run.
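A minimal pure-Python sketch of that sampling behaviour (not Spark itself, just the per-record Bernoulli draw that sampleBy performs; the function name and data here are illustrative). Note that sampleBy also accepts a seed argument, which makes the draw reproducible:

```python
import random

def sample_by(rows, key, fractions, seed=None):
    # Per-record Bernoulli sampling: each row is kept independently
    # with probability fractions[key(row)], like DataFrame.sampleBy.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(key(r), 0.0)]

rows = [("a", i) for i in range(100_000)]
fractions = {"a": 0.5}

# Without a seed, the kept count drifts around 50,000 from run to run,
# which is exactly the count instability described in the question.
n1 = len(sample_by(rows, lambda r: r[0], fractions))
n2 = len(sample_by(rows, lambda r: r[0], fractions))

# With a fixed seed the draw is reproducible, so the counts agree.
s1 = len(sample_by(rows, lambda r: r[0], fractions, seed=42))
s2 = len(sample_by(rows, lambda r: r[0], fractions, seed=42))
```

Even in Spark, a fixed seed only stabilises the result as long as the input partitioning stays the same.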
Regarding your monotonically_increasing_id question in the comments, it only guarantees that each id is larger than the previous one; it does not guarantee that the ids are consecutive (i, i+1, i+2, etc.).
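To see why the ids are not consecutive, here is a sketch of how they are constructed: per the Spark docs, the partition index goes in the upper 31 bits and the row's position within its partition in the lower 33 bits (the helper function below is illustrative, not Spark's actual code):

```python
def monotonic_ids(partition_sizes):
    # Mimics monotonically_increasing_id: id = (partition_id << 33) + row,
    # where row is the record's position within its partition.
    ids = []
    for pid, size in enumerate(partition_sizes):
        for row in range(size):
            ids.append((pid << 33) + row)
    return ids

ids = monotonic_ids([3, 2])  # two partitions with 3 and 2 rows
# ids == [0, 1, 2, 8589934592, 8589934593]
# Strictly increasing, but with a huge jump at the partition boundary.
```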
Finally, you can persist a data frame by calling persist() on it.
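In Spark that would be imp_sample.persist() before running the counts. The pure-Python sketch below only models why this helps: without persisting, every action re-evaluates the whole (non-deterministic) pipeline, while persisting materialises the rows once so later actions agree (build_sample is a stand-in for the sampleBy step, not a Spark API):

```python
import random

def build_sample():
    # Stands in for a non-deterministic transformation such as sampleBy:
    # every evaluation draws a fresh sample.
    return [x for x in range(100_000) if random.random() < 0.5]

# Unpersisted: each "action" re-runs the pipeline, so counts can disagree.
count1 = len(build_sample())
count2 = len(build_sample())

# "Persisted": the result is materialised once; every later action sees
# the same rows, so repeated counts are identical.
persisted = build_sample()
count3 = len(persisted)
count4 = len(persisted)
```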