Spark inconsistency when running count command


Problem description

A question about inconsistency of Spark calculations. Does this exist? For example, I am running EXACTLY the same command twice, e.g.:

imp_sample.where(col("location").isNotNull()).count()

And I am getting slightly different results every time I run it (141,830, then 142,314)! Or this:

imp_sample.where(col("location").isNull()).count()

and getting 2,587,013, and then 2,586,943. How is it even possible? Thank you!

Recommended answer

As per your comment, you are using sampleBy in your pipeline. sampleBy doesn't guarantee you'll get the exact fraction of rows. It takes a sample in which each record is included independently with probability equal to its fraction, so the sample size can vary from run to run.
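A minimal pure-Python sketch of the per-record Bernoulli draw that sampleBy performs (no Spark needed; the function name and data here are illustrative, not Spark's API):

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    # Each row is kept independently with probability `fraction`,
    # mirroring the kind of sample sampleBy draws per stratum.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

rows = list(range(1_000_000))

# Two unseeded runs: both land near 500,000 but rarely match exactly,
# which is the same kind of drift seen in the question's two counts.
n1 = len(bernoulli_sample(rows, 0.5))
n2 = len(bernoulli_sample(rows, 0.5))
print(n1, n2)
```

Passing a fixed seed makes a single draw reproducible, but in Spark the sample can still be redrawn each time the unpersisted DataFrame is recomputed.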

Regarding your monotonically_increasing_id question in the comments, it only guarantees that the next id is larger than the previous one; however, it doesn't guarantee the ids are consecutive (i, i+1, i+2, etc.).
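A sketch of why the ids jump, assuming the bit layout documented for monotonically_increasing_id (partition id in the upper 31 bits, per-partition row number in the lower 33 bits):

```python
def monotonic_id(partition_id, row_in_partition):
    # Pack the partition id into the upper bits and the row number
    # within that partition into the lower 33 bits, as Spark does.
    return (partition_id << 33) + row_in_partition

# Two partitions with 3 rows each: ids are strictly increasing,
# but there is a huge gap at the partition boundary.
ids = [monotonic_id(p, r) for p in range(2) for r in range(3)]
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

So the ids are monotonic across the whole DataFrame, but never assume they are dense or start from any particular value within a partition.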

Finally, you can persist a data frame by calling persist() on it, so the sample is materialized once instead of being redrawn on every action.
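A pure-Python analogy (no Spark; an unseeded random draw stands in for an unpersisted sampleBy) showing why persisting stabilizes the count:

```python
import random

rows = list(range(100_000))

def draw_sample():
    # Unseeded Bernoulli draw, standing in for an unpersisted sampled
    # DataFrame: it is re-executed by every downstream action.
    return [r for r in rows if random.random() < 0.5]

# Recomputed on each "action": the two counts usually differ slightly.
count_a = len(draw_sample())
count_b = len(draw_sample())

# Materialize the sample once (analogous to calling persist() on the
# sampled DataFrame before counting): every later count sees the
# exact same rows, so the numbers agree.
persisted = draw_sample()
count_c = len(persisted)
count_d = len(persisted)
```

In PySpark terms, the fix is to persist the sampled DataFrame before running the two count() actions, so both counts read the same materialized sample.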
