Spark inconsistency when running count command


Problem Description

A question about inconsistency of Spark calculations. Does this exist? For example, I am running EXACTLY the same command twice, e.g.:

imp_sample.where(col("location").isNotNull()).count()

And I am getting slightly different results every time I run it (141,830, then 142,314)! Or this:

imp_sample.where(col("location").isNull()).count()

and getting 2,587,013, and then 2,586,943. How is it even possible? Thank you!

Recommended Answer

As per your comment, you are using sampleBy in your pipeline. sampleBy doesn't guarantee you'll get an exact fraction of rows: it includes each record in the sample independently, with probability equal to the specified fraction, so the resulting count can vary from run to run.
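Below is a minimal PySpark sketch of this behaviour; the label column, the 50% fractions, and the row count are illustrative assumptions, not taken from the question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Synthetic data: 100,000 rows with a 0/1 "label" column to stratify on.
df = spark.range(100000).withColumn("label", F.col("id") % 2)

# Each record is kept independently with the given probability, so the
# sample size itself is random and two runs rarely match exactly.
print(df.sampleBy("label", fractions={0: 0.5, 1: 0.5}).count())
print(df.sampleBy("label", fractions={0: 0.5, 1: 0.5}).count())

# Supplying a seed makes the sampling reproducible (given the same data
# and partitioning), which is one way to pin the result down.
print(df.sampleBy("label", fractions={0: 0.5, 1: 0.5}, seed=42).count())
print(df.sampleBy("label", fractions={0: 0.5, 1: 0.5}, seed=42).count())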

Regarding your monotonically_increasing_id question in the comments: it only guarantees that each generated id is larger than the previous one; it does not guarantee that the ids are consecutive (i, i+1, i+2, etc.).
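A small illustration with synthetic data (not from the question) of how the generated ids jump between partitions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Six rows spread over three partitions: the ids are increasing but not
# consecutive, because the partition index lives in the high bits.
df = (
    spark.range(6)
    .repartition(3)
    .withColumn("gen_id", F.monotonically_increasing_id())
)
df.orderBy("gen_id").show(truncate=False)
# Typical output: 0, 1, 8589934592, 8589934593, 17179869184, 17179869185
# i.e. monotonically increasing, but nothing like i, i+1, i+2.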

Finally, you can persist a DataFrame by calling persist() on it.
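As a sketch of how persisting can stabilise the counts from the question (the source data, the label column, and the fractions below are assumptions, since the original pipeline isn't shown):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source data with a nullable "location" column and a "label"
# column used as the sampleBy stratum (names are illustrative).
df = spark.range(100000).select(
    (F.col("id") % 2).alias("label"),
    F.when(F.col("id") % 3 == 0, F.lit(None))
     .otherwise(F.col("id").cast("string"))
     .alias("location"),
)

# Without persist(), every count() action would re-run the unseeded sampleBy
# and draw a fresh sample, which is why the counts differ between runs.
# persist() is lazy: the first action materialises the sample and caches it,
# so later actions reuse the same rows (barring cache eviction).
imp_sample = df.sampleBy("label", fractions={0: 0.3, 1: 0.3}).persist()

not_null = imp_sample.where(F.col("location").isNotNull()).count()
is_null = imp_sample.where(F.col("location").isNull()).count()

# With the cached sample, the two partial counts add up to a stable total.
assert not_null + is_null == imp_sample.count()

imp_sample.unpersist()  # release the cache when done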
