Spark inconsistency when running count command
Question
A question about inconsistency of Spark calculations. Does this exist? For example, I am running EXACTLY the same command twice, e.g.:
imp_sample.where(col("location").isNotNull()).count()
And I am getting slightly different results every time I run it (141,830, then 142,314)! Or this:
imp_sample.where(col("location").isNull()).count()
and getting 2,587,013, and then 2,586,943. How is it even possible? Thank you!
Answer
As per your comment, you are using sampleBy in your pipeline. sampleBy doesn't guarantee you'll get the exact fraction of rows: it takes a sample in which each record is included with probability equal to its fraction, so the result can vary from run to run.
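A minimal pure-Python sketch of that sampling behaviour (not Spark itself, just the per-record Bernoulli draw that sampleBy performs; the function name and data here are illustrative). Note that sampleBy also accepts a seed argument, which makes the draw reproducible:

```python
import random

def sample_by(rows, key, fractions, seed=None):
    # Per-record Bernoulli sampling: each row is kept independently
    # with probability fractions[key(row)], like DataFrame.sampleBy.
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fractions.get(key(r), 0.0)]

rows = [("a", i) for i in range(100_000)]
fractions = {"a": 0.5}

# Without a seed, the kept count drifts around 50,000 from run to run,
# which is exactly the count instability described in the question.
n1 = len(sample_by(rows, lambda r: r[0], fractions))
n2 = len(sample_by(rows, lambda r: r[0], fractions))

# With a fixed seed the draw is reproducible, so the counts agree.
s1 = len(sample_by(rows, lambda r: r[0], fractions, seed=42))
s2 = len(sample_by(rows, lambda r: r[0], fractions, seed=42))
```

Even in Spark, a fixed seed only stabilises the result as long as the input partitioning stays the same.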
Regarding your monotonically_increasing_id question in the comments, it only guarantees that each id is larger than the previous one; it does not guarantee that the ids are consecutive (i, i+1, i+2, etc.).
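To see why the ids are not consecutive, here is a sketch of how they are constructed: per the Spark docs, the partition index goes in the upper 31 bits and the row's position within its partition in the lower 33 bits (the helper function below is illustrative, not Spark's actual code):

```python
def monotonic_ids(partition_sizes):
    # Mimics monotonically_increasing_id: id = (partition_id << 33) + row,
    # where row is the record's position within its partition.
    ids = []
    for pid, size in enumerate(partition_sizes):
        for row in range(size):
            ids.append((pid << 33) + row)
    return ids

ids = monotonic_ids([3, 2])  # two partitions with 3 and 2 rows
# ids == [0, 1, 2, 8589934592, 8589934593]
# Strictly increasing, but with a huge jump at the partition boundary.
```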
Finally, you can persist a data frame by calling persist() on it.
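In Spark that would be imp_sample.persist() before running the counts. The pure-Python sketch below only models why this helps: without persisting, every action re-evaluates the whole (non-deterministic) pipeline, while persisting materialises the rows once so later actions agree (build_sample is a stand-in for the sampleBy step, not a Spark API):

```python
import random

def build_sample():
    # Stands in for a non-deterministic transformation such as sampleBy:
    # every evaluation draws a fresh sample.
    return [x for x in range(100_000) if random.random() < 0.5]

# Unpersisted: each "action" re-runs the pipeline, so counts can disagree.
count1 = len(build_sample())
count2 = len(build_sample())

# "Persisted": the result is materialised once; every later action sees
# the same rows, so repeated counts are identical.
persisted = build_sample()
count3 = len(persisted)
count4 = len(persisted)
```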