Why does the Mongo Spark connector return different and incorrect counts for a query?

Question

I'm evaluating the Mongo Spark connector for a project and I'm getting inconsistent results. Everything runs locally on my laptop: MongoDB server 3.4.5, Spark 2.2.0 (via PySpark), and Mongo Spark Connector 2.2.0 (the Scala 2.11 build). For my test DB I use the Enron dataset: http://mongodb-enron-email.s3-website-us-east-1.amazonaws.com/ I'm interested in Spark SQL queries, and when I started running simple count queries as a test, I received a different count on each run. Here is the output from my mongo shell:

> db.messages.count({'headers.To': 'eric.bass@enron.com'})
203

Here is some output from my PySpark shell:

In [1]: df = spark.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", "mongodb://127.0.0.1/enron_mail.messages").load()
In [2]: df.registerTempTable("messages")
In [3]: res = spark.sql("select count(*) from messages where headers.To='eric.bass@enron.com'")
In [4]: res.show()
+--------+                                                                      
|count(1)|
+--------+
|     162|
+--------+
In [5]: res.show()
+--------+                                                                      
|count(1)|
+--------+
|     160|
+--------+
In [6]: res = spark.sql("select count(_id) from messages where headers.To='eric.bass@enron.com'")
In [7]: res.show()
+----------+                                                                    
|count(_id)|
+----------+
|       161|
+----------+
In [8]: res.show()
+----------+                                                                    
|count(_id)|
+----------+
|       162|
+----------+

I searched Google for this issue but didn't find anything helpful. If anyone has an idea why this happens and how to handle it correctly, please share. I suspect that I either missed something or misconfigured something.

UPDATE: I solved my issue. The inconsistent counts were caused by MongoDefaultPartitioner, which wraps MongoSamplePartitioner, which in turn uses random sampling. To be honest, this seems like a rather odd default to me; I would personally prefer a slow but consistent partitioner. Details of the partitioner options can be found in the official configuration options documentation.

UPDATE: Copied the solution into an answer.

Recommended answer

I solved my issue. The inconsistent counts were caused by MongoDefaultPartitioner, which wraps MongoSamplePartitioner, which in turn uses random sampling. To be honest, this seems like a rather odd default to me; I would personally prefer a slow but consistent partitioner. Details of the partitioner options can be found in the official configuration options documentation.

Code:

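val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("uri", "mongodb://127.0.0.1/enron_mail.messages")
  // Replace the sampling-based default with a partitioner that paginates the collection by size.
  // The value is the partitioner's class name only, with no config-key prefix or trailing space.
  .option("partitioner", "MongoPaginateBySizePartitioner")
  .load()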
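Since the question uses PySpark, the same setting can be sketched in Python. This is a minimal sketch rather than a verified fix: it assumes the connector accepts the "partitioner" key as a reader option with the bare class name as its value, and the helper names mongo_read_options and load_messages are hypothetical, introduced here only for illustration.

```python
def mongo_read_options(uri, partitioner="MongoPaginateBySizePartitioner"):
    """Build reader options that select an explicit partitioner.

    Assumption: the connector reads the "partitioner" option and expects
    the partitioner's class name without any config-key prefix.
    """
    return {"uri": uri, "partitioner": partitioner}


def load_messages(spark, uri):
    """Load a collection as a DataFrame using the options built above.

    `spark` is an existing SparkSession, such as the one the PySpark
    shell provides.
    """
    reader = spark.read.format("com.mongodb.spark.sql.DefaultSource")
    return reader.options(**mongo_read_options(uri)).load()


# Usage inside the PySpark shell (requires a running MongoDB instance):
#   df = load_messages(spark, "mongodb://127.0.0.1/enron_mail.messages")
#   df.createOrReplaceTempView("messages")
#   spark.sql("select count(*) from messages "
#             "where headers.To='eric.bass@enron.com'").show()
```

Running the count several times and comparing it with the 203 reported by the mongo shell is a simple way to check whether the change takes effect in a given setup.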
