PySpark SQL count returns a different number of rows than pure SQL
Question
I've started using pyspark in one of my projects. While testing different commands to explore the library's functionality, I found something that I don't understand.
Take this code:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.dataframe import DataFrame
sc = SparkContext()
hc = HiveContext(sc)
hc.sql("use test_schema")
hc.table("diamonds").count()
The last count() operation returns 53941 records. If I instead run select count(*) from diamonds in Hive, I get 53940.
Is the pyspark count including the header?
I tried to investigate:
df = hc.sql("select * from diamonds").collect()
df[0]
df[1]
to see whether the header is included:
df[0] --> Row(carat=None, cut='cut', color='color', clarity='clarity', depth=None, table=None, price=None, x=None, y=None, z=None)
df[1] --> Row(carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55, price=326, x=3.95, y=3.98, z=2.43)
The 0th element doesn't look like the header.
Does anyone have an explanation for this?
Thanks!
Recommended answer
Hive can give incorrect counts when stale statistics are used to speed up calculations. To see if this is the problem, in Hive try:
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM diamonds;
Alternatively, refresh the statistics. If your table is not partitioned:
ANALYZE TABLE diamonds COMPUTE STATISTICS;
SELECT COUNT(*) FROM diamonds;
If it is partitioned:
ANALYZE TABLE diamonds PARTITION(partition_column) COMPUTE STATISTICS;
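If you issue these statements from a script, the two ANALYZE variants above can be assembled by a small helper. This is only a sketch: build_analyze_statement is a hypothetical name, not part of Hive or pyspark, and it does nothing more than build the statement text. How you submit it to Hive is left open.

```python
def build_analyze_statement(table, partition_columns=None):
    """Return the Hive ANALYZE statement for a table (hypothetical helper)."""
    if partition_columns:
        # Listing partition columns without values, as in the answer's example.
        spec = ", ".join(partition_columns)
        return "ANALYZE TABLE {} PARTITION({}) COMPUTE STATISTICS;".format(table, spec)
    return "ANALYZE TABLE {} COMPUTE STATISTICS;".format(table)


# Unpartitioned and partitioned forms of the statement.
print(build_analyze_statement("diamonds"))
print(build_analyze_statement("diamonds", ["partition_column"]))
```

After running the statement, repeat SELECT COUNT(*) FROM diamonds; to see whether the count changes.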
Also take another look at your first row (df[0] in your question). It does look like an improperly formatted header row: the string columns repeat the column names ('cut', 'color', 'clarity') and every numeric column is None.
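One rough way to check this in code is sketched below. The looks_like_header function is a hypothetical helper, and its heuristic is an assumption rather than anything Spark or Hive provides: it treats a row as a header if at least one string field repeats its own column name and every other field is None. The two rows from the question are written out as plain dicts so the sketch runs without a cluster.

```python
def looks_like_header(row):
    """Guess whether a row (as a dict) is a header line parsed as data.

    Heuristic (an assumption, not a Spark or Hive feature): at least one
    string field repeats its own column name, and every other field is None.
    """
    echoed = 0
    for name, value in row.items():
        if value is None:
            continue
        if isinstance(value, str) and value.lower() == name.lower():
            echoed += 1
        else:
            return False  # a real value that is not the column name
    return echoed > 0


# The two rows shown in the question, written out as plain dicts.
first = {"carat": None, "cut": "cut", "color": "color", "clarity": "clarity",
         "depth": None, "table": None, "price": None,
         "x": None, "y": None, "z": None}
second = {"carat": 0.23, "cut": "Ideal", "color": "E", "clarity": "SI2",
          "depth": 61.5, "table": 55, "price": 326,
          "x": 3.95, "y": 3.98, "z": 2.43}

print(looks_like_header(first))   # True
print(looks_like_header(second))  # False
```

With the collected rows from the question, each pyspark Row can be converted with row.asDict() before being passed in. If df[0] is indeed a header line, excluding it would bring the pyspark count down to 53940, the figure Hive reported.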