Pyspark sql count returns different number of rows than pure sql
Problem description
I've started using pyspark in one of my projects. I was testing different commands to explore the library's functionality and I found something I don't understand.
Take this code:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.dataframe import DataFrame

sc = SparkContext()
hc = HiveContext(sc)
hc.sql("use test_schema")
hc.table("diamonds").count()
The last count() operation returns 53941 records. If I instead run select count(*) from diamonds in Hive, I get 53940.
Is that pyspark count including the header?
I tried to investigate:
df = hc.sql("select * from diamonds").collect()
df[0]
df[1]
to see whether the header is included:
df[0] --> Row(carat=None, cut='cut', color='color', clarity='clarity', depth=None, table=None, price=None, x=None, y=None, z=None)
df[1] --> Row(carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55, price=326, x=3.95, y=3.98, z=2.43)
The 0th element doesn't look like the header.
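One more check worth making: count the rows whose string columns contain the column names themselves. In PySpark that could be something like `hc.table("diamonds").filter("cut = 'cut'").count()` (the column name is taken from the Row output above). The idea, as a minimal plain-Python sketch over hypothetical data:

```python
# Plain-Python sketch (hypothetical data): a CSV header line parsed as data
# shows up as a row whose string fields equal the column names and whose
# numeric fields are None -- exactly the shape of df[0] above.
rows = [
    {"carat": None, "cut": "cut", "color": "color"},   # suspect header row
    {"carat": 0.23, "cut": "Ideal", "color": "E"},
    {"carat": 0.21, "cut": "Premium", "color": "E"},
]

# Count rows where 'cut' holds the literal column name.
header_like = sum(1 for r in rows if r["cut"] == "cut")
print(header_like)  # 1 -> one header row slipped into the data
```

A nonzero result would confirm that a header line was loaded as a data row.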
Does anyone have an explanation for this?
Thanks! Ale
Recommended answer
Hive can give incorrect counts when stale statistics are used to speed up calculations. To see whether this is the problem, try this in Hive:
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM diamonds;
Alternatively, refresh the statistics. If your table is not partitioned:
ANALYZE TABLE diamonds COMPUTE STATISTICS;
SELECT COUNT(*) FROM diamonds;
If it is partitioned:
ANALYZE TABLE diamonds PARTITION(partition_column) COMPUTE STATISTICS;
Also take another look at your first row (df[0] in your question). It does look like an improperly formatted header row.
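If the extra row really is a CSV header that got loaded as data, the cleanest fix is on the Hive side: declare the table so the header line is skipped. A sketch, assuming the table is backed by a comma-delimited text file (the columns are taken from the Row output above; the storage format and property values are assumptions, not from the question):

```sql
-- Sketch only: file format and delimiter are assumptions.
CREATE EXTERNAL TABLE diamonds (
  carat DOUBLE, cut STRING, color STRING, clarity STRING,
  depth DOUBLE, `table` DOUBLE, price INT,
  x DOUBLE, y DOUBLE, z DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES ("skip.header.line.count" = "1");
```

Note that `skip.header.line.count` is honored by Hive's text-file reader, but some Spark versions have ignored this table property when reading the same table, which would also produce exactly this off-by-one count.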