Filtering a Spark partitioned table is not working in PySpark
Problem Description
I am using Spark 2.3 and have written a DataFrame out as a Hive partitioned table using the DataFrameWriter class in PySpark:
newdf.coalesce(1).write.format('orc').partitionBy('veh_country').mode("overwrite").saveAsTable('emp.partition_Load_table')
Here is my table structure and partition information:
hive> desc emp.partition_Load_table;
OK
veh_code varchar(17)
veh_flag varchar(1)
veh_model smallint
veh_country varchar(3)
# Partition Information
# col_name data_type comment
veh_country varchar(3)
hive> show partitions partition_Load_table;
OK
veh_country=CHN
veh_country=USA
veh_country=RUS
Now I am reading this table back into a DataFrame in PySpark:
df2_data = spark.sql("""
SELECT *
FROM emp.partition_Load_table
""")
df2_data.show()   # this works
But I am not able to filter it using the partition key column:
from pyspark.sql.functions import col
newdf = df2_data.where(col("veh_country")=='CHN')
I am getting the following error message:
: java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive.
You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem,
however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
Caused by: MetaException(message:Filtering is supported only on partition keys of type string)
Whereas when I create the DataFrame by specifying the absolute HDFS path of the table, the filter and where clauses work as expected:
newdataframe = spark.read.format("orc").option("header","false").load("hdfs/path/emp.db/partition_load_table")
The following works:
newdataframe.where(col("veh_country")=='CHN').show()
My question is: why is it not able to filter the DataFrame in the first place? And why does it throw the error message "Filtering is supported only on partition keys of type string" even though my veh_country column is defined as a string/varchar datatype?
Recommended Answer
I have stumbled on this issue as well. What helped for me was to run this line:
spark.sql("SET spark.sql.hive.manageFilesourcePartitions=False")
and then to use spark.sql(query) instead of the DataFrame API.
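For example, a minimal sketch of that workaround, using the table and column names from the question and filtering in SQL rather than with where():

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Per the error message, disabling managed file source partitions avoids
# the metastore filter call (at the cost of some performance).
spark.sql("SET spark.sql.hive.manageFilesourcePartitions=False")

# Filter in SQL instead of df.where()/df.filter()
filtered = spark.sql(
    "SELECT * FROM emp.partition_Load_table WHERE veh_country = 'CHN'"
)
filtered.show()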
I do not know what happens under the hood, but this solved my problem.
Although it might be too late for you (since this question was asked 8 months ago), this might help other people.
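A different workaround, not from the original answer and only a hedged sketch based on the MetaException above ("Filtering is supported only on partition keys of type string"): recreate the table with the partition column cast to a plain string instead of varchar, so the metastore can filter on it directly.

from pyspark.sql.functions import col

# Hypothetical rewrite of the table from the question: cast the varchar
# partition key to string before saving, so that metastore partition
# filtering is supported on it.
newdf.withColumn("veh_country", col("veh_country").cast("string")) \
     .coalesce(1) \
     .write.format("orc") \
     .partitionBy("veh_country") \
     .mode("overwrite") \
     .saveAsTable("emp.partition_Load_table")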