How to delete a particular month from a parquet file partitioned by month
Question
I have monthly Revenue data for the last 5 years, and I am storing the DataFrames for the respective months in parquet format in append mode, partitioned by the month column. Here is the pseudo-code below -
def Revenue(filename):
    df = spark.read.load(filename)
    .
    .
    df.write.format('parquet').mode('append').partitionBy('month').save('/path/Revenue')
Revenue('Revenue_201501.csv')
Revenue('Revenue_201502.csv')
Revenue('Revenue_201503.csv')
Revenue('Revenue_201504.csv')
Revenue('Revenue_201505.csv')
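The five calls differ only in the month suffix, so (purely as an illustration, since Revenue is the author's pseudo-code) the filenames could be generated in a loop:

```python
# Generate the monthly CSV filenames instead of spelling each call out.
filenames = ["Revenue_2015{:02d}.csv".format(m) for m in range(1, 6)]
print(filenames)

# Each name would then be fed to the Revenue() pseudo-function above:
# for name in filenames:
#     Revenue(name)
```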
The df gets stored in parquet format on a monthly basis, as can be seen below -
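The screenshot that followed here did not survive; assuming Spark's default Hive-style partition naming, the saved layout under /path/Revenue would look roughly like this (illustrative filenames, not from the source):

```
/path/Revenue/month=2015-01-01/part-00000-<uuid>.snappy.parquet
/path/Revenue/month=2015-02-01/part-00000-<uuid>.snappy.parquet
/path/Revenue/month=2015-03-01/part-00000-<uuid>.snappy.parquet
...
```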
Question: How can I delete the parquet folder corresponding to a particular month?
One way would be to load all these parquet files into a big df, use a .where() clause to filter out that particular month, and then save it back in parquet format, partitioned by month, in overwrite mode, like this -
# If we want to remove data from Feb, 2015
from pyspark.sql.functions import col, lit

df = spark.read.format('parquet').load('Revenue.parquet')
df = df.where(col('month') != lit('2015-02-01'))
df.write.format('parquet').mode('overwrite').partitionBy('month').save('/path/Revenue')
But this approach is quite cumbersome.
The other way is to directly delete the folder of that particular month, but I am not sure if that's the right way to approach things, lest we alter the metadata in an unforeseeable way.
What would be the right way to delete the parquet data for a particular month?
Answer
The statement below will only delete the metadata related to the partition information.
ALTER TABLE db.yourtable DROP IF EXISTS PARTITION(loaded_date="2019-08-22");
If you want to delete the data as well, you need to set the tblproperties of your Hive external table to FALSE; this turns your Hive table into a managed table.
alter table db.yourtable set TBLPROPERTIES('EXTERNAL'='FALSE');
You can set it back to an external table afterwards:
alter table db.yourtable set TBLPROPERTIES('EXTERNAL'='TRUE');
I tried setting the given properties using the Spark session, but ran into an issue:
spark.sql("""alter table db.test_external set tblproperties ("EXTERNAL"="TRUE")""")
pyspark.sql.utils.AnalysisException: u"Cannot set or change the preserved property key: 'EXTERNAL';"
I am sure there must be some way to do this. I ended up using Python: I defined the function below in pyspark and it did the job.
query=""" hive -e 'alter table db.yourtable set tblproperties ("EXTERNAL"="FALSE");ALTER TABLE db.yourtable DROP IF EXISTS PARTITION(loaded_date="2019-08-22");' """

def delete_partition():
    import subprocess
    import sys
    p = subprocess.Popen(query, shell=True, stderr=subprocess.PIPE)
    stdout, stderr = p.communicate()
    if p.returncode != 0:
        print(stderr)
        sys.exit(1)
>>> delete_partition()
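The date is hard-coded in the query string above. A small, hypothetical helper (not part of the original answer) can parameterise it before the string is handed to subprocess; note the caller must still trust the inputs, since the result is run through a shell:

```python
def build_drop_query(table, partition_col, value):
    # Builds the same two-statement "hive -e" command as above, with the
    # table, partition column, and partition value filled in.
    return (
        "hive -e '"
        'alter table {t} set tblproperties ("EXTERNAL"="FALSE");'
        'ALTER TABLE {t} DROP IF EXISTS PARTITION({c}="{v}");'
        "'"
    ).format(t=table, c=partition_col, v=value)

q = build_drop_query("db.yourtable", "loaded_date", "2019-08-22")
print(q)
```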
This will delete both the metadata and the data. Note: I have tested this with a Hive ORC external partitioned table, which is partitioned on loaded_date -
# Partition Information
# col_name data_type comment
loaded_date string
Update: Basically, your data is lying at the HDFS location in subdirectories named like -
/Revenue/month=2015-02-01
/Revenue/month=2015-03-01
/Revenue/month=2015-04-01
and so on.
def delete_partition(month_delete):
    hdfs_path = "/some_hdfs_location/Revenue/month="
    final_path = hdfs_path + month_delete
    import subprocess
    subprocess.call(["hadoop", "fs", "-rm", "-r", final_path])
    print("got deleted")
delete_partition("2015-02-01")
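A possible hardening of the helper above, sketched as an assumption rather than part of the original answer: validate the month string before issuing the destructive delete, so a malformed argument cannot expand into an unintended recursive removal. The names partition_path and the /Revenue base here are illustrative.

```python
import re

def partition_path(base, month):
    # Hypothetical guard: refuse anything that does not look like
    # YYYY-MM-DD before it is spliced into the path passed to
    # "hadoop fs -rm -r".
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", month):
        raise ValueError("expected a month like '2015-02-01', got %r" % (month,))
    return base.rstrip("/") + "/month=" + month

# The validated path would then be handed to the delete_partition logic
# above, e.g.:
print(partition_path("/some_hdfs_location/Revenue", "2015-02-01"))
```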