How to delete a particular month from a parquet file partitioned by month

Problem Description

I have monthly Revenue data for the last 5 years, and I store the DataFrame for each month in parquet format in append mode, partitioned by the month column. Here is the pseudo-code -

def Revenue(filename):
    df = spark.read.load(filename)
    # ... transformations on df ...
    df.write.format('parquet').mode('append').partitionBy('month').save('/path/Revenue')

Revenue('Revenue_201501.csv')
Revenue('Revenue_201502.csv')
Revenue('Revenue_201503.csv')
Revenue('Revenue_201504.csv')
Revenue('Revenue_201505.csv')

The df gets stored in parquet format on a monthly basis, with one partition folder per month.

Question: How can I delete the parquet folder corresponding to a particular month?

One way would be to load all these parquet files into a big df, filter out that particular month with a .where() clause, and then save it back in parquet format, partitioned by month, in overwrite mode, like this -

# If we want to remove data from Feb, 2015
from pyspark.sql.functions import col, lit

df = spark.read.format('parquet').load('/path/Revenue')
df = df.where(col('month') != lit('2015-02-01'))
df.write.format('parquet').mode('overwrite').partitionBy('month').save('/path/Revenue')
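Note that Spark reads lazily, so reading from and overwriting the same path in a single job risks failing or losing the source data before it is fully read; writing the filtered df to a temporary path first (or caching it) is safer.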

But this approach is quite cumbersome.

The other way is to directly delete the folder for that particular month, but I am not sure if that's the right way to approach things, lest we alter the metadata in some unforeseen way.

What would be the right way to delete the parquet data for a particular month?

Recommended Answer

The statement below will only delete the metadata related to the partition, not the data itself:

ALTER TABLE db.yourtable DROP IF EXISTS PARTITION(loaded_date="2019-08-22");
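For the question's month-partitioned table, the same statement could be issued from a Spark session. A minimal sketch, assuming a hypothetical Hive table db.revenue partitioned by the month column:

# Hypothetical equivalent for the question's schema; the table name
# db.revenue and the partition value are assumptions for illustration.
spark.sql("ALTER TABLE db.revenue DROP IF EXISTS PARTITION (month='2015-02-01')")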

If you want to delete the data as well, you need to set the EXTERNAL table property of your Hive external table to FALSE. This turns your Hive table into a managed table.

alter table db.yourtable set TBLPROPERTIES('EXTERNAL'='FALSE');

You can set it back to an external table afterwards:

alter table db.yourtable set TBLPROPERTIES('EXTERNAL'='TRUE');

I tried setting the given properties using a Spark session, but ran into an issue:

 spark.sql("""alter table db.test_external set tblproperties ("EXTERNAL"="TRUE")""")
pyspark.sql.utils.AnalysisException: u"Cannot set or change the preserved property key: 'EXTERNAL';"
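As the error message indicates, Spark SQL treats EXTERNAL as a reserved table property, so it has to be changed through Hive itself.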

I am sure there must be some way to do this. I ended up using Python: I defined the function below in pyspark and it did the job -

query=""" hive -e 'alter table db.yourtable set tblproperties ("EXTERNAL"="FALSE");ALTER TABLE db.yourtable DROP IF EXISTS PARTITION(loaded_date="2019-08-22");' """

def delete_partition():
        print("I am here")
        import subprocess
        import sys
        p=subprocess.Popen(query,shell=True,stderr=subprocess.PIPE)
        stdout,stderr = p.communicate()
        if p.returncode != 0:
            print stderr
            sys.exit(1) 

>>> delete_partition()

This will delete both the metadata and the data. Note: I have tested this with a Hive ORC external partitioned table, which is partitioned on loaded_date -

# Partition Information
# col_name              data_type               comment

loaded_date             string

Update: Basically, your data is lying at the HDFS location in subdirectories named as follows -

/Revenue/month=2015-02-01
/Revenue/month=2015-03-01
/Revenue/month=2015-04-01

and so on, so the folder for a particular month can be removed directly -

def delete_partition(month_delete):
    print("I am here")
    hdfs_path = "/some_hdfs_location/Revenue/month="
    final_path = hdfs_path + month_delete
    import subprocess
    # Recursively remove the partition directory from HDFS
    subprocess.call(["hadoop", "fs", "-rm", "-r", final_path])
    print("got deleted")

delete_partition("2015-02-01")
