Spark SQL optimization techniques: loading CSV to ORC format in Hive


Problem Description

Hi, I have 90 GB of data in a CSV file. I'm loading this data into one temp table and then from the temp table into an ORC table using a select-insert command, but converting and loading the data into ORC format takes 4 hours in Spark SQL. Is there any kind of optimization technique I can use to reduce this time? As of now I'm not using any optimization technique; I'm just using Spark SQL to load data from the CSV file into a table (text format) and then from this temp table into the ORC table (using select insert), submitted with spark-submit as:

    spark-submit \
    --class class-name \
    jar-file

Or can I add any extra parameters to the spark-submit command to improve performance?
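For context, the usual spark-submit levers for a job like this are executor resources and shuffle parallelism. A minimal sketch with illustrative values (the numbers are assumptions to size against your own cluster, not recommendations):

    spark-submit \
    --class class-name \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 10 \
    --executor-cores 4 \
    --executor-memory 8g \
    --driver-memory 4g \
    --conf spark.sql.shuffle.partitions=200 \
    jar-file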

Scala code (sample):

    // all imports
    import org.apache.spark.sql.SparkSession

    object Demo {
      def main(args: Array[String]): Unit = {
        // SparkSession with Hive support enabled
        val spark = SparkSession.builder()
          .enableHiveSupport()
          .getOrCreate()

        // load the raw CSV into the temp (text-format) table
        val a1 = spark.sql("load data inpath 'filepath' overwrite into table table_name")

        // copy from the temp table into the ORC table
        val b1 = spark.sql("insert into tablename (all_column) select 'ALL_COLUMNS' from source_table")
      }
    }

Recommended Answer

"I'm just using spark sql and loading data from csv file to table (text format) and then from this temp table to orc table (using select insert)"


A two-step process is not needed here.

  • Read the file into a DataFrame like the sample below...

        val DFCsv = spark.read.format("csv")
          .option("sep", ",")
          .option("inferSchema", "true")
          .option("header", "true")
          .load("yourcsv")
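One caveat: inferSchema = "true" makes Spark take an extra pass over the input to work out column types, which is costly on a 90 GB file. If the layout is known up front, supplying an explicit schema skips that pass. A minimal sketch, assuming hypothetical columns id, name, and amount:

    import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType, DoubleType}

    // assumed column layout; replace with the real columns of your CSV
    val csvSchema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType),
      StructField("amount", DoubleType)
    ))

    val DFCsv = spark.read.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .schema(csvSchema)  // no inferSchema pass over the 90 GB input
      .load("yourcsv")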

  • If needed, you have to repartition (this may be the cause of the actual 4-hour delay, since you have not done it), because it's a large file, and then...
  • DFCsv.repartition(90) means it will/may repartition the CSV data into 90 almost equal parts, where 90 is a sample number; you can specify whatever number you want.

        DFCsv.write.format("orc")
          .partitionBy("yourpartitioncolumns")
          .saveAsTable("yourtable")

    or, to append into a table that already exists (insertInto takes the partitioning from the table definition, so partitionBy must not be combined with it):

        DFCsv.write.format("orc")
          .insertInto("yourtable")
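Putting the pieces together, a minimal end-to-end sketch under the same placeholder names (yourcsv, yourpartitioncolumns, yourtable), skipping the temp text table entirely:

    // read CSV -> repartition -> write ORC in a single pass
    val df = spark.read.format("csv")
      .option("sep", ",")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("yourcsv")

    df.repartition(90)  // spread the 90 GB across tasks for parallel writes
      .write.format("orc")
      .partitionBy("yourpartitioncolumns")
      .saveAsTable("yourtable")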
      

Note: 1) For large data you need to repartition to distribute the data uniformly; this will increase parallelism and hence performance.

2) If you don't have partition columns and the table is non-partitioned, then there is no need for partitionBy in the samples above.
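For the non-partitioned case, the write reduces to (same assumed table name):

    // non-partitioned target table: the same write without partitionBy
    DFCsv.write.format("orc")
      .saveAsTable("yourtable")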

