如何使用 saveAsTextFile 在 spark 数据框中进行自定义分区 [英] How to do custom partition in spark dataframe with saveAsTextFile

查看:66
本文介绍了如何使用 saveAsTextFile 在 spark 数据框中进行自定义分区的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我已经在Spark中创建了数据,然后进行了join操作,最后我必须将输出保存到分区文件中.

I have created data in Spark and then performed a join operation, finally I have to save the output to partitioned files.

我正在将数据帧转换为 RDD,然后保存为允许我使用多字符分隔符的文本文件.我的问题是在这种情况下如何使用数据框列作为自定义分区.

I am converting data frame into RDD and then saving as text file that allows me to use multi-char delimiter. My question is to how use dataframe columns as custom partition in this case.

我不能为自定义分区使用以下选项,因为它不支持多字符分隔符:

I can not use below option for custom partition because it does not support multi-char delimiter:

dfMainOutput.write.partitionBy("DataPartiotion","StatementTypeCode")
  .format("csv")
  .option("delimiter", "^")
  .option("nullValue", "")
  .option("codec", "gzip")
  .save("s3://trfsdisu/SPARK/FinancialLineItem/output")

为了使用多字符分隔符,我在 RDD 中将其转换为如下代码:

To use multi-char delimiter I have converted this in RDD like below code:

dfMainOutput.rdd.map(x=>x.mkString("|^|")).saveAsTextFile("dir path to store")

但是在上面的选项中,我将如何根据DataPartiotion"和StatementTypeCode"列进行自定义分区?

But in above option how would I do custom partition based on the columns "DataPartiotion" and "StatementTypeCode"?

我是否必须再次从 RDD 转换回数据帧?

Do I have to convert back to again from RDD to a dataframe?

这是我尝试过的代码

val dfMainOutput = df1result.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
      .select($"LineItem_organizationId", $"LineItem_lineItemId",
      when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition_1").as("DataPartition_1"),
        when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").as("StatementTypeCode"),
        when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").alias("StatementtypeCode"),
        when($"LineItemName_1".isNotNull, $"LineItemName_1").otherwise($"LineItemName").as("LineItemName"),
        when($"LocalLanguageLabel_1".isNotNull, $"LocalLanguageLabel_1").otherwise($"LocalLanguageLabel").as("LocalLanguageLabel"),
        when($"FinancialConceptLocal_1".isNotNull, $"FinancialConceptLocal_1").otherwise($"FinancialConceptLocal").as("FinancialConceptLocal"),
        when($"FinancialConceptGlobal_1".isNotNull, $"FinancialConceptGlobal_1").otherwise($"FinancialConceptGlobal").as("FinancialConceptGlobal"),
        when($"IsDimensional_1".isNotNull, $"IsDimensional_1").otherwise($"IsDimensional").as("IsDimensional"),
        when($"InstrumentId_1".isNotNull, $"InstrumentId_1").otherwise($"InstrumentId").as("InstrumentId"),
        when($"LineItemSequence_1".isNotNull, $"LineItemSequence_1").otherwise($"LineItemSequence").as("LineItemSequence"),
        when($"PhysicalMeasureId_1".isNotNull, $"PhysicalMeasureId_1").otherwise($"PhysicalMeasureId").as("PhysicalMeasureId"),
        when($"FinancialConceptCodeGlobalSecondary_1".isNotNull, $"FinancialConceptCodeGlobalSecondary_1").otherwise($"FinancialConceptCodeGlobalSecondary").as("FinancialConceptCodeGlobalSecondary"),
        when($"IsRangeAllowed_1".isNotNull, $"IsRangeAllowed_1").otherwise($"IsRangeAllowed".cast(DataTypes.StringType)).as("IsRangeAllowed"),
        when($"IsSegmentedByOrigin_1".isNotNull, $"IsSegmentedByOrigin_1").otherwise($"IsSegmentedByOrigin".cast(DataTypes.StringType)).as("IsSegmentedByOrigin"),
        when($"SegmentGroupDescription".isNotNull, $"SegmentGroupDescription").otherwise($"SegmentGroupDescription").as("SegmentGroupDescription"),
        when($"SegmentChildDescription_1".isNotNull, $"SegmentChildDescription_1").otherwise($"SegmentChildDescription").as("SegmentChildDescription"),
        when($"SegmentChildLocalLanguageLabel_1".isNotNull, $"SegmentChildLocalLanguageLabel_1").otherwise($"SegmentChildLocalLanguageLabel").as("SegmentChildLocalLanguageLabel"),
        when($"LocalLanguageLabel_languageId_1".isNotNull, $"LocalLanguageLabel_languageId_1").otherwise($"LocalLanguageLabel_languageId").as("LocalLanguageLabel_languageId"),
        when($"LineItemName_languageId_1".isNotNull, $"LineItemName_languageId_1").otherwise($"LineItemName_languageId").as("LineItemName_languageId"),
        when($"SegmentChildDescription_languageId_1".isNotNull, $"SegmentChildDescription_languageId_1").otherwise($"SegmentChildDescription_languageId").as("SegmentChildDescription_languageId"),
        when($"SegmentChildLocalLanguageLabel_languageId_1".isNotNull, $"SegmentChildLocalLanguageLabel_languageId_1").otherwise($"SegmentChildLocalLanguageLabel_languageId").as("SegmentChildLocalLanguageLabel_languageId"),
        when($"SegmentGroupDescription_languageId_1".isNotNull, $"SegmentGroupDescription_languageId_1").otherwise($"SegmentGroupDescription_languageId").as("SegmentGroupDescription_languageId"),
        when($"SegmentMultipleFundbDescription_1".isNotNull, $"SegmentMultipleFundbDescription_1").otherwise($"SegmentMultipleFundbDescription").as("SegmentMultipleFundbDescription"),
        when($"SegmentMultipleFundbDescription_languageId_1".isNotNull, $"SegmentMultipleFundbDescription_languageId_1").otherwise($"SegmentMultipleFundbDescription_languageId").as("SegmentMultipleFundbDescription_languageId"),
        when($"IsCredit_1".isNotNull, $"IsCredit_1").otherwise($"IsCredit".cast(DataTypes.StringType)).as("IsCredit"),
        when($"FinancialConceptLocalId_1".isNotNull, $"FinancialConceptLocalId_1").otherwise($"FinancialConceptLocalId").as("FinancialConceptLocalId"),
        when($"FinancialConceptGlobalId_1".isNotNull, $"FinancialConceptGlobalId_1").otherwise($"FinancialConceptGlobalId").as("FinancialConceptGlobalId"),
        when($"FinancialConceptCodeGlobalSecondaryId_1".isNotNull, $"FinancialConceptCodeGlobalSecondaryId_1").otherwise($"FinancialConceptCodeGlobalSecondaryId").as("FinancialConceptCodeGlobalSecondaryId"),
        when($"FFAction_1".isNotNull, $"FFAction_1").otherwise((concat(col("FFAction"), lit("|!|"))).as("FFAction")))
        .filter(!$"FFAction".contains("D"))

val dfMainOutputFinal = dfMainOutput.select(concat_ws("|^|", columns.map(c => col(c)): _*).as("concatenated"))

   dfMainOutputFinal.write.partitionBy("DataPartition_1","StatementTypeCode")
  .format("csv")
  .option("codec", "gzip")
  .save("s3://trfsdisu/SPARK/FinancialLineItem/output")

推荐答案

这可以通过使用 concat_ws 来完成,这个函数的工作方式与 mkString 类似,但可以在直接在数据框上.这使得到 rdd 的转换步骤变得多余,并且可以使用 df.write.partitionBy() 方法.一个将连接所有可用列的小例子,

This can be done by using concat_ws, this function works similarly to mkString but can be performed on directly on dataframe. This makes the conversion step to rdd redundant and the df.write.partitionBy() method can be used. A small example that will concatenate all available columns,

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("01", "20000", "45.30"), ("01", "30000", "45.30"))
  .toDF("col1", "col2", "col3")

val df2 = df.select($"DataPartiotion", $"StatementTypeCode",
  concat_ws("|^|", df.schema.fieldNames.map(c => col(c)): _*).as("concatenated"))

这会给你一个这样的结果数据框,

This will give you a resulting dataframe like this,

+--------------+-----------------+------------------+
|DataPartiotion|StatementTypeCode|      concatenated|
+--------------+-----------------+------------------+
|            01|            20000|01|^|20000|^|45.30|
|            01|            30000|01|^|30000|^|45.30|
+--------------+-----------------+------------------+

这篇关于如何使用 saveAsTextFile 在 spark 数据框中进行自定义分区的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆