Extracting a value from a data frame throws an error because of the . in the column name in Spark


Question

This is my existing data frame:

+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+--------------------------+-----------+--------------------+-----------+--------------------------------------------------------------------------------------------+-----------------------+------------------+-----------------------------+-----------------------+----------------------------------+
|DataPartition     |TimeStamp                |_lineItemId|_organizationId|fl:FinancialConceptGlobal|fl:FinancialConceptGlobalId|fl:FinancialConceptLocal|fl:FinancialConceptLocalId|fl:InstrumentId|fl:IsCredit|fl:IsDimensional|fl:IsRangeAllowed|fl:IsSegmentedByOrigin|fl:SegmentGroupDescription|fl:Segments|fl:StatementTypeCode|FFAction|!||LineItemName                                                                                |LineItemName.languageId|LocalLanguageLabel|LocalLanguageLabel.languageId|SegmentChildDescription|SegmentChildDescription.languageId|
+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+--------------------------+-----------+--------------------+-----------+--------------------------------------------------------------------------------------------+-----------------------+------------------+-----------------------------+-----------------------+----------------------------------+
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|3          |4298009288     |XTOT                     |3016350                    |null                    |null                      |null           |true       |false           |false            |false                 |null                      |null       |BAL                 |I|!|       |Total Assets                                                                                |505074                 |null              |null                         |null                   |null                              |

This is the schema of the data frame above:

root
 |-- DataPartition: string (nullable = true)
 |-- TimeStamp: string (nullable = true)
 |-- _lineItemId: long (nullable = true)
 |-- _organizationId: long (nullable = true)
 |-- fl:FinancialConceptGlobal: string (nullable = true)
 |-- fl:FinancialConceptGlobalId: long (nullable = true)
 |-- fl:FinancialConceptLocal: string (nullable = true)
 |-- fl:FinancialConceptLocalId: long (nullable = true)
 |-- fl:InstrumentId: long (nullable = true)
 |-- fl:IsCredit: boolean (nullable = true)
 |-- fl:IsDimensional: boolean (nullable = true)
 |-- fl:IsRangeAllowed: boolean (nullable = true)
 |-- fl:IsSegmentedByOrigin: boolean (nullable = true)
 |-- fl:SegmentGroupDescription: string (nullable = true)
 |-- fl:Segments: struct (nullable = true)
 |    |-- fl:SegmentSequence: struct (nullable = true)
 |    |    |-- _VALUE: long (nullable = true)
 |    |    |-- _segmentId: long (nullable = true)
 |-- fl:StatementTypeCode: string (nullable = true)
 |-- FFAction|!|: string (nullable = true)
 |-- LineItemName: string (nullable = true)
 |-- LineItemName.languageId: long (nullable = true)
 |-- LocalLanguageLabel: string (nullable = true)
 |-- LocalLanguageLabel.languageId: long (nullable = true)
 |-- SegmentChildDescription: string (nullable = true)
 |-- SegmentChildDescription.languageId: long (nullable = true)

I want to rename the columns of the data frame using the code below.

 val temp = dfTypeNew.select(dfTypeNew.columns.filter(x => !x.equals("fl:Segments")).map(x => col(x).as(x.replace("_", "LineItem_").replace("fl:", ""))): _*)

When I do that, I get the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Can't extract value from LineItemName#368: need struct type but got string;

When I rename only the columns whose names do not contain a ., the extraction works.

Recommended answer

The error occurs because a dot (.) is how Spark accesses a field of a struct column, so LineItemName.languageId is read as the field languageId of the string column LineItemName. To refer to a column whose name itself contains a dot, wrap the name in backticks, as below:

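  // toDF on a Seq needs the session implicits; spark-shell imports them
  // automatically, elsewhere run `import spark.implicits._` first.
  val df = Seq(
    ("a","b","c"),
    ("a","b","c")
  ).toDF("x", "y", "z.z")

  // The backticks make Spark treat z.z as a single column name.
  df.select("x", "`z.z`").show(false)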

Output

+---+---+
|x  |z.z|
+---+---+
|a  |c  |
|a  |c  |
+---+---+

Hope this helps!

Edited by Ramesh

@Anupam, all you had to do was apply the technique Shankar suggested above to your code, like this:

import org.apache.spark.sql.functions.col

val temp = dfTypeNew.select(dfTypeNew.columns.filter(x => !x.equals("fl:Segments")).map(x => col(s"`${x}`").as(x.replace("_", "LineItem_").replace("fl:", ""))): _*)

That is all.
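If it helps, the quoting and renaming rules can be pulled out into plain functions that can be checked without a Spark session. This is only a sketch: the object name ColumnNames and its methods quoted, renamed and mappings are hypothetical helpers, not part of Spark. It also assumes that a backtick inside a column name is escaped by doubling it, which is how Spark SQL treats quoted identifiers.

```scala
object ColumnNames {
  // Wrap a column name in backticks so Spark reads it as one identifier
  // instead of splitting it on the dot. A literal backtick inside the
  // name is escaped by doubling it (assumption stated above).
  def quoted(name: String): String =
    "`" + name.replace("`", "``") + "`"

  // Same renaming rule as the snippet above: expand "_" to "LineItem_"
  // and drop the "fl:" prefix.
  def renamed(name: String): String =
    name.replace("_", "LineItem_").replace("fl:", "")

  // Pair the quoted reference of every kept column with its new alias.
  def mappings(columns: Seq[String], dropped: Set[String]): Seq[(String, String)] =
    columns.filterNot(dropped.contains).map(c => quoted(c) -> renamed(c))
}
```

With these helpers the select could be written as dfTypeNew.select(ColumnNames.mappings(dfTypeNew.columns, Set("fl:Segments")).map { case (q, a) => col(q).as(a) }: _*). Note that the aliases still contain dots (for example LineItemName.languageId), so later references to those columns need backticks as well.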
