Read CSV with last column as array of values (and the values are inside parentheses and separated by comma) in Spark


Question

I have a CSV file where the last column is inside parentheses and its values are separated by commas. The number of values in the last column is variable. When I read the file in as a DataFrame with some column names, as follows, I get Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match. My CSV file looks like this:

a1,b1,true,2017-05-16T07:00:41.0000000,2.5,(c1,d1,e1)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,f2,g2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,k2,f2)

What I finally want is something like this:

root
 |-- MId: string (nullable = true)
 |-- PId: string (nullable = true)
 |-- IsTeacher: boolean (nullable = true)
 |-- STime: datetype (nullable = true)
 |-- TotalMinutes: double (nullable = true)
 |-- SomeArrayHeader: array<string> (nullable = true)
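
For reference, that target schema written out explicitly as a Spark StructType might look like the sketch below (an illustration only; the question's datetype is interpreted here as TimestampType, to match the sample data):

import org.apache.spark.sql.types._

// Hypothetical explicit schema matching the printout above; "datetype" is
// read as TimestampType given the ISO-like timestamps in the sample rows.
val targetSchema = StructType(Seq(
  StructField("MId", StringType),
  StructField("PId", StringType),
  StructField("IsTeacher", BooleanType),
  StructField("STime", TimestampType),
  StructField("TotalMinutes", DoubleType),
  StructField("SomeArrayHeader", ArrayType(StringType))))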

Here is the code I have written so far:

// This read fails with "The number of columns doesn't match": the default
// CSV parsing splits the parenthesized last field on its embedded commas.
val infoDF =
  sqlContext.read.format("csv")
    .option("header", "false")
    .load(inputPath)
    .toDF(
      "MId",
      "PId",
      "IsTeacher",
      "STime",
      "TotalMinutes",
      "SomeArrayHeader")

I thought of reading the file without giving column names and then casting the columns after the fifth one to array type, but I ran into problems with the parentheses. Is there a way to do this at read time, telling Spark that the fields inside the parentheses are actually one field of type array?
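
A sketch of that idea (hypothetical, assuming the row layout shown above): bypass the CSV parser, read each line as plain text, capture the five leading fields plus the parenthesized tail with a regex, and split the tail into the array column:

import org.apache.spark.sql.functions.{col, regexp_extract, split}

// Hypothetical sketch: parse each raw line with a regex so the commas inside
// the parentheses never get a chance to split the last field.
val pattern = """^([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),\((.*)\)$"""
val parsed = spark.read.text(inputPath) // yields a single "value" column
  .select(
    regexp_extract(col("value"), pattern, 1).as("MId"),
    regexp_extract(col("value"), pattern, 2).as("PId"),
    regexp_extract(col("value"), pattern, 3).as("IsTeacher"),
    regexp_extract(col("value"), pattern, 4).as("STime"),
    regexp_extract(col("value"), pattern, 5).as("TotalMinutes"),
    split(regexp_extract(col("value"), pattern, 6), ",").as("SomeArrayHeader"))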

Answer

OK. The solution is tactical and specific to your case, but the following worked for me:

import org.apache.spark.sql.functions.{regexp_replace, split}
import spark.implicits._ // for the 'arr symbol-to-Column syntax

// Setting the quote character to "(" makes the parser treat everything from
// the opening parenthesis to the end of the line as a single field (no
// closing "(" ever terminates the quote), so the embedded commas survive.
val df = spark.read.option("quote", "(").csv("in/staff.csv").toDF(
  "MId",
  "PId",
  "IsTeacher",
  "STime",
  "TotalMinutes",
  "arr")
df.show()

// Strip the leftover ")" and split the remaining string into an array.
val df2 = df.withColumn("arr", split(regexp_replace('arr, "[)]", ""), ","))
df2.printSchema()
df2.show()

Output:

+---+---+---------+--------------------+------------+---------------+
|MId|PId|IsTeacher|               STime|TotalMinutes|            arr|
+---+---+---------+--------------------+------------+---------------+
| a1| b1|     true|2017-05-16T07:00:...|         2.5|      c1,d1,e1)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|c2,d2,e2,f2,g2)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|            c2)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|         c2,d2)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|      c2,d2,e2)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|c2,d2,e2,k2,f2)|
+---+---+---------+--------------------+------------+---------------+

root
 |-- MId: string (nullable = true)
 |-- PId: string (nullable = true)
 |-- IsTeacher: string (nullable = true)
 |-- STime: string (nullable = true)
 |-- TotalMinutes: string (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---+---+---------+--------------------+------------+--------------------+
|MId|PId|IsTeacher|               STime|TotalMinutes|                 arr|
+---+---+---------+--------------------+------------+--------------------+
| a1| b1|     true|2017-05-16T07:00:...|         2.5|        [c1, d1, e1]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|[c2, d2, e2, f2, g2]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|                [c2]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|            [c2, d2]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|        [c2, d2, e2]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|[c2, d2, e2, k2, f2]|
+---+---+---------+--------------------+------------+--------------------+
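
Note that apart from arr, every column is still a string. To reach the schema the question asks for (boolean, timestamp, double), a follow-up cast would be needed; a minimal sketch, assuming the STime strings parse with Spark's default timestamp cast:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{BooleanType, DoubleType, TimestampType}

// Sketch: cast the remaining string columns to the types from the question.
// If the 7-digit fractional seconds trip up the plain cast in your Spark
// version, to_timestamp with an explicit format is the usual fallback.
val typed = df2
  .withColumn("IsTeacher", col("IsTeacher").cast(BooleanType))
  .withColumn("STime", col("STime").cast(TimestampType))
  .withColumn("TotalMinutes", col("TotalMinutes").cast(DoubleType))
typed.printSchema()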

