How to load the csv file into the Spark DataFrame with Array[Int]


Problem description

Every row in my csv file is structured like this:

u001, 2013-11, 0, 1, 2, ... , 99

in which u001 and 2013-11 are the UID and date, and the numbers from 0 to 99 are the data values. I want to load this csv file into a Spark DataFrame with this structure:

+-------+-------------+-----------------+
|    uid|         date|       dataVector|
+-------+-------------+-----------------+
|   u001|      2013-11|  [0,1,...,98,99]|
|   u002|      2013-11| [1,2,...,99,100]|
+-------+-------------+-----------------+

root
 |-- uid: string (nullable = true)
 |-- date: string (nullable = true)
 |-- dataVector: array (nullable = true)
 |    |-- element: integer (containsNull = true)

in which dataVector is Array[Int], and the dataVector length is the same for every UID and date. I have tried several ways to solve this, including

  1. Using a schema

import org.apache.spark.sql.types._

// uid and date as strings, dataVector as an array of integers
val attributes = Array("uid", "date", "dataVector")
val schema = StructType(
  StructField(attributes(0), StringType, true) ::
  StructField(attributes(1), StringType, true) ::
  StructField(attributes(2), ArrayType(IntegerType), true) ::
  Nil)

But this way didn't work well. Since my later dataset has more than 100 data columns, I also think it is inconvenient to create the schema covering every column of dataVector by hand.
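For reference, if a full per-column schema were really needed, it could be generated programmatically instead of being written out by hand. This is only a sketch: the column names _c2 through _c101 mirror Spark's default CSV naming and are assumptions, and this alone still does not produce an Array[Int] column.

import org.apache.spark.sql.types._

// Hypothetical: uid and date as strings, followed by 100 integer data columns.
val dataFields = (2 to 101).map(i => StructField(s"_c$i", IntegerType, true))
val fullSchema = StructType(
  StructField("_c0", StringType, true) +:
  StructField("_c1", StringType, true) +:
  dataFields
)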

  2. Loading the csv file directly without a schema, and using the approach from "concatenate multiple columns into single columns" to combine the data columns into one; but the resulting schema looks like the following (a rough sketch of this attempt appears after the schema):

 root
  |-- uid: string (nullable = true)
  |-- date: string (nullable = true)
  |-- dataVector: struct (nullable = true)
  |    |-- _c3: string (containsNull = true)
  |    |-- _c4: string (containsNull = true)
  .
  .
  .
  |    |-- _c101: string (containsNull = true)
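For reference, a rough sketch of what that second attempt might look like, assuming the linked approach combines the raw columns with struct(); the variable names are illustrative:

import org.apache.spark.sql.functions.{col, struct}

// Combining the raw string columns with struct() yields a struct of strings,
// which matches the schema shown above rather than the desired array<int>.
val raw = spark.read.csv(path)
val combined = raw.select(
  col("_c0").as("uid"),
  col("_c1").as("date"),
  struct(raw.columns.drop(2).map(col): _*).as("dataVector")
)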

This is still different from what I need, and I didn't find a way to convert this struct into what I need. So my question is: how can I load the csv file into the structure I need?

Recommended answer

Load it without any additions

val df = spark.read.csv(path)

and select:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

// Combine data into array
val dataVector: Column = array(
  df.columns.drop(2).map(col): _*  // Skip first 2 columns
).cast("array<int>")  // Cast to the required type
val cols: Array[Column] = df.columns.take(2).map(col) :+ dataVector

df.select(cols: _*).toDF("uid", "date", "dataVector")
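A quick way to sanity-check this is to run the same steps on a small in-memory DataFrame. This is just a sketch; the sample rows, the 4-element data vector, and the variable names are assumptions made for brevity:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

// Assumes a SparkSession named `spark` is in scope, as in the snippets above.
import spark.implicits._

// Toy frame standing in for spark.read.csv(path): two id columns plus data columns.
val raw = Seq(
  ("u001", "2013-11", "0", "1", "2", "3"),
  ("u002", "2013-11", "1", "2", "3", "4")
).toDF()  // default column names _1, _2, _3, ...

val dataVector: Column = array(raw.columns.drop(2).map(col): _*).cast("array<int>")
val cols: Array[Column] = raw.columns.take(2).map(col) :+ dataVector

val result = raw.select(cols: _*).toDF("uid", "date", "dataVector")
result.printSchema()  // dataVector should show up as an array of integers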

