How to load the csv file into the Spark DataFrame with Array[Int]


Problem description

Every row in my csv file is structured like this:

u001, 2013-11, 0, 1, 2, ... , 99

in which u001 and 2013-11 are the UID and date, and the numbers from 0 to 99 are the data values. I want to load this csv file into a Spark DataFrame with this structure:

+-------+-------------+-----------------+
|    uid|         date|       dataVector|
+-------+-------------+-----------------+
|   u001|      2013-11|  [0,1,...,98,99]|
|   u002|      2013-11| [1,2,...,99,100]|
+-------+-------------+-----------------+

root
 |-- uid: string (nullable = true)
 |-- date: string (nullable = true)
 |-- dataVector: array (nullable = true)
 |    |-- element: integer (containsNull = true)

in which dataVector is Array[Int], and the dataVector length is the same for every UID and date. I have tried several ways to solve this, including

  1. Using a schema

import org.apache.spark.sql.types._

// uid and date as strings, dataVector as an array of integers
val attributes = Array("uid", "date", "dataVector")
val schema = StructType(
  StructField(attributes(0), StringType, true) ::
  StructField(attributes(1), StringType, true) ::
  StructField(attributes(2), ArrayType(IntegerType), true) ::
  Nil)

But this way didn't work well. Since my later dataset has more than 100 data columns, I also think it is inconvenient to create the schema covering every column of dataVector by hand.
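For reference, if a full per-column schema were really needed, it could be generated programmatically instead of being written out by hand. This is only a sketch: the column names _c2 through _c101 mirror Spark's default CSV naming and are assumptions, and this alone still does not produce an Array[Int] column.

import org.apache.spark.sql.types._

// Hypothetical: uid and date as strings, followed by 100 integer data columns.
val dataFields = (2 to 101).map(i => StructField(s"_c$i", IntegerType, true))
val fullSchema = StructType(
  StructField("_c0", StringType, true) +:
  StructField("_c1", StringType, true) +:
  dataFields
)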

  2. Loading the csv file directly without a schema, and using the approach from "concatenate multiple columns into single columns" to combine the data columns into one; but the resulting schema looks like the following (a rough sketch of this attempt appears after the schema):

 root
  |-- uid: string (nullable = true)
  |-- date: string (nullable = true)
  |-- dataVector: struct (nullable = true)
  |    |-- _c3: string (containsNull = true)
  |    |-- _c4: string (containsNull = true)
  .
  .
  .
  |    |-- _c101: string (containsNull = true)
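For reference, a rough sketch of what that second attempt might look like, assuming the linked approach combines the raw columns with struct(); the variable names are illustrative:

import org.apache.spark.sql.functions.{col, struct}

// Combining the raw string columns with struct() yields a struct of strings,
// which matches the schema shown above rather than the desired array<int>.
val raw = spark.read.csv(path)
val combined = raw.select(
  col("_c0").as("uid"),
  col("_c1").as("date"),
  struct(raw.columns.drop(2).map(col): _*).as("dataVector")
)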

This is still different from what I need, and I didn't find a way to convert this struct into what I need. So my question is: how can I load the csv file into the structure I need?

Recommended answer

Load it without any additions

val df = spark.read.csv(path)

and select:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

// Combine data into array
val dataVector: Column = array(
  df.columns.drop(2).map(col): _*  // Skip first 2 columns
).cast("array<int>")  // Cast to the required type
val cols: Array[Column] = df.columns.take(2).map(col) :+ dataVector

df.select(cols: _*).toDF("uid", "date", "dataVector")
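A quick way to sanity-check this is to run the same steps on a small in-memory DataFrame. This is just a sketch; the sample rows, the 4-element data vector, and the variable names are assumptions made for brevity:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

// Assumes a SparkSession named `spark` is in scope, as in the snippets above.
import spark.implicits._

// Toy frame standing in for spark.read.csv(path): two id columns plus data columns.
val raw = Seq(
  ("u001", "2013-11", "0", "1", "2", "3"),
  ("u002", "2013-11", "1", "2", "3", "4")
).toDF()  // default column names _1, _2, _3, ...

val dataVector: Column = array(raw.columns.drop(2).map(col): _*).cast("array<int>")
val cols: Array[Column] = raw.columns.take(2).map(col) :+ dataVector

val result = raw.select(cols: _*).toDF("uid", "date", "dataVector")
result.printSchema()  // dataVector should show up as an array of integers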

