How to create a DataFrame from a fixed-length text file given field lengths?
Question
I am reading a fixed-position file. The final result of the read is stored in a string, and I would like to convert that string into a DataFrame for further processing. Kindly help me with this. Below is my code:
Input data:
+---------+----------------------+
|PRGREFNBR|value                 |
+---------+----------------------+
|01       |11 apple     TRUE 0.56|
|02       |12 pear      FALSE1.34|
|03       |13 raspberry TRUE 2.43|
|04       |14 plum      TRUE  .31|
|05       |15 cherry    TRUE 1.4 |
+---------+----------------------+
Data positions: "3,10,5,4"
Expected result, with default headers in the DataFrame:
+-----+-----+----------+-----+-----+
|SeqNo|col_0|     col_1|col_2|col_3|
+-----+-----+----------+-----+-----+
|  01 |  11 |apple     |TRUE | 0.56|
|  02 |  12 |pear      |FALSE| 1.34|
|  03 |  13 |raspberry |TRUE | 2.43|
|  04 |  14 |plum      |TRUE | 1.31|
|  05 |  15 |cherry    |TRUE | 1.4 |
+-----+-----+----------+-----+-----+
Recommended answer
Given the fixed-position file (say input.txt):
11 apple     TRUE 0.56
12 pear      FALSE1.34
13 raspberry TRUE 2.43
14 plum      TRUE 1.31
15 cherry    TRUE 1.4
and the length of every field in the input file as (say lengths):
3,10,5,4
you could create a DataFrame as follows:
// In spark-shell, spark.implicits._ is already in scope.
// In a standalone application, add: import spark.implicits._

// Read the text file as is
// and filter out empty lines
val lines = spark.read.textFile("input.txt").filter(!_.isEmpty)

// Define a helper function to do the split per fixed lengths
// Home exercise: should be part of a case class that describes the schema
def parseLinePerFixedLengths(line: String, lengths: Seq[Int]): Seq[String] = {
  // Carry the unparsed remainder of the line and the fields collected so far
  lengths.indices.foldLeft((line, Array.empty[String])) { case ((rem, fields), idx) =>
    val len = lengths(idx)
    val fld = rem.take(len)
    (rem.drop(len), fields :+ fld)
  }._2
}

// Split the lines using the parseLinePerFixedLengths method
val lengths = Seq(3, 10, 5, 4)
val fields = lines.
  map(parseLinePerFixedLengths(_, lengths)).
  withColumnRenamed("value", "fields") // <-- it'd be unnecessary if a case class were used
scala> fields.show(truncate = false)
+------------------------------+
|fields                        |
+------------------------------+
|[11 , apple     , TRUE , 0.56]|
|[12 , pear      , FALSE, 1.34]|
|[13 , raspberry , TRUE , 2.43]|
|[14 , plum      , TRUE , 1.31]|
|[15 , cherry    , TRUE , 1.4 ]|
+------------------------------+
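As a side note, the splitting logic can be checked without Spark. The sketch below is an alternative to the fold-based helper, not the answer's own code: it turns the lengths into cumulative offsets and slices each line at those offsets. The name splitByLengths and the sample line are assumptions made for illustration.

```scala
// Hypothetical helper (not from the original answer): split one line into
// fields using cumulative offsets derived from the field lengths.
def splitByLengths(line: String, lengths: Seq[Int]): Seq[String] = {
  // scanLeft turns lengths 3,10,5,4 into offsets 0,3,13,18,22
  val offsets = lengths.scanLeft(0)(_ + _)
  offsets.zip(offsets.tail).map { case (start, end) =>
    // slice is bounds-safe: a line that is too short yields shorter or empty fields
    line.slice(start, end)
  }
}

val lengths = Seq(3, 10, 5, 4)

// Build a sample line with the same fixed layout as input.txt
val sample = "%-3s%-10s%-5s%-4s".format("12", "pear", "FALSE", "1.34")

val raw = splitByLengths(sample, lengths) // fields keep their padding
val trimmed = raw.map(_.trim)             // padding removed, if that is wanted
```

Like the original helper, this keeps the padding of each field unless the result is trimmed afterwards.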
A single fields array column like this may be what you already had, so let's unroll/destructure the nested sequence of fields into columns:
val answer = lengths.indices.foldLeft(fields) { case (result, idx) =>
  result.withColumn(s"col_$idx", $"fields".getItem(idx))
}

// drop the unnecessary/interim column
scala> answer.drop("fields").show
+-----+----------+-----+-----+
|col_0|     col_1|col_2|col_3|
+-----+----------+-----+-----+
|  11 |apple     |TRUE | 0.56|
|  12 |pear      |FALSE| 1.34|
|  13 |raspberry |TRUE | 2.43|
|  14 |plum      |TRUE | 1.31|
|  15 |cherry    |TRUE | 1.4 |
+-----+----------+-----+-----+
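If the data is already in a DataFrame with PRGREFNBR and value columns, as in the question, a possible alternative is to cut the value column directly with the built-in substring function and skip the intermediate array column. This is a sketch under assumptions, not part of the original answer: the two-row input DataFrame is a stand-in for the real data, the local SparkSession is only there to make the example self-contained, and the trim call is optional (the output above keeps the padding).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring, trim}

// Local session for the sketch only; in spark-shell `spark` already exists
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("fixed-width-sketch")
  .getOrCreate()
import spark.implicits._

// Stand-in for the question's input DataFrame (two sample rows only)
val layout = "%-3s%-10s%-5s%-4s"
val input = Seq(
  ("01", layout.format("11", "apple", "TRUE", "0.56")),
  ("02", layout.format("12", "pear", "FALSE", "1.34"))
).toDF("PRGREFNBR", "value")

val lengths = Seq(3, 10, 5, 4)

// substring positions are 1-based, so the running total starts at 1
val starts = lengths.scanLeft(1)(_ + _)

// One column expression per field; trim removes the fixed-width padding
val fieldCols = lengths.indices.map { idx =>
  trim(substring(col("value"), starts(idx), lengths(idx))).as(s"col_$idx")
}

// Keep the reference number as SeqNo next to the split fields
val result = input.select(col("PRGREFNBR").as("SeqNo") +: fieldCols: _*)
result.show()
```

Because every field becomes an ordinary column expression, a cast (for example to double for the last field) could be appended per column if typed columns are needed.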
Done!
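The code comments above leave the case class that describes the schema as a home exercise. One possible shape for it, as a rough sketch: the Fruit class, its field names and types, and the parseFruit function are all assumptions for illustration, since the original does not name the columns.

```scala
import scala.util.Try

// Hypothetical schema for the sample data; names and types are assumptions
final case class Fruit(id: Int, name: String, flag: Boolean, price: Double)

val fruitLengths: Seq[Int] = Seq(3, 10, 5, 4)

// Returns None when a field is missing or fails to convert
def parseFruit(line: String): Option[Fruit] = {
  val offsets = fruitLengths.scanLeft(0)(_ + _)
  val fields = offsets.zip(offsets.tail).map { case (start, end) =>
    line.slice(start, end).trim
  }
  Try(Fruit(fields(0).toInt, fields(1), fields(2).toBoolean, fields(3).toDouble)).toOption
}

// Sample line built with the same fixed layout as input.txt
val line = "%-3s%-10s%-5s%-4s".format("14", "plum", "TRUE", "1.31")
val parsed = parseFruit(line)
```

In Spark, mapping the lines through such a function (for example lines.flatMap(l => parseFruit(l).toSeq), with spark.implicits._ in scope) should give a Dataset[Fruit] whose columns are named after the case class fields, which is why the withColumnRenamed step would no longer be needed.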