How to skip the first and last line of a dat file and make it into a dataframe using scala in databricks
Question
H|*|D|*|PA|*|BJ|*|S|*|2019.05.27 08:54:24|##|
H|*|AP_ATTR_ID|*|AP_ID|*|OPER_ID|*|ATTR_ID|*|ATTR_GROUP|*|LST_UPD_USR|*|LST_UPD_TSTMP|##|
779045|*|Sar|*|SUPERVISOR HIERARCHY|*|Supervisor|*|2|*|128|*|2019.05.14 16:48:16|##|
779048|*|KK|*|SUPERVISOR HIERARCHY|*|Supervisor|*|2|*|116|*|2019.05.14 16:59:02|##|
779054|*|Nisha - A|*|EXACT|*|CustomColumnRow120|*|2|*|1165|*|2019.05.15 12:11:48|##|
T|*||*|2019.05.27 08:54:28|##|
The file name is PA.dat.
I need to skip the first line and also the last line of the file. The second line of the file holds the column names. Now I need to make a DataFrame with those column names, skipping the first and last lines, using Scala.
N.B. - the leading 'H' on the second line also needs to be skipped, as it is not part of the column names.
Please help me.
Recommended answer
Something like this. I don't know whether sql.functions can split an array into columns, so I did it using an RDD.
import java.util.regex.Pattern
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.functions._
import spark.implicits._ // for the 'colName syntax; already in scope in a Databricks notebook

// read the file as plain text, one row per line, and number the lines
// (note: monotonically_increasing_id is only consecutive within a single partition)
val data = spark.read
  .text("data/PA.dat")
  .toDF("val")
  .withColumn("id", monotonically_increasing_id())

val count = data.count()

// the second line (id 1) holds the column names
val header = data.where('id === 1).collect().map(s => s.getString(0)).apply(0)
val columns = header
  .replace("H|*|", "") // drop the record-type marker
  .replace("|##|", "") // drop the line terminator
  .replace("|*|", ",")
  .split(",")

// "|*|" and "|##|" are made of regex metacharacters, so quote them for regexp_replace
val columnDelimiter = Pattern.quote("|*|")
val lineTerminator = Pattern.quote("|##|")

// keep only the data rows (skip the two header lines and the trailer),
// strip the trailing |##| so it does not stick to the last field,
// and normalize the delimiter to a comma
val correctData = data.where('id > 1 && 'id < count - 1)
  .select(regexp_replace(regexp_replace('val, lineTerminator, ""), columnDelimiter, ",").as("val"))

// split each CSV-normalized line into a Row of string fields
val splitIntoCols = correctData.rdd.map(s => {
  val arr = s.getString(0).split(",")
  RowFactory.create(arr: _*)
})

val struct = StructType(columns.map(s => StructField(s, StringType, true)))
val finalDF = spark.createDataFrame(splitIntoCols, struct)
finalDF.show()
+----------+---------+--------------------+------------------+----------+-----------+--------------------+
|AP_ATTR_ID| AP_ID| OPER_ID| ATTR_ID|ATTR_GROUP|LST_UPD_USR| LST_UPD_TSTMP|
+----------+---------+--------------------+------------------+----------+-----------+--------------------+
| 779045| Sar|SUPERVISOR HIERARCHY| Supervisor| 2| 128|2019.05.14 16:48:...|
| 779048| KK|SUPERVISOR HIERARCHY| Supervisor| 2| 116|2019.05.14 16:59:...|
| 779054|Nisha - A| EXACT|CustomColumnRow120| 2| 1165|2019.05.15 12:11:...|
+----------+---------+--------------------+------------------+----------+-----------+--------------------+
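The string handling above can be checked without a Spark cluster. The key subtlety is that `String.replace` takes literal strings, while `regexp_replace` (like `replaceAll` below) takes a regex, and `|*|` is built entirely of regex metacharacters, hence `Pattern.quote`. A minimal plain-Scala sketch on the header and one data row taken from the question:

```scala
import java.util.regex.Pattern

// second and third lines of PA.dat, copied from the question
val header = "H|*|AP_ATTR_ID|*|AP_ID|*|OPER_ID|*|ATTR_ID|*|ATTR_GROUP|*|LST_UPD_USR|*|LST_UPD_TSTMP|##|"
val row = "779045|*|Sar|*|SUPERVISOR HIERARCHY|*|Supervisor|*|2|*|128|*|2019.05.14 16:48:16|##|"

// String.replace takes literals, so no regex quoting is needed here
val columns = header
  .replace("H|*|", "") // drop the record-type marker
  .replace("|##|", "") // drop the line terminator
  .replace("|*|", ",")
  .split(",")
println(columns.mkString(", "))

// replaceAll takes a regex; an unquoted "|*|" would be a malformed pattern
val columnDelimiter = Pattern.quote("|*|")
val fields = row.replace("|##|", "").replaceAll(columnDelimiter, ",").split(",")
println(fields.mkString(", "))
```

Both lines parse into exactly seven values, matching the seven-column schema in the output above.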