How to split a text file into multiple columns with Spark
Question
I'm having difficulty on splitting a text data file with delimiter '|' into data frame columns. My loaded data file looks like this:
results1.show()
+--------------------+
| all|
+--------------------+
|DEPT_NO|ART_GRP_N...|
|29|102|354814|SKO...|
|29|102|342677|SKO...|
|29|102|334634|DUR...|
|29|102|319337|SKO...|
|29|102|316731|DUR...|
|29|102|316728|DUR...|
|29|102|316702|DUR...|
|29|102|316702|DUR...|
|29|102|276728|I-P...|
I have tried the following 2 approaches found on previous posts:
results1.select(
  expr("(split(all, '|'))[1]").cast("integer").as("DEPT_NO"),
  expr("(split(all, '|'))[4]").cast("integer").as("ART_GRP_NO"),
  expr("(split(all, '|'))[8]").cast("string").as("ART_NO")
).show
+-------+----------+------+
|DEPT_NO|ART_GRP_NO|ART_NO|
+-------+----------+------+
| null| null| ||
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
| 2| 1| 2|
and
val dataframe10 = sc.textFile(("D:/data/dnr10.txt")
  .toString())
  .map(_.split("|"))
  .map(c => {(c(1), c(2), c(3), c(4))})
  .toDF()
  .show()
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| D| E| P| T|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 9| || 1|
| 2| 7| || 4|
It looks like the delimiter is not recognized, because the splitting happens after each character rather than at every '|'. Can somebody please give me a hint on how to get a correct split in this case?
Answer
Using the RDD API: your mistake is that String.split expects a regular expression, in which the pipe ("|") is a special character meaning "OR", so it splits on everything. Also, you should start from index 0 when converting the array into a tuple.
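The pitfall can be reproduced in plain Scala, without Spark at all, since String.split comes from java.lang.String. This sketch uses a made-up sample row; Pattern.quote is an alternative to escaping by hand:

```scala
import java.util.regex.Pattern

object SplitDemo extends App {
  val row = "29|102|SKO"

  // Unescaped: "|" is the regex alternation operator, so the pattern
  // matches the empty string at every position and the split yields
  // one-character tokens (on Java 8+; older Java also adds a leading "").
  val wrong = row.split("|")
  println(wrong.mkString(","))   // 2,9,|,1,0,2,|,S,K,O

  // Escaped: "\\|" matches a literal pipe character.
  val right = row.split("\\|")
  println(right.mkString(","))   // 29,102,SKO

  // Pattern.quote builds the escaped pattern for you.
  val quoted = row.split(Pattern.quote("|"))
  assert(right.sameElements(quoted))
}
```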
The fix is simple - escape that character:
sc.textFile("D:/data/dnr10.txt")
  .map(_.split("\\|"))
  .map(c => (c(0), c(1), c(2), c(3)))
  .toDF()
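Note that the file's header line (the DEPT_NO|ART_GRP_N... row visible in the first show() output, which produced the nulls earlier) still comes through as an ordinary record. One way to drop it, sketched here under the assumption that the header line starts with "DEPT_NO", is to filter before splitting:

```scala
// Sketch: drop the header record before splitting.
// Assumes the header line starts with "DEPT_NO", as in the show() output above.
sc.textFile("D:/data/dnr10.txt")
  .filter(line => !line.startsWith("DEPT_NO"))
  .map(_.split("\\|"))
  .map(c => (c(0), c(1), c(2), c(3)))
  .toDF()
```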
Using Dataframe API: the same issue with escaping the pipe applies here. Plus you can simplify the code by splitting once and using that split column multiple times when selecting the columns:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
results1.withColumn("split", split($"all", "\\|")).select(
  $"split" getItem 0 cast IntegerType as "DEPT_NO",
  $"split" getItem 3 cast IntegerType as "ART_GRP_NO",
  $"split" getItem 7 as "ART_NO"
)
Using Spark 2.0 built-in CSV support: if you're using Spark 2.0+, you can let the framework do all the hard work for you - use format "csv" and set the delimiter to be the pipe character:
val result = sqlContext.read
  .option("header", "true")
  .option("delimiter", "|")
  .option("inferSchema", "true")
  .format("csv")
  .load("D:/data/dnr10.txt")
result.show()
// +-------+----------+------+---+
// |DEPT_NO|ART_GRP_NO|ART_NO| TT|
// +-------+----------+------+---+
// | 29| 102|354814|SKO|
// | 29| 102|342677|SKO|
// | 29| 102|334634|DUR|
// | 29| 102|276728|I-P|
// +-------+----------+------+---+
result.printSchema()
// root
// |-- DEPT_NO: integer (nullable = true)
// |-- ART_GRP_NO: integer (nullable = true)
// |-- ART_NO: integer (nullable = true)
// |-- TT: string (nullable = true)
You'll get the column names, the right types - everything... :)
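On Spark 2.x the same read is usually written against the SparkSession entry point, with csv(path) as a shorthand for format("csv").load(path). A sketch; spark here is the session that spark-shell provides:

```scala
// Equivalent Spark 2.x form using the SparkSession ("spark" in spark-shell).
val result = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .option("inferSchema", "true")
  .csv("D:/data/dnr10.txt")
```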