Does spark-sql support multiple delimiters in the input data?
Question
I have input data with multiple single-character delimiters, as follows:
col1data1"col2data1;col3data1"col4data1
col1data2"col2data2;col3data2"col4data2
col1data3"col2data3;col3data3"col4data3
In the above data, ["] and [;] are my delimiters.
Is there any way in sparkSQL to convert the input data (which is in a file) directly into a table with column names col1, col2, col3, col4?
Answer
The answer is no: spark-sql does not support multiple delimiters. One way around this is to read the file into an RDD and then parse it with regular string-splitting methods:
val rdd: RDD[String] = ???
val s = rdd.first()
// res1: String = "This is one example. This is another"
Let's say you want to split on spaces and periods. We can apply our function to the value s as follows:
s.split(" |\\.")
// res2: Array[String] = Array(This, is, one, example, "", This, is, another)
Now we can apply the function to the whole RDD:
rdd.map(_.split(" |\\."))
An example with your data:
scala> val s = "col1data1\"col2data1;col3data1\"col4data1"
scala> s.split(";|\"")
res4: Array[String] = Array(col1data1, col2data1, col3data1, col4data1)
More on string splitting:
- A Scala split String example.
- How to split String in Scala but keep the part matching the regular expression?
Just remember that anything you can apply to a regular data type, you can apply across a whole RDD; then all you have to do is convert your RDD to a DataFrame.
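To make that last step concrete, here is a minimal end-to-end sketch: it reads the raw file, splits each line on either delimiter, and names the four columns. It assumes a running SparkSession; the input path and app name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: assumes Spark is on the classpath and a cluster/local master is configured.
val spark = SparkSession.builder().appName("multi-delimiter").getOrCreate()
import spark.implicits._

val df = spark.sparkContext
  .textFile("/path/to/input.txt")            // hypothetical input path
  .map(_.split(";|\""))                      // split on either ; or "
  .collect { case Array(c1, c2, c3, c4) => (c1, c2, c3, c4) }  // keep well-formed rows
  .toDF("col1", "col2", "col3", "col4")

df.show()
```

Each line such as `col1data1"col2data1;col3data1"col4data1` then becomes one row with four named columns, and the result can be registered as a table for SQL queries.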