How to remove double quotes and extra delimiter(s) within double quotes of a TextQualifier file in Scala
Problem description
I have a lot of delimited files with a text qualifier (every column starts and ends with a double quote). The delimiter is not consistent, i.e. it can be any delimiter such as comma (,), pipe (|), tilde (~), or tab (\t).
I need to read this file with spark.read.textFile (single column) and then remove the text qualifier, along with any delimiter that appears within double quotes (the delimiter needs to be replaced with a space). I want to do this without considering columns, i.e. I should not split the line into columns.
Below is test data with 3 columns: ID, Name and DESC. The DESC column contains an extra delimiter.
val y = """4 , "XAA" , "sf,sd\nsdfsf""""
val pattern = """"[^"]*(?:""[^"]*)*"""".r
val output = pattern replaceAllIn (y, m => m.group(0).replaceAll("[,\n]", " "))
I got the above code, which works fine for a static value, but I am not able to apply it to a DataFrame.
"ID","Name","DESC"
"1" , "ABC", "A,B C"
"2" , "XYZ" , "ABC is bother"
"3" , "YYZ" , "FER" sfsf,sfd f"
4 , "XAA" , "sf,sd sdfsf"
I need the output as:
ID,Name,DESC
1 , ABC , A B C
2 , XYZ , ABC is bother
3 , YYZ , FER" sfsf sfd f
4 , XAA , sf sd sdfsf
Thanks in advance.
Solved
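import org.apache.spark.sql.functions.{col, udf}

// UDF: inside every double-quoted field, replace each comma with a space.
val RemoveQualifier = udf((RawData: String) => {
  val pattern = """"[^"]*(?:""[^"]*)*"""".r
  pattern.replaceAllIn(RawData, m => m.group(0).replaceAll("[,]", " "))
})

// Read each line into the single "value" column, then clean it with the UDF.
val SourceFile = spark.read.textFile("/data/test.csv")
val SourceFileDF = SourceFile.withColumn("value", RemoveQualifier(col("value")))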
Thanks.
Recommended answer
You can chain two replaceAll() calls like this:
val output = pattern replaceAllIn (y, m => m.group(0).replaceAll("[,\\\\n]", " ").replaceAll("\"", ""))
output: String = 4 , XAA , sf sd sdfsf
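The two replacements above can also be folded into one plain function, which is easier to test outside Spark. The following is only a minimal sketch under stated assumptions: the names QualifierCleaner and cleanLine are hypothetical, and the delimiter set (comma, pipe, tilde, tab, newline) is taken from the delimiters listed in the question. It strips the enclosing quotes of each quoted field and replaces delimiters inside it with a space. Note that a row with an unescaped quote inside a field, such as the row with ID 3 in the test data, is not handled by this pattern.

```scala
import java.util.regex.Matcher
import scala.util.matching.Regex

// Hypothetical helper, not part of the original question or answer.
object QualifierCleaner {
  // Matches one double-quoted field; a doubled "" inside the field is
  // treated as an escaped quote and stays part of the same match.
  private val quoted: Regex = """"[^"]*(?:""[^"]*)*"""".r

  // Assumed delimiter set: comma, pipe, tilde, tab and newline.
  private val delimiters = "[,|~\t\n]"

  def cleanLine(line: String): String =
    quoted.replaceAllIn(line, m => {
      val field = m.group(0)
      // Drop the opening and closing quote of the matched field.
      val inner = field.substring(1, field.length - 1)
      // quoteReplacement stops '$' and '\' in the data from being
      // interpreted as group references in the replacement string.
      Matcher.quoteReplacement(inner.replaceAll(delimiters, " "))
    })
}
```

In a Spark shell, this could presumably be applied to the single-column Dataset with spark.read.textFile("/data/test.csv").map(line => QualifierCleaner.cleanLine(line)); outside the shell, import spark.implicits._ is needed so that an encoder for String is in scope.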