How to remove double quotes and extra delimiter(s) within double quotes of a TextQualifier file in Scala

Problem description

I have a lot of delimited files with a text qualifier (every column starts and ends with a double quote). The delimiter is not consistent, i.e. it can be any delimiter, such as comma (,), pipe (|), tilde (~), or tab (\t).

I need to read each file with spark.read.textFile (as a single column) and then remove the text qualifier, along with any delimiter that appears within the double quotes (the delimiter needs to be replaced with a space). I want to do this without considering columns, i.e. I should not split the line into columns.

Below is test data with 3 columns: ID, Name and DESC. The DESC column contains extra delimiters.

val y = """4 , "XAA" , "sf,sd\nsdfsf""""
val pattern = """"[^"]*(?:""[^"]*)*"""".r
val output = pattern replaceAllIn (y, m => m.group(0).replaceAll("[,\n]", " "))

I have the code above, which works fine for a static value, but I am not able to apply it to a DataFrame.

"ID","Name","DESC"
"1" , "ABC", "A,B C"
"2" , "XYZ" , "ABC is bother"
"3" , "YYZ" , "FER" sfsf,sfd f"
4 , "XAA" , "sf,sd sdfsf"

I need the output as:

ID,Name,DESC
1 , ABC , A B C
2 , XYZ , ABC is bother
3 , YYZ , FER" sfsf sfd f
4 , XAA , sf sd sdfsf

Thanks in advance.

Solved

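import org.apache.spark.sql.functions.{col, udf}

// UDF that replaces commas inside double-quoted fields with a space.
// It is defined before use so that it is in scope when the DataFrame is built.
val RemoveQualifier = udf { (RawData: String) =>
  // Matches one double-quoted field, allowing "" as an escaped quote inside it.
  val pattern = """"[^"]*(?:""[^"]*)*"""".r
  pattern.replaceAllIn(RawData, m => m.group(0).replaceAll("[,]", " "))
}

// Read every line into a single "value" column, then clean that column.
val SourceFile = spark.read.textFile("/data/test.csv")
val SourceFileDF = SourceFile.withColumn("value", RemoveQualifier(col("value")))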

Thanks.

Recommended answer

You can chain two replaceAll() calls like this:

val output = pattern replaceAllIn (y, m => m.group(0).replaceAll("[,\\\\n]", " ").replaceAll("\"|\"", ""))

output: String = 4 , XAA , sf sd sdfsf
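
As a minimal sketch of the same idea outside Spark, the per-line cleanup could be written as a plain function that handles several delimiters at once. The names QualifierCleaner and stripQualifiers, and the delimiter set (comma, pipe, tilde, tab, newline), are assumptions for illustration and do not come from the original post. Regex.quoteReplacement is used so that any `$` or `\` in the data is not interpreted as a replacement escape by replaceAllIn.

```scala
import scala.util.matching.Regex

object QualifierCleaner {
  // One double-quoted field: opening quote, non-quote characters
  // (with "" allowed as an escaped quote), closing quote.
  private val quotedField: Regex = """"[^"]*(?:""[^"]*)*"""".r

  // Assumed delimiter set to blank out inside quoted fields:
  // comma, pipe, tilde, tab and newline.
  private val delimiters = "[,|~\t\n]"

  // Hypothetical helper: removes the enclosing quotes of every quoted field
  // and replaces delimiters found inside the field with a single space.
  def stripQualifiers(line: String): String =
    quotedField.replaceAllIn(line, m => {
      val field = m.matched
      val inner = field.substring(1, field.length - 1) // drop enclosing quotes
      // Quote the result so '$' and '\' in the data are taken literally.
      Regex.quoteReplacement(inner.replaceAll(delimiters, " "))
    })
}
```

For example, the line `"1","ABC","A,B C"` becomes `1,ABC,A B C`. In Spark, a function like this could presumably be wrapped with udf and applied to the value column. Note that a stray unescaped quote inside a field, as in the row with ID 3, is not matched as intended by this pattern and would need separate handling.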
