Split a file into multiple files based on a string in Spark Scala

Views: 44 Published: 2021/4/8 20:04:15 Tags: scala apache-spark


Question

I have a text file with the data below, which has no particular format:

abc*123     *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~ 
hig*0109*10052200*Rq~
abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~

I want to split the file wherever the string abc appears, so that the output is two files, as below:

File 1:

abc*123     *180109*1005*^*001*0000001*0*T*:~
efg*05*1*X*005010X2A1~
k7*IT 1234*P*234df~ 
hig*0109*10052200*Rq~

File 2:

abc*234*9698*709870*99999*N:~
tng****MI*917937861~
k7*IT 8876*e*278df~
dtp*D8*20171015~

Each file should be named after the IT value on the line that starts with k7, so file 1 should be named IT_1234 and file 2 should be named IT_8876.

Recommended answer

There is a slightly dirty trick that I used in a project:

sc.hadoopConfiguration.set("textinputformat.record.delimiter", "abc")

You can set the record delimiter that your Spark context uses when reading files. So you could do something like this:

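import org.apache.spark.sql.functions.col
import spark.implicits._  // needed for .toDF; assumes a SparkSession named `spark`, as in spark-shell

val delimit = "abc"
// Split the input into records on "abc" instead of on newlines
sc.hadoopConfiguration.set("textinputformat.record.delimiter", delimit)
val df = sc.textFile("your_original_file.txt")
           .map(x => delimit + x)  // put back the delimiter that the reader stripped
           .toDF("delimit_column")
           .filter(col("delimit_column") =!= delimit)  // =!= replaces !==, deprecated since Spark 2.0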

Then you can map each element of your DataFrame (or RDD) to be written to a file.
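That last step could look roughly like the following. This is only a minimal sketch, not part of the original answer: the helper names `fileNameFor` and `writeRecord` and the output directory are hypothetical. It assumes every record contains a line of the form `k7*IT 1234*...`, and that the records are few enough to be collected to the driver and written to the local filesystem.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper: derive a name such as "IT_1234" from the line
// that starts with "k7*". Returns None if the record has no such line.
def fileNameFor(record: String): Option[String] =
  record.split("\n")
    .map(_.trim)
    .find(_.startsWith("k7*"))
    .flatMap { line =>
      line.split('*').drop(1).headOption   // second field, e.g. "IT 1234"
        .map(_.trim.replace(' ', '_'))     // becomes "IT_1234"
    }

// Hypothetical helper: write one record to <outDir>/<name> and return its path.
// Records without a k7 line are skipped (None).
def writeRecord(outDir: Path, record: String): Option[Path] =
  fileNameFor(record).map { name =>
    Files.createDirectories(outDir)
    Files.write(outDir.resolve(name), record.getBytes(StandardCharsets.UTF_8))
  }

// Possible use with the df built above (assumes the data fits on the driver):
//   df.collect().foreach(row => writeRecord(Paths.get("out"), row.getString(0)))
```

Collecting to the driver keeps the sketch simple. For large inputs you would more likely write from the executors to a distributed filesystem instead.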

It's a dirty method, but it might help you!

Have a nice day!

PS: The filter at the end drops the first record, which is empty and so contains nothing but the re-attached delimiter.
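To see where that empty first record comes from, the effect can be reproduced in plain Scala without Spark. This is only an analogy with made-up sample text: `String.split` is not what the Hadoop reader uses internally, but it likewise yields an empty leading piece when the input begins with the delimiter.

```scala
val delimit = "abc"
val text = "abc*123~\nefg*05~\nabc*234~\ntng*MI~\n"

// The input starts with the delimiter, so the first piece is the empty string.
val pieces = text.split(delimit).toList

// Re-attach the delimiter, then drop the piece that is nothing but the delimiter.
val records = pieces.map(delimit + _).filter(_ != delimit)
```

Here `pieces` has three elements, the first of them empty, and `records` keeps only the two that carry data.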
