pyspark read text file with multiline column


Question

I have the following badly formatted txt file:

id;text;contact_id
1;Reason contact\
\
The client was not satisfied about the quality of the product\
\
;c_102932131

I'm trying to load the file using pyspark by using:

df = spark.read\
.option("delimiter", ";")\
.option("header", "true")\
.option("inferSchema", "true")\
.option("multiLine", "true")\
.option("wholeFile", "true")\
.csv(os.path.join(appconfig.configs[appconfig.ENV]["ROOT_DIR"], "data", "input", file_name))

But the column text is truncated, since the dataframe is:

id|text|contact_id
1|Reason contact|null
null|null|c_102932131

So I lose all the other lines. The goal is to read the file correctly, like this:

id|text|contact_id
1|Reason contact The client was not satisfied about the quality of the product|c_102932131

How can I do that? Thanks

Answer

Use .wholeTextFiles, then strip the newlines (\n) and line-continuation backslashes (\), and finally create the dataframe.

Example:

Spark-Scala:

// read the whole file as a single (path, content) record
sc.wholeTextFiles("<file_path>").
  toDF().
  // strip backslashes and newlines, drop the header, then split the record on ";"
  selectExpr("""split(replace(regexp_replace(_2,"[\\\\|\n]",""),"id;text;contact_id",""),";") as new""").
  withColumn("id",col("new")(0)).
  withColumn("text",col("new")(1)).
  withColumn("contact_id",col("new")(2)).
  drop("new").
  show(false)
//+---+---------------------------------------------------------------------------+-----------+
//|id |text                                                                       |contact_id |
//+---+---------------------------------------------------------------------------+-----------+
//|1  |Reason contactThe client was not satisfied about the quality of the product|c_102932131|
//+---+---------------------------------------------------------------------------+-----------+

Pyspark:

from pyspark.sql.functions import *

# read the whole file as a single (path, content) record,
# strip backslashes and newlines, drop the header, then split the record on ";"
sc.wholeTextFiles("<file_path>").\
    toDF().\
    selectExpr("""split(replace(regexp_replace(_2,'[\\\\\\\\|\n]',''),"id;text;contact_id",""),";") as new""").\
    withColumn("id",col("new")[0]).\
    withColumn("text",col("new")[1]).\
    withColumn("contact_id",col("new")[2]).\
    drop("new").\
    show(10,False)
#+---+---------------------------------------------------------------------------+-----------+
#|id |text                                                                       |contact_id |
#+---+---------------------------------------------------------------------------+-----------+
#|1  |Reason contactThe client was not satisfied about the quality of the product|c_102932131|
#+---+---------------------------------------------------------------------------+-----------+
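The transformation the selectExpr performs can be sanity-checked locally with plain Python's re module. This is only an illustrative sketch of the same strip-backslashes, drop-header, split-on-";" steps applied to the sample file content, not part of the Spark job:

```python
import re

# Raw file content, as wholeTextFiles would read it into one record.
raw = (
    "id;text;contact_id\n"
    "1;Reason contact\\\n"
    "\\\n"
    "The client was not satisfied about the quality of the product\\\n"
    "\\\n"
    ";c_102932131\n"
)

# Mirror the Spark expression: remove every backslash and newline,
# drop the header, then split the remaining single record on ";".
cleaned = re.sub(r"[\\\n]", "", raw).replace("id;text;contact_id", "")
record = cleaned.split(";")
print(record)
# → ['1', 'Reason contactThe client was not satisfied about the quality of the product', 'c_102932131']
```

Note that, like the Spark version, this collapses the whole file into a single record, so it only works when the file contains one logical row.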
