lineSep option with Spark 2.4

Question

Does the lineSep option really work with Spark 2.4?

lineSep (default covers all \r, \r\n and \n): defines the line separator that should be used for parsing. Maximum length is 1 character.

I am writing a DataFrame to a GCS bucket location, but it is always written with '\n' as the line separator.

df
  .select("COLLECTTIME","SITE","NETWORK")
  .coalesce(1)
  .limit(10)
  .write
  .option("header", false)
  .option("compression", "gzip")
  .option("lineSep","\r\n")
  .csv(tmpOutput)

I am looking for CRLF at the end of each line.

I tried the following as well, but it did not work:

df2.withColumn(df2.columns.last,concat(col(df2.columns.last),lit("\r")))
  .write
  .option("header", false)
  .option("compression", "gzip")
  .csv(tmpOutput)

I also tried the following, but no luck:

import org.apache.spark.sql.functions._
df2.withColumn(df2.columns.last,regexp_replace(col(df2.columns.last),"[\\r]","[\\r\\n]"))
  .write
  .option("header", false)
  .option("compression", "gzip")
  .csv(tmpOutput)

Now I am thinking of reading the file back from GCS once it is written, going through it line by line, and appending '\r' to the end of each record. Isn't something short and simple available in Spark 2.4? I just need 'CRLF' at the end of each record.

Reading and updating in place is also not possible, since objects stored in GCS buckets are immutable. I cannot keep the files in a buffer either, since they are somewhat large as well.

Answer

I am very sorry, but AFAIK Spark honors the different separators cited in your question:

lineSep (default covers all \r, \r\n and \n): defines the line separator that should be used for parsing. Maximum length is 1 character.

only for reading, not for writing; in the latter case, '\n' is either hardcoded or, depending on whether the Spark version is 2.4 or 3.0, you can choose a custom line separator, but it is limited to a single character.

Please consider reading this GitHub issue; it provides the whole background on the problem. This other one could be helpful as well.
