How to write standard CSV


Problem description

It is very simple to read a standard CSV file, for example:

 val t = spark.read.format("csv")
 .option("inferSchema", "true")
 .option("header", "true")
 .load("file:///home/xyz/user/t.csv")

It reads a real CSV file, something like:

   fieldName1,fieldName2,fieldName3
   aaa,bbb,ccc
   zzz,yyy,xxx

t.show produced the expected result.

I need the inverse: to write a standard CSV file (not a directory of non-standard files).

It is very frustrating not to get the inverse result when write is used. Maybe some other option, or some kind of format("REAL csv please!"), exists.

I am using Spark v2.2 and running tests on spark-shell.

The "syntactical inverse" of read is write, so it is expected to produce the same file format. But the result of

   t.write.format("csv").option("header", "true").save("file:///home/xyz/user/t-writed.csv")

is not a CSV file in the RFC 4180 standard format like the original t.csv, but a t-writed.csv/ folder containing the file part-00000-66b020ca-2a16-41d9-ae0a-a6a8144c7dbc-c000.csv.deflate plus _SUCCESS, which looks like "parquet", "ORC" or some other format.
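
A side note on the .deflate suffix: that part file is still CSV text, just compressed with the deflate codec, not Parquet or ORC. The csv writer's compression option can switch the compression off; a small sketch:

    // Write uncompressed part files so they end in plain .csv.
    t.write.format("csv")
      .option("header", "true")
      .option("compression", "none")   // no .deflate suffix on the part file
      .save("file:///home/xyz/user/t-writed.csv")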

Any language with a complete toolkit for things that "read something" is able to "write the something"; it is a kind of orthogonality principle.

Similar questions, and links that did not solve the problem, perhaps because they used an incompatible Spark version, or perhaps because of a spark-shell limitation. They have good clues for experts:

  • This similar question pointed out by @JochemKuijpers: I tried the suggestion but got the same ugly result.

  • This link says that there is a solution (!), but I can't copy/paste saveDfToCsv() into my spark-shell ("error: not found: type DataFrame"); any clue?
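
For what it's worth, the "not found: type DataFrame" error in spark-shell usually just means the type alias has not been imported; it lives in org.apache.spark.sql. A hypothetical minimal helper (my sketch, not the linked answer's code) that compiles in the shell once the import is in place:

    // The DataFrame type alias lives in org.apache.spark.sql and must be
    // imported before pasting any helper that mentions it in a signature.
    import org.apache.spark.sql.DataFrame

    // Hypothetical minimal helper: write the frame as CSV with a header
    // into the given directory.
    def saveDfToCsv(df: DataFrame, path: String): Unit =
      df.write.option("header", "true").csv(path)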

Recommended answer

If you're using Spark because you're working with "big"* datasets, you probably don't want to do anything like coalesce(1) or toPandas(), since that will most likely crash your driver (the whole dataset has to fit into the driver's RAM, which it usually does not).
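
For completeness, here is a minimal sketch of the coalesce(1) route this answer warns about, for data known to be small; the output paths and the rename step are my own placeholders, not anything Spark does for you:

    // Small data only: force a single partition so Spark writes one part file,
    // then rename that part file to a plain CSV path via the Hadoop FileSystem API.
    import org.apache.hadoop.fs.Path

    val outDir = "file:///home/xyz/user/t-single"   // placeholder output directory
    t.coalesce(1)
      .write.format("csv")
      .option("header", "true")
      .save(outDir)

    // Spark still writes a directory; pick out the lone part file and rename it.
    val fs = new Path(outDir).getFileSystem(spark.sparkContext.hadoopConfiguration)
    val part = fs.globStatus(new Path(outDir + "/part-*"))(0).getPath
    fs.rename(part, new Path("file:///home/xyz/user/t-single.csv"))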

On the other hand: if your data does fit into the RAM of a single machine, why are you torturing yourself with distributed computing?

*Definitions vary. My personal one is "does not fit in an Excel sheet".
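
Following that footnote, a sketch of the single-machine route (my illustration, not part of the original answer): collect the rows to the driver and write one RFC 4180-style file with plain Java IO. The quoting below handles commas, quotes and newlines, and nothing more exotic:

    // Collect to the driver (small data only) and write a single CSV file.
    import java.io.PrintWriter

    // Quote a field per RFC 4180: wrap in double quotes if it contains a
    // comma, quote or newline, doubling any embedded quotes.
    def quote(s: String): String =
      if (s.exists(c => c == ',' || c == '"' || c == '\n'))
        "\"" + s.replace("\"", "\"\"") + "\""
      else s

    val pw = new PrintWriter("/home/xyz/user/t-local.csv")  // placeholder path
    pw.println(t.columns.map(quote).mkString(","))
    t.collect().foreach { row =>
      pw.println(row.toSeq.map {
        case null => ""
        case v    => quote(v.toString)
      }.mkString(","))
    }
    pw.close()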
