How to Save a file with multiple delimiter in spark
Problem Description
I need to save a file delimited by "|~" characters, but I get an error when I execute the command below. Can I save a file using multiple delimiters in Spark?
mydf1.coalesce(1).write.option("compression","none").format("csv").mode("Overwrite").option("delimiter","|~").save("my_hdfs_path")
# Error: pyspark.sql.utils.IllegalArgumentException: u'Delimiter cannot be more than one character: |~'
Recommended Answer
AFAIK, we are still waiting for an "official" solution, because the issue "support for multiple delimiter in Spark CSV read" is still open, and SparkSQL still relies on univocity-parsers. In univocity's CSV settings, the delimiter can only be a single character, which constrains both the parser (reader) and the generator (writer).
Workarounds
Finding a universally fast and safe way to write CSV is hard. But depending on your data size and the complexity of the CSV contents (date format? currency? quoting?), we may find a shortcut. The following are just some, hopefully inspiring, thoughts...
Write to CSV with a special character (say ⊢), then substitute it with |~.
(not benchmarked yet, but IMO it is very likely to be the fastest)
df.coalesce(1).write.option("compression","none").option("delimiter", "⊢").mode("overwrite").csv("raw-output")
then post-process (ideally locally) with, say, sed:
sed -i '.bak' 's/⊢/\|~/g' raw-output/*.csv
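If sed portability is a concern (the -i flag behaves differently on GNU vs. BSD sed), the same substitution can be sketched in plain Python. This is a hypothetical helper, not part of the original answer; it streams line by line, so memory use stays flat even for large part files.

```python
def substitute_delimiter(lines, old="⊢", new="|~"):
    """Yield each CSV line with the placeholder delimiter replaced."""
    for line in lines:
        yield line.replace(old, new)

# works on any iterable of lines, e.g. an open file handle
rows = ['"Alice"⊢1\n', '"Bob"⊢3\n']
print(list(substitute_delimiter(rows)))  # ['"Alice"|~1\n', '"Bob"|~3\n']
```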
Within PySpark, concatenate each row into a string, then write as a text file.
(flexible enough to deal with locales and special needs -- with a bit more work)
d = [{'name': 'Alice', 'age': 1},{'name':'Bob', 'age':3}]
df = spark.createDataFrame(d, "name:string, age:int")
df.show()
#+-----+---+
#| name|age|
#+-----+---+
#|Alice| 1|
#| Bob| 3|
#+-----+---+
from pyspark.sql.functions import udf

@udf
def mkstr(name, age):
    """
    Unparse one row; for example, the string field {name} is quoted with `"`.
    """
    return '"{name}"|~{age}'.format(name=name, age=age)
# unparse a CSV row back to a string
df_unparsed = df.select(mkstr("name", "age").alias("csv_row"))
df_unparsed.show()
#+----------+
#| csv_row|
#+----------+
#|"Alice"|~1|
#| "Bob"|~3|
#+----------+
df_unparsed.coalesce(1).write.option("compression", "none").mode("overwrite").text("output")
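The mkstr above is deliberately minimal. If field values can themselves contain `"` or the delimiter, you would want proper escaping. Below is a hedged pure-Python sketch of a more defensive formatter (the helper names and the quoting rule -- doubling embedded quotes, mirroring standard CSV -- are my assumptions, not part of the original answer); the same logic could be dropped into the udf body.

```python
def fmt_field(value):
    """Quote a string field, doubling embedded quotes (standard CSV-style escaping)."""
    if isinstance(value, str):
        return '"{}"'.format(value.replace('"', '""'))
    return str(value)

def mk_row(*fields, delim="|~"):
    """Join already-formatted fields with the multi-character delimiter."""
    return delim.join(fmt_field(f) for f in fields)

print(mk_row('Ali"ce', 1))  # "Ali""ce"|~1
```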
numpy.savetxt allows multiple characters as a delimiter, so ...
(numpy has lots of builtins if you care about the precision of floating-point numbers)
import pandas as pd
import numpy as np
# convert `Spark.DataFrame` to `Pandas.DataFrame`
df_pd = df.toPandas()
# use `numpy.savetxt` to save `Pandas.DataFrame`
np.savetxt("a-long-day.csv", df_pd, delimiter="|~", fmt="%s")
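For completeness, Spark cannot parse a two-character delimiter on the read side either, so a file written this way has to be split manually when read back. A minimal sketch, assuming no field contains a literal |~ (quoted fields keep their quotes; real unquoting would need more work):

```python
def parse_row(line, delim="|~"):
    """Split one text line into fields on the multi-character delimiter."""
    return line.rstrip("\n").split(delim)

print(parse_row('"Alice"|~1\n'))  # ['"Alice"', '1']
```

In Spark this could be applied after `spark.read.text(...)`, e.g. with `split` on an escaped regex, but the pure-Python version above shows the idea.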