How to save a file with a multi-character delimiter in Spark


Question

I need to save a file delimited by the "|~" characters, but I get an error when I execute the command below. Can I save a file using a multi-character delimiter in Spark?

mydf1.coalesce(1).write.option("compression","none").format("csv").mode("Overwrite").option("delimiter","|~").save("my_hdfs_path")

// Error : pyspark.sql.utils.IllegalArgumentException: u'Delimiter cannot be more than one character: |~'

Answer

AFAIK, we are still waiting for an "official" solution, because the issue "support for multiple delimiter in Spark CSV read" is still open, and Spark SQL still relies on univocity-parsers (https://github.com/uniVocity/univocity-parsers). In univocity's CSV settings, the delimiter can only be a single character, which constrains both the parser (reader) and the generator (writer).

Workarounds

Finding a universally fast and safe way to write CSV is hard, but depending on your data size and the complexity of the CSV contents (date formats? currency? quoting?), there may be a shortcut. The following are just a few, hopefully inspiring, ideas...

  1. Write the CSV with a placeholder character (say ⊢), then substitute it with |~.

(not benchmarked, but IMO this is the most promising candidate for the fastest option)

df.coalesce(1).write.option("compression","none").option("delimiter", "⊢").mode("overwrite").csv("raw-output")

then post-process the output (ideally locally) with, say, sed; BSD/macOS syntax is shown here, with GNU sed write -i.bak instead (a guard against placeholder collisions is sketched right after the command):

sed -i '.bak' 's/⊢/|~/g' raw-output/*.csv
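
One caveat to check before writing: the placeholder must not already occur in the data, or the substitution will corrupt those rows. A minimal PySpark sketch of such a check (the scan over string columns is an illustrative assumption, not part of the original answer):

from functools import reduce
from pyspark.sql import functions as F

placeholder = "⊢"
# only string columns can contain the placeholder character
string_cols = [f.name for f in df.schema.fields
               if f.dataType.simpleString() == "string"]
if string_cols:
    cond = reduce(lambda a, b: a | b,
                  [F.col(c).contains(placeholder) for c in string_cols])
    assert df.filter(cond).count() == 0, \
        "placeholder found in data; pick a different character"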

  2. Within PySpark, concatenate each row into a single string, then write it out as a text file.

    (flexible for locale and special formatting needs, at the cost of a bit more work; a UDF-free variant is sketched after the example)

    from pyspark.sql.functions import udf

    d = [{'name': 'Alice', 'age': 1}, {'name': 'Bob', 'age': 3}]
    df = spark.createDataFrame(d, "name: string, age: int")

    df.show()

    #+-----+---+
    #| name|age|
    #+-----+---+
    #|Alice|  1|
    #|  Bob|  3|
    #+-----+---+


    @udf
    def mkstr(name, age):
        """
        Build one output line; here, for example, the string field
        `name` is quoted with `"`.
        """
        return '"{name}"|~{age}'.format(name=name, age=age)

    # unparse each row back to a single delimited string
    df_unparsed = df.select(mkstr("name", "age").alias("csv_row"))
    df_unparsed.show()

    #+----------+
    #|   csv_row|
    #+----------+
    #|"Alice"|~1|
    #|  "Bob"|~3|
    #+----------+

    df_unparsed.coalesce(1).write.option("compression", "none").mode("overwrite").text("output")
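
    If no per-field quoting is needed, the same row-to-string step can be done without a Python UDF via concat_ws, which stays in the JVM and is typically faster. A minimal sketch (note that no quoting or escaping is applied here):

    from pyspark.sql import functions as F

    # cast every column to string and join with the multi-character
    # separator; note concat_ws skips nulls instead of leaving empty slots
    df_rows = df.select(
        F.concat_ws("|~", *[F.col(c).cast("string") for c in df.columns]).alias("csv_row")
    )
    df_rows.coalesce(1).write.option("compression", "none").mode("overwrite").text("output")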
    

  3. numpy.savetxt allows multiple characters as a delimiter, so ...

    (numpy also has lots of builtins if you care about the precision of floating-point numbers; a round-trip check is sketched after the snippet)

    import pandas as pd
    import numpy as np

    # convert the Spark DataFrame to a pandas DataFrame
    # (caveat: toPandas() collects the whole dataset into driver
    #  memory, so this route only suits reasonably small data)
    df_pd = df.toPandas()

    # numpy.savetxt accepts a multi-character delimiter
    np.savetxt("a-long-day.csv", df_pd, delimiter="|~", fmt="%s")
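
    As a quick sanity check, the file can be read back with pandas; under engine="python" a multi-character sep is treated as a regular expression, so the | must be escaped (header=None and the column names are assumptions matching the toy df above):

    import pandas as pd

    # sep is interpreted as a regex by the python engine, hence the
    # escaped "|"; savetxt wrote no header row
    df_back = pd.read_csv("a-long-day.csv", sep=r"\|~", engine="python",
                          header=None, names=["name", "age"])
    print(df_back)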
    
