How to load big double numbers in a PySpark DataFrame and persist it back without changing the numeric format to scientific notation or precision?


Problem description

I have a CSV like this:

COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123

I want to load it having the column VAL as a numeric type (due to other requirements of the project) and then persist it back to another CSV, as per the structure below:

+-----+------------------+
|  COL|               VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2|    200000000.1234|
|TEST3|   9999.1234679123|
+-----+------------------+

The problem I'm facing is that whenever I load it, the numbers turn into scientific notation, and I cannot persist it back without specifying the precision and scale of my data (I want to use whatever precision is already in the file, which I can't infer). Here's what I have tried:

Loading it with DoubleType() gives me scientific notation:

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DoubleType())
])

csv_file = "Downloads/test.csv"
df2 = (spark.read.format("csv")
       .option("sep", ",")
       .option("header", "true")
       .schema(schema)
       .load(csv_file))

df2.show()

df2.show()

+-----+--------------------+
|  COL|                 VAL|
+-----+--------------------+
| TEST|1.0000000012345679E8|
|TEST2|    2.000000001234E8|
|TEST3|     9999.1234679123|
+-----+--------------------+
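
The E-notation here is purely how Spark renders a DoubleType column: the underlying java.lang.Double's default string form switches to scientific notation for large magnitudes, while the stored value is intact. A small plain-Python sketch of the same binary double (my own illustration, not from the original question):

x = float("100000000.12345679")
print(repr(x))           # 100000000.12345679 (the value itself is intact)
print("{:E}".format(x))  # 1.000000E+08, an E-notation rendering like Spark's show()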

Loading it with DecimalType() I'm required to specify precision and scale; otherwise, I lose the decimals after the dot. However, specifying them, besides the risk of not getting the correct value (as my data might be rounded), gives me zeros after the dot. For example, using StructField('VAL', DecimalType(38, 18)) I get:

[Row(COL='TEST', VAL=Decimal('100000000.123456790000000000')),
 Row(COL='TEST2', VAL=Decimal('200000000.123400000000000000')),
 Row(COL='TEST3', VAL=Decimal('9999.123467912300000000'))]
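
Those trailing zeros are the fixed scale at work: DecimalType(38, 18) always stores exactly 18 digits after the dot, padding with zeros where needed. A quick sketch of the same effect with Python's decimal module (my illustration, not part of the original question):

from decimal import Decimal

# Quantizing to 18 decimal places pads with zeros, mirroring DecimalType(38, 18):
print(Decimal("9999.1234679123").quantize(Decimal("1E-18")))
# 9999.123467912300000000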

Note that in this case, I end up with trailing zeros on the right side that I don't want in my new file.

The only way I found to address it was using a UDF where I first use float() to remove the scientific notation and then convert the value to a string to make sure it will be persisted as I want:

from pyspark.sql.functions import udf

# str(float(n)) renders the value in Python, which uses plain notation for
# these magnitudes instead of Spark's E-notation display
to_decimal = udf(lambda n: str(float(n)))

df2 = df2.select("*", to_decimal("VAL").alias("VAL2"))
df2 = df2.select(["COL", "VAL2"]).withColumnRenamed("VAL2", "VAL")
df2.show()
display(df2.schema)

+-----+------------------+
|  COL|               VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2|    200000000.1234|
|TEST3|   9999.1234679123|
+-----+------------------+

StructType(List(StructField(COL,StringType,true),StructField(VAL,StringType,true)))
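
For completeness, the resulting string column can then be written back out in the plain format (a sketch; the output path is hypothetical):

(df2.write
    .format("csv")
    .option("header", "true")
    .mode("overwrite")
    .save("Downloads/test_out.csv"))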

Is there any way to achieve the same result without using the UDF trick?

Thanks!

Recommended answer

The best way I found to address it was as below. It is still using a UDF, but now without the string workarounds to avoid scientific notation. I won't mark it as the correct answer yet, because I still expect someone to come up with a solution without a UDF (or a good explanation of why it's not possible without UDFs).

  1. The CSV:

$ cat /Users/bambrozi/Downloads/testf.csv
COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123
TEST4,123456789.01234567

  2. Load the CSV applying the default PySpark DecimalType precision and scale:

from pyspark.sql.types import StructType, StructField, StringType, DecimalType

schema = StructType([
    StructField('COL', StringType()),
    StructField('VAL', DecimalType(38, 18))
])

csv_file = "Downloads/testf.csv"
df2 = (spark.read.format("csv")
        .option("sep", ",")
        .option("header", "true")
        .schema(schema)
        .load(csv_file))

df2.show(truncate=False)

Output:

+-----+----------------------------+
|COL  |VAL                         |
+-----+----------------------------+
|TEST |100000000.123456790000000000|
|TEST2|200000000.123400000000000000|
|TEST3|9999.123467912300000000     |
|TEST4|123456789.012345670000000000|
+-----+----------------------------+

  3. When you are ready to report it (print it or save it in a new file), you can apply formatting to remove the trailing zeros:

import decimal
import pyspark.sql.functions as F

# The UDF receives decimal.Decimal values; Decimal.normalize() strips the
# trailing zeros. With no explicit returnType, the result comes back as a string.
normalize_decimals = F.udf(lambda dec: dec.normalize())

(df2
    .withColumn('VAL', normalize_decimals(F.col('VAL')))
    .show(truncate=False))

Output:

+-----+------------------+
|COL  |VAL               |
+-----+------------------+
|TEST |100000000.12345679|
|TEST2|200000000.1234    |
|TEST3|9999.1234679123   |
|TEST4|123456789.01234567|
+-----+------------------+

