How to set display precision in PySpark Dataframe show


Question

How do you set the display precision in PySpark when calling .show()?

Consider the following example:

from math import sqrt
import pyspark.sql.functions as f

# two columns of square roots; sqlCtx is the SQLContext (or SparkSession)
# already available in the PySpark shell
data = zip(
    map(lambda x: sqrt(x), range(100, 105)),
    map(lambda x: sqrt(x), range(200, 205))
)
df = sqlCtx.createDataFrame(data, ["col1", "col2"])
df.select([f.avg(c).alias(c) for c in df.columns]).show()

Output:

#+------------------+------------------+
#|              col1|              col2|
#+------------------+------------------+
#|10.099262230352151|14.212583322380274|
#+------------------+------------------+

How can I change it so that it only displays 3 digits after the decimal point?

Desired output:

#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+

This is a PySpark version of this Scala question. I'm posting it here because I could not find an answer when searching for PySpark solutions, and I think it can be helpful to others in the future.

Answer

round

The easiest option is to use pyspark.sql.functions.round():

from pyspark.sql.functions import avg, round
# this round is pyspark.sql.functions.round, which shadows the Python built-in
df.select([round(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+

This will maintain the values as numeric types.
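As a quick check (a small sketch reusing the df from the question), the rounded aggregates are still doubles:

from pyspark.sql.functions import avg, round

# schema of the rounded aggregates: both columns remain DoubleType
rounded = df.select([round(avg(c), 3).alias(c) for c in df.columns])
rounded.printSchema()
#root
# |-- col1: double (nullable = true)
# |-- col2: double (nullable = true)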

The functions are the same for Scala and Python; the only difference is the import.

format_number

You can use format_number to format a number to the desired decimal places, as stated in the official API documentation:

Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.

from pyspark.sql.functions import avg, format_number 
df.select([format_number(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+

The transformed columns would be of StringType, and a comma is used as the thousands separator:

#+-----------+--------------+
#|       col1|          col2|
#+-----------+--------------+
#|500,100.000|50,489,590.000|
#+-----------+--------------+
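For illustration (a minimal sketch with made-up larger values, not the df from the question, and assuming a SparkSession named spark), the grouping commas appear once the numbers are large enough:

from pyspark.sql.functions import format_number

# hypothetical DataFrame with larger values to show the thousands separator
big = spark.createDataFrame([(500100.0, 50489590.0)], ["col1", "col2"])
big.select([format_number(c, 3).alias(c) for c in big.columns]).show()
#+-----------+--------------+
#|       col1|          col2|
#+-----------+--------------+
#|500,100.000|50,489,590.000|
#+-----------+--------------+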

As stated in the Scala version of this answer, we can use regexp_replace to replace the , with any string you want:

Replace all substrings of the specified string value that match regexp with rep.

from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
    [regexp_replace(format_number(avg(c), 3), ",", "").alias(c) for c in df.columns]
).show()
#+----------+------------+
#|      col1|        col2|
#+----------+------------+
#|500100.000|50489590.000|
#+----------+------------+
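Note that the result here is still a string column. If a numeric column is needed afterwards, one option (a sketch, not part of the original answer) is to cast the cleaned string back to a double:

from pyspark.sql.functions import avg, format_number, regexp_replace

# cast the comma-free string back to a numeric type; trailing zeros are lost on display
df.select(
    [
        regexp_replace(format_number(avg(c), 3), ",", "").cast("double").alias(c)
        for c in df.columns
    ]
).printSchema()
#root
# |-- col1: double (nullable = true)
# |-- col2: double (nullable = true)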
