Trouble With Pyspark Round Function
Question
I'm having some trouble getting the round function in pyspark to work. In the block of code below, I'm trying to round the new_bid column to 2 decimal places and rename the column to bid afterwards. For reference, I'm importing pyspark.sql.functions as func and using the round function contained within it:
output = output.select(col("ad").alias("ad_id"),
                       col("part").alias("part_id"),
                       func.round(col("new_bid"), 2).alias("bid"))
The new_bid column here is of type float. The resulting dataframe does not have the newly named bid column rounded to 2 decimal places as I intended; the values are still 8 or 9 decimal places out.
I've tried various things but can't seem to get the resulting dataframe to have the rounded value. Any pointers would be greatly appreciated! Thanks!
Recommended answer
Here are a couple of ways to do it with some toy data:
spark.version
# u'2.2.0'
import pyspark.sql.functions as func
df = spark.createDataFrame(
    [(0.0, 0.2, 3.45631),
     (0.4, 1.4, 2.82945),
     (0.5, 1.9, 7.76261),
     (0.6, 0.9, 2.76790),
     (1.2, 1.0, 9.87984)],
    ["col1", "col2", "col3"])
df.show()
# +----+----+-------+
# |col1|col2|   col3|
# +----+----+-------+
# | 0.0| 0.2|3.45631|
# | 0.4| 1.4|2.82945|
# | 0.5| 1.9|7.76261|
# | 0.6| 0.9| 2.7679|
# | 1.2| 1.0|9.87984|
# +----+----+-------+
# round 'col3' in a new column:
df2 = df.withColumn("col4", func.round(df["col3"], 2)).withColumnRenamed("col4","new_col3")
df2.show()
# +----+----+-------+--------+
# |col1|col2|   col3|new_col3|
# +----+----+-------+--------+
# | 0.0| 0.2|3.45631|    3.46|
# | 0.4| 1.4|2.82945|    2.83|
# | 0.5| 1.9|7.76261|    7.76|
# | 0.6| 0.9| 2.7679|    2.77|
# | 1.2| 1.0|9.87984|    9.88|
# +----+----+-------+--------+
# round & replace existing 'col3':
df3 = df.withColumn("col3", func.round(df["col3"], 2))
df3.show()
# +----+----+----+
# |col1|col2|col3|
# +----+----+----+
# | 0.0| 0.2|3.46|
# | 0.4| 1.4|2.83|
# | 0.5| 1.9|7.76|
# | 0.6| 0.9|2.77|
# | 1.2| 1.0|9.88|
# +----+----+----+
It's a matter of personal taste, but I am not a great fan of either col or alias; I prefer withColumn and withColumnRenamed instead. Nevertheless, if you would like to stick with select and col, here is how you could adapt your own code snippet:
from pyspark.sql.functions import col
df4 = df.select(col("col1").alias("new_col1"),
                col("col2").alias("new_col2"),
                func.round(df["col3"], 2).alias("new_col3"))
df4.show()
# +--------+--------+--------+
# |new_col1|new_col2|new_col3|
# +--------+--------+--------+
# |     0.0|     0.2|    3.46|
# |     0.4|     1.4|    2.83|
# |     0.5|     1.9|    7.76|
# |     0.6|     0.9|    2.77|
# |     1.2|     1.0|    9.88|
# +--------+--------+--------+
# +--------+--------+--------+
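The question mentions that new_bid is of type float and that 8 or 9 decimal places still show up. One possible explanation, which is an assumption and not confirmed by the question, is that a rounded 32-bit float is later widened to a 64-bit double (for example when values are collected into Python), at which point the nearest 32-bit approximation of the rounded number becomes visible. The plain-Python sketch below imitates that effect with the standard struct module; the helper as_float32 is hypothetical and written only for this illustration.

```python
import struct


def as_float32(x: float) -> float:
    """Round-trip x through a 32-bit float and widen it back to a 64-bit Python float."""
    return struct.unpack("f", struct.pack("f", x))[0]


rounded = round(3.45631, 2)    # rounded as a 64-bit double
widened = as_float32(rounded)  # nearest 32-bit float, then widened to 64 bits

print(rounded)            # the rounded double
print(widened)            # extra trailing digits appear here
print(round(widened, 2))  # rounding again as a double restores two decimals
```

If this is what is happening, casting the column to double before rounding, or to a decimal type after rounding, may avoid the extra digits.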