Pandas dataframe to Spark dataframe, handling NaN conversions to actual null?

Problem description

I want to convert a dataframe from Pandas to Spark and I am using the spark_context.createDataFrame() method to create it. I'm also specifying the schema in the createDataFrame() method.

What I want to know is how to handle special cases. For example, NaN in Pandas, when converted to a Spark dataframe, ends up being the string "NaN". I am looking for a way to get actual nulls instead of "NaN".

Recommended answer

TL;DR Your best option for now is to skip Pandas completely.

The source of the problem is that Pandas is less expressive than Spark SQL. Spark provides both NULL (in the SQL sense, a missing value) and NaN (a numeric Not a Number).

Pandas, on the other hand, doesn't have a native value that can be used to represent missing values. As a result it uses placeholders like NaN / NaT or Inf, which are indistinguishable to Spark from actual NaNs and Infs, and the conversion rules depend on the column type. The only exception is object columns (typically strings), which can contain None values. You can learn more about handling missing values in Pandas from the documentation.
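
To make the distinction concrete, here is a small illustration (not part of the original answer) of how each column type swallows None:

import pandas as pd

print(pd.Series([1.0, None]).tolist())                      # [1.0, nan]
print(pd.Series([pd.Timestamp("20120101"), None]).tolist()) # [Timestamp('2012-01-01 00:00:00'), NaT]
print(pd.Series(["foo", None]).tolist())                    # ['foo', None]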

For example, NaN in Pandas, when converted to a Spark dataframe, ends up being the string "NaN".

This is actually not correct. It depends on the type of the input column. If a column shows NaN, it is most likely a not-a-number value, not a plain string:

import pandas as pd
from pyspark.sql.functions import isnan, isnull

pdf = pd.DataFrame({
    "x": [1, None], "y": [None, "foo"], 
    "z": [pd.Timestamp("20120101"), pd.Timestamp("NaT")]
})
sdf = spark.createDataFrame(pdf)

sdf.show()

+---+----+-------------------+
|  x|   y|                  z|
+---+----+-------------------+
|1.0|null|2012-01-01 00:00:00|
|NaN| foo|               null|
+---+----+-------------------+

sdf.select([
    f(c) for c in sdf.columns for f in [isnan, isnull] 
    if (f, c) != (isnan, "z")  # isnan cannot be applied to timestamp 
]).show()

+--------+-----------+--------+-----------+-----------+
|isnan(x)|(x IS NULL)|isnan(y)|(y IS NULL)|(z IS NULL)|
+--------+-----------+--------+-----------+-----------+
|   false|      false|   false|       true|      false|
|    true|      false|   false|      false|       true|
+--------+-----------+--------+-----------+-----------+

In practice, parallelized local collections (including Pandas objects) have negligible importance beyond simple tests and toy examples, so you can always convert the data manually (skipping possible Arrow optimizations):

import numpy as np
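# Rebuild the rows one by one, replacing float NaNs with None; note that
# to_records() turns datetime64 values into raw integer nanoseconds.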

spark.createDataFrame([
    tuple(
        None if isinstance(x, (float, int)) and np.isnan(x) else x
        for x in record.tolist()
    )
    for record in pdf.to_records(index=False)
], pdf.columns.tolist()).show()

+----+----+-------------------+
|   x|   y|                  z|
+----+----+-------------------+
| 1.0|null|1325376000000000000|
|null| foo|               null|
+----+----+-------------------+
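
Note that the z column above came out as raw integer nanoseconds, because to_records() hands Spark plain integers instead of timestamps. A variant that should keep the timestamps intact (a sketch, not part of the original answer) is to upcast the frame to object and swap every missing placeholder for None before handing it to Spark:

# Upcast to object so NaN / NaT can be replaced by None, then let Spark
# convert the remaining Python objects (float, str, pd.Timestamp) itself.
pdf_clean = pdf.astype(object).where(pdf.notnull(), None)
spark.createDataFrame(pdf_clean).show()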

If the missing-value / not-a-number ambiguity is not an issue, just load the data as usual and replace in Spark:

from pyspark.sql.functions import col, when 
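# NaN can only appear in float/double columns, so only those need rewriting.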

sdf.select([
    when(~isnan(c), col(c)).alias(c) if t in ("double", "float") else c 
    for c, t in sdf.dtypes
]).show()

+----+----+-------------------+
|   x|   y|                  z|
+----+----+-------------------+
| 1.0|null|2012-01-01 00:00:00|
|null| foo|               null|
+----+----+-------------------+
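
If you prefer a built-in over the when / isnan combination, nanvl should express the same rewrite: it returns its first argument unless that value is NaN, in which case it returns the second. A minimal equivalent sketch, assuming the same sdf as above:

from pyspark.sql.functions import col, lit, nanvl

# nanvl only accepts float/double columns; everything else passes through.
sdf.select([
    nanvl(col(c), lit(None).cast(t)).alias(c) if t in ("double", "float") else c
    for c, t in sdf.dtypes
]).show()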
