Pandas dataframe to Spark dataframe, handling NaN conversions to actual null?

Question

I want to convert a dataframe from Pandas to Spark, and I am using the spark_context.createDataFrame() method to create the dataframe. I am also specifying the schema in the createDataFrame() method.

What I want to know is how to handle special cases. For example, NaN in Pandas ends up as the string "NaN" when converted to a Spark dataframe. I am looking for ways to get actual nulls instead of "NaN".

Answer

TL;DR Your best option for now is to skip Pandas completely.
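
For illustration, here is a minimal sketch of that approach: build the rows directly in Spark with an explicit schema, using Python None for the missing values, so no NaN placeholder ever enters the picture (the column names and types here are assumptions chosen to match the example further down):

from pyspark.sql.types import StructType, StructField, DoubleType, StringType

# Python None maps straight to a SQL NULL when Spark builds the rows itself
# (assumes an active SparkSession bound to `spark`)
schema = StructType([
    StructField("x", DoubleType(), True),
    StructField("y", StringType(), True),
])
spark.createDataFrame([(1.0, None), (None, "foo")], schema).show()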

The source of the problem is that Pandas is less expressive than Spark SQL. Spark provides both NULL (in the SQL sense, a missing value) and NaN (numeric Not a Number).

Pandas, on the other hand, doesn't have a native value that can be used to represent missing values. As a result it uses placeholders like NaN / NaT or Inf, which are indistinguishable to Spark from actual NaNs and Infs, and the conversion rules depend on the column type. The only exception is object columns (typically strings), which can contain None values. You can learn more about handling missing values in Pandas from the documentation.
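
That distinction is easy to see in plain Pandas (a minimal sketch, nothing Spark-specific assumed):

import pandas as pd

# a float column stores the missing value as NaN ...
print(pd.Series([1.0, None]))    # 1.0, NaN   (dtype: float64)
# ... while an object column can hold a genuine None
print(pd.Series(["foo", None]))  # foo, None  (dtype: object)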

For example, NaN in Pandas ends up as the string "NaN" when converted to a Spark dataframe.

This is actually not correct; it depends on the type of the input column. If a column shows NaN, it is most likely a not-a-number value, not a plain string:

import pandas as pd

from pyspark.sql.functions import isnan, isnull

# A frame mixing a float NaN, a None in an object column, and a NaT timestamp
pdf = pd.DataFrame({
    "x": [1, None], "y": [None, "foo"],
    "z": [pd.Timestamp("20120101"), pd.Timestamp("NaT")]
})
sdf = spark.createDataFrame(pdf)  # assumes an active SparkSession bound to `spark`

sdf.show()

+---+----+-------------------+
|  x|   y|                  z|
+---+----+-------------------+
|1.0|null|2012-01-01 00:00:00|
|NaN| foo|               null|
+---+----+-------------------+

sdf.select([
    f(c) for c in sdf.columns for f in [isnan, isnull] 
    if (f, c) != (isnan, "z")  # isnan cannot be applied to timestamp 
]).show()

+--------+-----------+--------+-----------+-----------+
|isnan(x)|(x IS NULL)|isnan(y)|(y IS NULL)|(z IS NULL)|
+--------+-----------+--------+-----------+-----------+
|   false|      false|   false|       true|      false|
|    true|      false|   false|      false|       true|
+--------+-----------+--------+-----------+-----------+

In practice, parallelized local collections (including Pandas objects) have negligible importance beyond simple tests and toy examples, so you can always convert the data manually (skipping possible Arrow optimizations):

import numpy as np

# Convert record by record, swapping float NaN for Python None so that
# Spark stores a proper NULL. Note that in this example the datetime column
# surfaces as raw integer nanoseconds (see the output below), because
# to_records()/tolist() hands Spark plain integers rather than datetimes.
spark.createDataFrame([
    tuple(
        None if isinstance(x, (float, int)) and np.isnan(x) else x
        for x in record.tolist())
    for record in pdf.to_records(index=False)
], pdf.columns.tolist()).show()

+----+----+-------------------+
|   x|   y|                  z|
+----+----+-------------------+
| 1.0|null|1325376000000000000|
|null| foo|               null|
+----+----+-------------------+

If the missing / not-a-number ambiguity is not an issue, just load the data as usual and replace the NaNs in Spark:

from pyspark.sql.functions import col, when

sdf.select([
    # keep the value only when it is not NaN; a when() without a matching
    # otherwise() falls through to NULL. Only float columns need this.
    when(~isnan(c), col(c)).alias(c) if t in ("double", "float") else c
    for c, t in sdf.dtypes
]).show()

+----+----+-------------------+
|   x|   y|                  z|
+----+----+-------------------+
| 1.0|null|2012-01-01 00:00:00|
|null| foo|               null|
+----+----+-------------------+
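
An equivalent way to express the same replacement is the built-in nanvl function, which returns its first argument unless that is NaN (a sketch against the same sdf as above; the cast to the column's own type keeps the NULL typed):

from pyspark.sql.functions import col, lit, nanvl

sdf.select([
    # fall back to a typed NULL whenever the original value is NaN
    nanvl(col(c), lit(None).cast(t)).alias(c) if t in ("double", "float") else col(c)
    for c, t in sdf.dtypes
]).show()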
