PySpark: split a column into multiple columns without pandas


Question

My question is how to split a column into multiple columns. I don't know why df.toPandas() does not work.

For example, I would like to change 'df_test' into 'df_test2'. I have seen many examples using the pandas module. Is there another way? Thank you in advance.

df_test = sqlContext.createDataFrame([
    (1, '14-Jul-15'),
    (2, '14-Jun-15'),
    (3, '11-Oct-15'),
], ('id', 'date'))

df_test2

id    day    month    year
1     14     Jul      15
2     14     Jun      15
3     11     Oct      15


Answer

It is not possible to derive multiple top-level columns in a single access. You can use structs or collection types with a UDF like this:

from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql import Row
from pyspark.sql.functions import udf, col

schema = StructType([
  StructField("day", StringType(), True),
  StructField("month", StringType(), True),
  StructField("year", StringType(), True)
])

def split_date_(s):
    try:
        d, m, y = s.split("-")
        return d, m, y
    except (AttributeError, ValueError):
        # Return NULL for missing or malformed input instead of failing the job
        return None

split_date = udf(split_date_, schema)

transformed = df_test.withColumn("date", split_date(col("date")))
transformed.printSchema()

## root
##  |-- id: long (nullable = true)
##  |-- date: struct (nullable = true)
##  |    |-- day: string (nullable = true)
##  |    |-- month: string (nullable = true)
##  |    |-- year: string (nullable = true)

But this is not only quite verbose in PySpark; it is also expensive.

For date-based transformations you can simply use built-in functions:

from pyspark.sql.functions import unix_timestamp, dayofmonth, year, date_format

transformed = (df_test
    .withColumn("ts",
        unix_timestamp(col("date"), "dd-MMM-yy").cast("timestamp"))
    .withColumn("day", dayofmonth(col("ts")).cast("string"))
    .withColumn("month", date_format(col("ts"), "MMM"))
    .withColumn("year", year(col("ts")).cast("string"))
    .drop("ts"))

You can similarly use regexp_extract to split the date string.

See also: Derive multiple columns from a single column in a Spark DataFrame.

Note

If you use a version not patched against SPARK-11724, this will require a correction after unix_timestamp(...) and before cast("timestamp").

