Is it possible to subclass DataFrame in Pyspark?


Question

The documentation for Pyspark shows DataFrames being constructed from sqlContext, sqlContext.read(), and a variety of other methods.

(See https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html)

Is it possible to subclass DataFrame and instantiate it independently? I would like to add methods and functionality to the base DataFrame class.

Answer

It really depends on your goals.

  • Technically speaking, it is possible. pyspark.sql.DataFrame is just a plain Python class. You can extend it or monkey-patch it if you need to (a monkey-patching sketch follows the example below).

from pyspark.sql import DataFrame

class DataFrameWithZipWithIndex(DataFrame):
    def __init__(self, df):
        # Wrap an existing DataFrame by reusing its JVM handle and SQL context.
        super(DataFrameWithZipWithIndex, self).__init__(df._jdf, df.sql_ctx)

    def zipWithIndex(self):
        # Drop to the underlying RDD, attach an index to each row,
        # and rebuild a DataFrame with the index as the first column.
        return (self.rdd
            .zipWithIndex()
            .map(lambda row: (row[1], ) + row[0])
            .toDF(["_idx"] + self.columns))

Example usage:

df = sc.parallelize([("a", 1)]).toDF(["foo", "bar"])

with_zipwithindex = DataFrameWithZipWithIndex(df)

isinstance(with_zipwithindex, DataFrame)
## True

with_zipwithindex.zipWithIndex().show()

+----+---+---+
|_idx|foo|bar|
+----+---+---+
|   0|  a|  1|
+----+---+---+
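
Monkey-patching achieves the same thing without a wrapper class. A minimal sketch under the same API as above (the zip_with_index helper name is illustrative):

from pyspark.sql import DataFrame

def zip_with_index(self):
    # Same logic as the subclass method, written as a free function
    # that takes the DataFrame as self.
    return (self.rdd
        .zipWithIndex()
        .map(lambda row: (row[1], ) + row[0])
        .toDF(["_idx"] + self.columns))

# Attach the method to the existing class; every DataFrame now exposes it.
DataFrame.zipWithIndex = zip_with_index

sc.parallelize([("a", 1)]).toDF(["foo", "bar"]).zipWithIndex().show()

The trade-off: the subclass keeps the extension explicit and type-checkable, while the monkey-patch applies globally and silently to all DataFrames in the session.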

  • Practically speaking, you won't be able to do much here. DataFrame is a thin wrapper around a JVM object and doesn't do much beyond providing docstrings, converting arguments to the form required natively, calling JVM methods, and wrapping the results using Python adapters if necessary (see the simplified sketch below).

    With plain Python code you won't be able to even go near DataFrame / Dataset internals, let alone modify their core behavior. If you're looking for a standalone, Python-only Spark DataFrame implementation, it is not possible.
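
To illustrate the wrapper point, this is roughly the shape of a DataFrame method on the Python side — a simplified sketch of the delegation pattern, not the actual PySpark source (SimplifiedDataFrame is a made-up name for illustration):

# Simplified sketch of how pyspark.sql.DataFrame methods delegate to the JVM.
class SimplifiedDataFrame(object):
    def __init__(self, jdf, sql_ctx):
        self._jdf = jdf          # py4j handle to the JVM-side DataFrame
        self.sql_ctx = sql_ctx

    def count(self):
        # Delegate to the JVM object, then adapt the result for Python.
        return int(self._jdf.count())

All the real work (query planning, optimization, execution) happens on the JVM side, which is why pure-Python code cannot reach or alter it.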

