How to exclude multiple columns in a Spark DataFrame in Python


Question

I found that PySpark has a method called drop, but it seems it can only drop one column at a time. Any ideas about how to drop multiple columns at the same time?

df.drop(['col1','col2'])

TypeError                                 Traceback (most recent call last)
<ipython-input-96-653b0465e457> in <module>()
----> 1 selectedMachineView = machineView.drop([['GpuName','GPU1_TwoPartHwID']])

/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.pyc in drop(self, col)
   1257             jdf = self._jdf.drop(col._jc)
   1258         else:
-> 1259             raise TypeError("col should be a string or a Column")
   1260         return DataFrame(jdf, self.sql_ctx)
   1261 

TypeError: col should be a string or a Column

Recommended answer

Simply use select:

df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])
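Since df.columns is just a Python list of strings, the filtering step inside that select call can be checked without a SparkSession. A minimal sketch, using hypothetical column names in place of a live DataFrame:

```python
# Stand-in for df.columns (in PySpark this is a plain list of strings)
columns = ['GpuName', 'GPU1_TwoPartHwID', 'MachineName', 'OSVersion']
to_exclude = {'GpuName', 'GPU1_TwoPartHwID'}

# The same comprehension select() receives as its column list
kept = [c for c in columns if c not in to_exclude]
print(kept)  # ['MachineName', 'OSVersion']
```

Using a set for the exclusions keeps the membership test O(1) per column, though for a handful of columns a plain list would work just as well.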

or if you really want to use drop, then reduce should do the trick:

from functools import reduce
from pyspark.sql import DataFrame

reduce(DataFrame.drop, ['GpuName','GPU1_TwoPartHwID'], df)
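Here reduce folds df through successive drop calls, so the expression above is equivalent to df.drop('GpuName').drop('GPU1_TwoPartHwID'). A minimal sketch of that folding behavior, using a hypothetical stand-in class rather than a live DataFrame:

```python
from functools import reduce

class FakeFrame:
    """Stand-in for a DataFrame: drop() returns a new object, as in PySpark."""
    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, col):
        return FakeFrame([c for c in self.columns if c != col])

df = FakeFrame(['GpuName', 'GPU1_TwoPartHwID', 'MachineName'])

# Folds to df.drop('GpuName').drop('GPU1_TwoPartHwID')
result = reduce(FakeFrame.drop, ['GpuName', 'GPU1_TwoPartHwID'], df)
print(result.columns)  # ['MachineName']
```

Passing the unbound method FakeFrame.drop (like DataFrame.drop in the original) works because reduce calls it as drop(accumulator, column) on each step.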

Note (execution time differences):

There should be no difference when it comes to data processing time: while these methods generate different logical plans, the physical plans are exactly the same.

There is a difference, however, when we analyze the driver-side code:

  • the first method makes only a single JVM call, while the second one has to call the JVM for each column that has to be excluded
  • the first method generates a logical plan which is equivalent to the physical plan; in the second case it is rewritten
  • finally, comprehensions are significantly faster in Python than methods like map or reduce
  • Spark 2.x+ supports multiple columns in drop. See SPARK-11884 (Drop multiple columns in the DataFrame API) and SPARK-12204 (Implement drop method for DataFrame in SparkR) for details.
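On Spark 2.x+, then, the multi-column case needs neither select nor reduce: drop accepts several names at once, and a list can be unpacked into it with *. A sketch of that varargs signature, again with a hypothetical stand-in class instead of a SparkSession:

```python
class FakeFrame:
    """Stand-in whose drop() mimics the Spark 2.x+ varargs signature."""
    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, *cols):  # accepts any number of column names
        return FakeFrame([c for c in self.columns if c not in cols])

df = FakeFrame(['GpuName', 'GPU1_TwoPartHwID', 'MachineName'])

# Equivalent PySpark calls:
#   df.drop('GpuName', 'GPU1_TwoPartHwID')
#   df.drop(*['GpuName', 'GPU1_TwoPartHwID'])
result = df.drop(*['GpuName', 'GPU1_TwoPartHwID'])
print(result.columns)  # ['MachineName']
```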

