How to exclude multiple columns in a Spark DataFrame in Python


Question

I found that PySpark has a method called drop, but it seems it can only drop one column at a time. Any ideas about how to drop multiple columns at the same time?

df.drop(['col1','col2'])

TypeError                                 Traceback (most recent call last)
<ipython-input-96-653b0465e457> in <module>()
----> 1 selectedMachineView = machineView.drop([['GpuName','GPU1_TwoPartHwID']])

/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.pyc in drop(self, col)
   1257             jdf = self._jdf.drop(col._jc)
   1258         else:
-> 1259             raise TypeError("col should be a string or a Column")
   1260         return DataFrame(jdf, self.sql_ctx)
   1261 

TypeError: col should be a string or a Column

Recommended answer

Simply use select:

df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])
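Since df.columns is just a Python list of strings, the filtering step inside that select call can be checked without a SparkSession. A minimal sketch, using hypothetical column names in place of a live DataFrame:

```python
# Stand-in for df.columns (in PySpark this is a plain list of strings)
columns = ['GpuName', 'GPU1_TwoPartHwID', 'MachineName', 'OSVersion']
to_exclude = {'GpuName', 'GPU1_TwoPartHwID'}

# The same comprehension select() receives as its column list
kept = [c for c in columns if c not in to_exclude]
print(kept)  # ['MachineName', 'OSVersion']
```

Using a set for the exclusions keeps the membership test O(1) per column, though for a handful of columns a plain list would work just as well.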

or if you really want to use drop, then reduce should do the trick:

from functools import reduce
from pyspark.sql import DataFrame

reduce(DataFrame.drop, ['GpuName','GPU1_TwoPartHwID'], df)
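Here reduce folds df through successive drop calls, so the expression above is equivalent to df.drop('GpuName').drop('GPU1_TwoPartHwID'). A minimal sketch of that folding behavior, using a hypothetical stand-in class rather than a live DataFrame:

```python
from functools import reduce

class FakeFrame:
    """Stand-in for a DataFrame: drop() returns a new object, as in PySpark."""
    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, col):
        return FakeFrame([c for c in self.columns if c != col])

df = FakeFrame(['GpuName', 'GPU1_TwoPartHwID', 'MachineName'])

# Folds to df.drop('GpuName').drop('GPU1_TwoPartHwID')
result = reduce(FakeFrame.drop, ['GpuName', 'GPU1_TwoPartHwID'], df)
print(result.columns)  # ['MachineName']
```

Passing the unbound method FakeFrame.drop (like DataFrame.drop in the original) works because reduce calls it as drop(accumulator, column) on each step.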

Note (execution time differences):

There should be no difference when it comes to data processing time: while these methods generate different logical plans, the physical plans are exactly the same.

There is a difference, however, when we analyze the driver-side code:

  • the first method makes only a single JVM call, while the second one has to call the JVM for each column that has to be excluded
  • the first method generates a logical plan which is equivalent to the physical plan; in the second case it is rewritten
  • finally, comprehensions are significantly faster in Python than methods like map or reduce
  • Spark 2.x+ supports multiple columns in drop. See SPARK-11884 (Drop multiple columns in the DataFrame API) and SPARK-12204 (Implement drop method for DataFrame in SparkR) for details.
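On Spark 2.x+, then, the multi-column case needs neither select nor reduce: drop accepts several names at once, and a list can be unpacked into it with *. A sketch of that varargs signature, again with a hypothetical stand-in class instead of a SparkSession:

```python
class FakeFrame:
    """Stand-in whose drop() mimics the Spark 2.x+ varargs signature."""
    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, *cols):  # accepts any number of column names
        return FakeFrame([c for c in self.columns if c not in cols])

df = FakeFrame(['GpuName', 'GPU1_TwoPartHwID', 'MachineName'])

# Equivalent PySpark calls:
#   df.drop('GpuName', 'GPU1_TwoPartHwID')
#   df.drop(*['GpuName', 'GPU1_TwoPartHwID'])
result = df.drop(*['GpuName', 'GPU1_TwoPartHwID'])
print(result.columns)  # ['MachineName']
```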

