How to exclude multiple columns in Spark dataframe in Python
Problem description
I found PySpark has a method called drop, but it seems it can only drop one column at a time. Any ideas about how to drop multiple columns at the same time?
df.drop(['col1','col2'])
TypeError Traceback (most recent call last)
<ipython-input-96-653b0465e457> in <module>()
----> 1 selectedMachineView = machineView.drop([['GpuName','GPU1_TwoPartHwID']])
/usr/hdp/current/spark-client/python/pyspark/sql/dataframe.pyc in drop(self, col)
1257 jdf = self._jdf.drop(col._jc)
1258 else:
-> 1259 raise TypeError("col should be a string or a Column")
1260 return DataFrame(jdf, self.sql_ctx)
1261
TypeError: col should be a string or a Column
Recommended answer
Simply with select:
df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])
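The comprehension above can be checked without a Spark session, since df.columns is just a list of column-name strings. A minimal sketch in plain Python (the 'MachineName' and 'Memory' columns are made up for illustration; the excluded names are the ones from the question):

```python
# Columns to exclude (the names from the question)
to_drop = {'GpuName', 'GPU1_TwoPartHwID'}

# Stand-in for df.columns, which in PySpark is a plain list of strings
columns = ['MachineName', 'GpuName', 'GPU1_TwoPartHwID', 'Memory']

# The same comprehension that gets passed to df.select(...)
kept = [c for c in columns if c not in to_drop]
print(kept)  # ['MachineName', 'Memory']
```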
Or if you really want to use drop, then reduce should do the trick:
from functools import reduce
from pyspark.sql import DataFrame
reduce(DataFrame.drop, ['GpuName','GPU1_TwoPartHwID'], df)
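The reduce call folds drop over the column list, which is equivalent to chaining df.drop('GpuName').drop('GPU1_TwoPartHwID'). The folding itself can be sketched without Spark using a stand-in class (the class and column names are illustrative, not part of the PySpark API):

```python
from functools import reduce

class FakeFrame:
    """Minimal stand-in mimicking a Spark DataFrame's one-column drop."""
    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, col):
        # Returns a new frame without the named column
        return FakeFrame([c for c in self.columns if c != col])

df = FakeFrame(['MachineName', 'GpuName', 'GPU1_TwoPartHwID', 'Memory'])

# Equivalent to df.drop('GpuName').drop('GPU1_TwoPartHwID')
result = reduce(FakeFrame.drop, ['GpuName', 'GPU1_TwoPartHwID'], df)
print(result.columns)  # ['MachineName', 'Memory']
```

Note that each step of the fold is a separate drop call, which is exactly why this variant makes one JVM call per excluded column, as discussed below.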
Note (difference in execution time):
There should be no difference when it comes to data processing time. While these methods generate different logical plans, the physical plans are exactly the same.
There is a difference, however, when we analyze driver-side code:
- the first method makes only a single JVM call, while the second one has to call the JVM for each column that has to be excluded
- the first method generates a logical plan which is equivalent to the physical plan. In the second case it is rewritten.
- finally, comprehensions are significantly faster in Python than methods like map or reduce
- Spark 2.x+ supports multiple columns in drop. See SPARK-11884 (Drop multiple columns in the DataFrame API) and SPARK-12204 (Implement drop method for DataFrame in SparkR) for details.
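On Spark 2.x+, then, the whole exclusion is a single variadic call: df.drop('GpuName', 'GPU1_TwoPartHwID'). A stand-in sketch of that call shape, runnable without a Spark session (the class and the extra column names are illustrative):

```python
class FakeFrame:
    """Stand-in showing the Spark 2.x drop(*cols) call shape."""
    def __init__(self, columns):
        self.columns = list(columns)

    def drop(self, *cols):
        # One call removes every listed column, per SPARK-11884
        return FakeFrame([c for c in self.columns if c not in set(cols)])

df = FakeFrame(['MachineName', 'GpuName', 'GPU1_TwoPartHwID', 'Memory'])
print(df.drop('GpuName', 'GPU1_TwoPartHwID').columns)  # ['MachineName', 'Memory']
```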