Spark Dataframe select based on column index
Question
How do I select all the columns of a dataframe that have certain indexes in Scala?
For example, if a dataframe has 100 columns and I want to extract only columns (10, 12, 13, 14, 15), how do I do that?
The following selects all the columns from dataframe df whose names appear in the Array colNames:
df = df.select(colNames.head,colNames.tail: _*)
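The head/tail split is needed because Spark's String-based select has the varargs signature select(col: String, cols: String*). A minimal pure-Scala sketch, using a hypothetical stand-in function with the same shape:

```scala
// Hypothetical stand-in for Spark's select(col: String, cols: String*):
// the first column is a required argument, the rest are varargs.
def select(col: String, cols: String*): Seq[String] = col +: cols

val colNames = Array("name", "age", "city")

// colNames.head supplies the required first argument;
// colNames.tail: _* expands the remaining elements into the varargs slot.
val result = select(colNames.head, colNames.tail: _*)

assert(result == Seq("name", "age", "city"))
```

Passing the whole array as `colNames: _*` alone would not compile, since the required first parameter would be missing.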
If there is a similar array, colNos, which holds the column indexes:
colNos = Array(10,20,25,45)
how do I transform the above df.select to fetch only the columns at those specific indexes?
Answer
You can map over columns:
import org.apache.spark.sql.functions.col
df.select(colNos map df.columns map col: _*)
或:
df.select(colNos map (df.columns andThen col): _*)
或:
df.select(colNos map (col _ compose df.columns): _*)
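These three forms agree because a Scala Array[String] can be used as a Function1[Int, String] (via its implicit Seq wrapper), so it composes with any String => T function. A pure-Scala sketch, using a hypothetical wrap function in place of org.apache.spark.sql.functions.col:

```scala
val columns = Array("_1", "_2", "_3", "_4", "_5", "_6")
val colNos  = Seq(0, 3, 5)

// Hypothetical stand-in for col: turns a column name into some wrapped value.
val wrap: String => String = name => s"col($name)"

val a = colNos map columns map wrap            // map twice: index -> name -> wrapped
val b = colNos map (columns andThen wrap)      // compose left-to-right
val c = colNos map (wrap compose columns)      // compose right-to-left

// All three produce the same result: Seq("col(_1)", "col(_4)", "col(_6)")
assert(a == b && b == c)
```

andThen and compose are just the two directions of Function1 composition, so which variant you pick is purely a matter of taste.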
All the methods shown above are equivalent and impose no performance penalty. The mapping
colNos map df.columns
is just a local Array access (constant-time access for each index), and choosing between the String-based and Column-based variants of select doesn't affect the execution plan:
val df = Seq((1, 2, 3, 4, 5, 6)).toDF
val colNos = Seq(0, 3, 5)
df.select(colNos map df.columns map col: _*).explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]
df.select("_1", "_4", "_6").explain
== Physical Plan ==
LocalTableScan [_1#46, _4#49, _6#51]