How to select all columns that start with a common label
Question
I have a DataFrame in Spark 1.6 and want to select just some of its columns. The column names are like:
colA, colB, colC, colD, colE, colF-0, colF-1, colF-2
I know I can do this to select specific columns:
df.select("colA", "colB", "colE")
but how do I select, say, "colA", "colB" and all the colF-* columns at once? Is there a way to do this like in Pandas?
Recommended answer
First grab the column names with df.columns, then filter down to just the names you want with .filter(_.startsWith("colF")). This gives you an Array of Strings. But select takes select(String, String*). Luckily, the overload of select for Columns is select(Column*), so convert the Strings into Columns with .map(df(_)), and finally turn the Array of Columns into a varargs argument with : _*.
df.select(df.columns.filter(_.startsWith("colF")).map(df(_)) : _*).show
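The chained call above can be unpacked into its individual steps. A minimal sketch of the name-filtering part, using plain Scala collections and an illustrative column list matching the question's schema (only the commented-out last line actually needs a Spark DataFrame):

```scala
// Illustrative column names, as in the question.
val columnNames = Array("colA", "colB", "colC", "colD", "colE",
                        "colF-0", "colF-1", "colF-2")

// Step 1: filter down to just the names we want.
val fNames = columnNames.filter(_.startsWith("colF"))
// fNames: Array("colF-0", "colF-1", "colF-2")

// Step 2 (inside Spark): map each String name to a Column with df(_),
// then expand the Array into varargs with : _*
// df.select(fNames.map(df(_)) : _*).show
```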
This filter could be made more complex (just as in Pandas). It is, however, a rather ugly solution (IMO):
df.select(df.columns.filter(x => (x.equals("colA") || x.startsWith("colF"))).map(df(_)) : _*).show
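If the predicate grows beyond a couple of clauses, a regular expression can express it more compactly. A sketch of that alternative (the pattern below is an assumption for illustration, not from the original answer):

```scala
val columnNames = Array("colA", "colB", "colC", "colF-0", "colF-1")

// Keep "colA" plus every colF-* column with a single regex,
// instead of chaining equals/startsWith clauses.
val wanted = columnNames.filter(_.matches("colA|colF-.*"))
// wanted: Array("colA", "colF-0", "colF-1")

// In Spark: df.select(wanted.map(df(_)) : _*).show
```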
If the list of other columns is fixed, you can also merge a fixed array of column names with the filtered array.
df.select((Array("colA", "colB") ++ df.columns.filter(_.startsWith("colF"))).map(df(_)) : _*).show