Select Specific Columns from Spark DataFrame


Problem Description

I have loaded CSV data into a Spark DataFrame.

I need to slice this dataframe into two different dataframes, where each one contains a set of columns from the original dataframe.

How do I select a subset of columns from a Spark dataframe?

Solution

If you want to split your dataframe into two different ones, do two selects on it, each with the columns you want.

 val sourceDf = spark.read.csv(...)
 // Each select produces a new dataframe with only the named columns.
 val df1 = sourceDf.select("first column", "second column", "third column")
 val df2 = sourceDf.select("fourth column", "fifth column")

Note that this of course means that sourceDf would be evaluated twice, so if it fits into distributed memory and you use most of the columns across both dataframes, it can be a good idea to cache it. If it has many extra columns that you don't need, you can do a select on it first to keep only the columns you will need, so the cache doesn't hold all that extra data in memory.
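
As a rough sketch of that select-then-cache idea (the CSV path and the column names a, b, c, d here are made up for illustration):

 // Hypothetical CSV with columns a, b, c, d plus extras we don't need.
 val sourceDf = spark.read.option("header", "true").csv("/path/to/data.csv")

 // Keep only the columns either slice will use before caching, so the
 // unneeded columns never take up space in distributed memory.
 val trimmedDf = sourceDf.select("a", "b", "c", "d").cache()

 val df1 = trimmedDf.select("a", "b")
 val df2 = trimmedDf.select("c", "d")

Both df1 and df2 then read from the cached, trimmed data instead of re-scanning the CSV.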
