Using groupBy in Spark and getting back to a DataFrame


Problem description

I am having difficulty working with DataFrames in Spark with Scala. I have a DataFrame from which I want to extract a column of unique entries, but when I use groupBy I don't get a DataFrame back.

For example, I have a DataFrame called logs that has the following form:

machine_id  | event     | other_stuff
 34131231   | thing     |   stuff
 83423984   | notathing | notstuff
 34131231   | thing     | morestuff

I would like the unique machine ids where event is thing, stored in a new DataFrame, so that I can do some filtering on them. Using

val machineId = logs
  .where($"event" === "thing")
  .select("machine_id")
  .groupBy("machine_id")

I get a val of GroupedData back, which is a pain in the butt to use (or I don't know how to use this kind of object properly). Having got this list of unique machine ids, I then want to use it to filter another DataFrame and extract all events for individual machine ids.

I can see I'll want to do this kind of thing fairly regularly, and the basic workflow is:


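  1. Extract unique ids from a log table.

  2. Use the unique ids to extract all events for a particular id.

  3. Run some sort of analysis on the data that has been extracted.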

It's the first two steps I would appreciate some guidance with here.

I appreciate this example is kind of contrived, but hopefully it explains what my issue is. It may be that I don't know enough about GroupedData objects or (as I'm hoping) that I'm missing something in DataFrames that makes this easy. I'm using Spark 1.5 built on Scala 2.10.4.

Thanks

Recommended answer

Just use distinct instead of groupBy:

val machineId = logs.where($"event"==="thing").select("machine_id").distinct

This is equivalent to the SQL:

SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
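
As a rough, self-contained sketch of the first two workflow steps from the question, the snippet below rebuilds the sample logs table, takes the distinct machine ids, and then uses them to filter a DataFrame with a semi join. Several things here are assumptions and not part of the original answer: it uses SparkSession, which needs Spark 2.0 or later (on Spark 1.5 the equivalent entry point would be sqlContext and its implicits), the join-based filter is one possible approach for step 2, the name eventsForMachines is made up for illustration, and logs itself stands in for the "other" DataFrame.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.0+ with a local master; on Spark 1.5 use sqlContext instead.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("distinct-sketch")
  .getOrCreate()
import spark.implicits._

// Sample rows copied from the table in the question.
val logs = Seq(
  (34131231L, "thing", "stuff"),
  (83423984L, "notathing", "notstuff"),
  (34131231L, "thing", "morestuff")
).toDF("machine_id", "event", "other_stuff")

// Step 1: unique machine ids where event is "thing". The result is a DataFrame.
val machineId = logs
  .where($"event" === "thing")
  .select("machine_id")
  .distinct()

// Step 2 (assumed approach): keep only the rows whose machine_id appears in machineId.
// A left semi join returns columns from the left side only.
val eventsForMachines = logs.join(machineId, Seq("machine_id"), "left_semi")

machineId.show()
eventsForMachines.show()
```

With the sample rows, machineId should hold the single id 34131231 and eventsForMachines the two rows for that machine.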

GroupedData is not intended to be used directly. It provides a number of methods, of which agg is the most general, that apply aggregate functions and convert the result back to a DataFrame. In terms of SQL, what you have after where and groupBy is equivalent to something like this:

SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id

where ... has to be provided by agg or an equivalent method.
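
To illustrate how agg fills in the ... and turns the grouped result back into a DataFrame, here is a minimal sketch that counts the matching events per machine. The choice of count as the aggregate and the names perMachine and n_events are illustrative assumptions, as is the use of SparkSession (Spark 2.0 or later; the question itself is on Spark 1.5, where sqlContext plays that role).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}

// Assumption: Spark 2.0+ with a local master; on Spark 1.5 use sqlContext instead.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("agg-sketch")
  .getOrCreate()
import spark.implicits._

// Sample rows copied from the table in the question.
val logs = Seq(
  (34131231L, "thing", "stuff"),
  (83423984L, "notathing", "notstuff"),
  (34131231L, "thing", "morestuff")
).toDF("machine_id", "event", "other_stuff")

// groupBy alone yields a grouped object; agg applies an aggregate
// and returns a DataFrame with one row per machine_id.
val perMachine = logs
  .where($"event" === "thing")
  .groupBy("machine_id")
  .agg(count(lit(1)).as("n_events"))

perMachine.show()
```

This corresponds to SELECT machine_id, COUNT(1) AS n_events FROM logs WHERE event = 'thing' GROUP BY machine_id. For a plain row count, calling count() on the grouped object is a shorter alternative.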
