Using groupBy in Spark and getting back to a DataFrame
Question
I'm having difficulty working with DataFrames in Spark with Scala. I have a DataFrame from which I want to extract a column of unique entries, but when I use groupBy I don't get a DataFrame back.
For example, I have a DataFrame called logs that has the following form:
machine_id | event | other_stuff
34131231 | thing | stuff
83423984 | notathing | notstuff
34131231 | thing | morestuff
and I would like the unique machine ids where event is thing, stored in a new DataFrame, so that I can use them for filtering. Using
val machineId = logs
.where($"event" === "thing")
.select("machine_id")
.groupBy("machine_id")
I get a GroupedData value back, which is awkward to use (or I don't know how to use this kind of object properly). Having got this list of unique machine ids, I then want to use it to filter another DataFrame and extract all events for individual machine ids.
I can see I'll want to do this kind of thing fairly regularly, and the basic workflow is:
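- Extract the unique ids from the logs table.
- Use the unique ids to extract all events for a particular id.
- Run some kind of analysis on the data that has been extracted.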
It's the first two steps I would appreciate some guidance with here.
I appreciate this example is somewhat contrived, but hopefully it explains what my issue is. It may be that I don't know enough about GroupedData objects, or (as I'm hoping) I'm missing something in DataFrames that makes this easy. I'm using Spark 1.5 built on Scala 2.10.4.
Thanks
Recommended answer
Just use distinct instead of groupBy:
val machineId = logs.where($"event"==="thing").select("machine_id").distinct
This is equivalent to the SQL:
SELECT DISTINCT machine_id FROM logs WHERE event = 'thing'
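As a rough sketch of how the distinct result could feed the second step of the workflow (extracting all events for those ids), one option is an inner join on machine_id. The join is not part of the answer above, only one possible way to do the filtering. The sketch assumes Spark 2.x or later with a local SparkSession (on Spark 1.5 an SQLContext plays the same role), and the sample rows and the names spark and events are made up for illustration. It is written as top-level statements for spark-shell or a Scala script.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x+ running locally; not the asker's actual setup.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("distinct-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data mirroring the logs table from the question.
val logs = Seq(
  ("34131231", "thing", "stuff"),
  ("83423984", "notathing", "notstuff"),
  ("34131231", "thing", "morestuff")
).toDF("machine_id", "event", "other_stuff")

// Step 1: unique machine ids where event is "thing"; the result is a DataFrame.
val machineId = logs
  .where($"event" === "thing")
  .select("machine_id")
  .distinct()

// Step 2: keep only the rows of another DataFrame whose machine_id
// appears in machineId. Here the "other" DataFrame is logs itself,
// purely for illustration.
val events = logs.join(machineId, Seq("machine_id"))
```

Because machineId holds each id only once, the inner join does not duplicate rows in the other DataFrame.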
GroupedData is not intended to be used directly. It provides a number of methods, of which agg is the most general, that apply aggregate functions and convert the result back to a DataFrame. In terms of SQL, what you have after where and groupBy is equivalent to something like this
SELECT machine_id, ... FROM logs WHERE event = 'thing' GROUP BY machine_id
where ... has to be provided by agg or an equivalent method.
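To make the role of agg concrete, here is a minimal sketch in which a count fills in the "..." of the query above, so that the grouped object becomes a DataFrame again. It assumes Spark 2.x or later with a local SparkSession, where the grouped type is called RelationalGroupedDataset rather than GroupedData. The sample rows and the column name n_events are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

// Assumption: Spark 2.x+ running locally; not the asker's actual setup.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("agg-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data mirroring the logs table from the question.
val logs = Seq(
  ("34131231", "thing", "stuff"),
  ("83423984", "notathing", "notstuff"),
  ("34131231", "thing", "morestuff")
).toDF("machine_id", "event", "other_stuff")

// groupBy alone returns a grouped object; agg supplies the aggregate
// and yields a DataFrame. Roughly:
//   SELECT machine_id, COUNT(*) AS n_events
//   FROM logs WHERE event = 'thing' GROUP BY machine_id
val thingCounts = logs
  .where($"event" === "thing")
  .groupBy("machine_id")
  .agg(count("*").as("n_events"))
```

If only the ids are needed and no aggregate, distinct remains the simpler choice.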