Selecting fields after grouping in Pig

Question

There's probably something very trivial that I'm missing, but I just can't get this to work. I have a "movies" object, with title, actor, year and role. Now what I want, is to have results with the title, along with a nested bag containing actor/role pairs.

If I just do group movies by title, I end up with results like (title, {movie objects}) which would be perfect, except that the title and year also appear in the movie objects there. I want just the actor and role.

I also tried foreach movie_groups generate group, movies.actor, movies.role but then I end up with (title, {all actors}, {all roles}) which is obviously wrong.

In SQL this would be so trivial that I can't help but feel incredibly stupid for not being able to figure this out. Would anyone have a suggestion?

Recommended answer

It would be helpful to see the format of movies, but I'm assuming it is something like this:

MovieTitle1 Year1 Actor1 Role1
MovieTitle1 Year2 Actor2 Role2
etc.

In that case, I would do it like this:

-- Group by title, then keep only the (actor, role) pair from each grouped tuple.
result = FOREACH (GROUP movies BY title)
         GENERATE FLATTEN(group), movies.(actor, role) AS actors;

Also, you mention that the movies contain the year as well. If you do not need that field it might be worthwhile to project only the fields that you need (title, actor, role) first.
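The "project first" suggestion could look roughly like the sketch below. It is only an illustration under stated assumptions: the file name movies.tsv, the tab delimiter, the column types, and the alias names slim and grouped are all hypothetical and do not come from the question.

```pig
-- Assumption: a tab-separated file with columns title, year, actor, role.
-- 'movies.tsv' is a hypothetical path; adjust the loader to the real data.
movies = LOAD 'movies.tsv' USING PigStorage('\t')
         AS (title:chararray, year:int, actor:chararray, role:chararray);

-- Drop the unused year column before grouping so less data is shuffled.
slim = FOREACH movies GENERATE title, actor, role;

-- Each group is (group, {(title, actor, role), ...}).
grouped = GROUP slim BY title;

-- Project the bag down to (actor, role) pairs; the title is already in 'group'.
result = FOREACH grouped GENERATE group AS title, slim.(actor, role) AS actors;

DUMP result;
```

Even after the early projection, the grouped bag still carries title in every tuple, so the slim.(actor, role) projection is still needed to get only the actor/role pairs.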
