How to delete repeated rows of data in Pig

Question

"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 137843120 3014479 1602383 817582
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 125431369 2912715 1545018 807558
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 113876217 2811217 1470387 787174
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 100911567 2656678 1353655 682890
"Marvel Studios' Avengers: Infinity War Official Trailer" 89930713 2606665 53011 347982
"Marvel Studios' Avengers: Infinity War Official Trailer" 87450245 2584675 52176 341571
"Marvel Studios' Avengers: Infinity War Official Trailer" 84281319 2555414 51008 339708
"Marvel Studios' Avengers: Infinity War Official Trailer" 80360459 2513103 49170 335920
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 75969469 2251826 1127811 827755
"Marvel Studios' Avengers: Infinity War Official Trailer" 74789251 2444960 46172 330710
"Marvel Studios' Avengers: Infinity War Official Trailer" 66637636 2331359 41154 316185
"Marvel Studios' Avengers: Infinity War Official Trailer" 56367282 2157741 34078 303178
"YouTube Rewind: The Shape of 2017 | #YouTubeRewind" 52611730 1891822 884963 702784
"To Our Daughter" 51243149 0 0 0
"To Our Daughter" 48635732 0 0 0

The data above has a "title" column followed by four numeric columns: views, likes, dislikes, and comment_count.

How can I filter out the repeated data? I want to remove the rows that share the same "title" and keep only the row with the highest views.

Recommended answer

If you want to retain all fields of the record corresponding to the maximum likes, you would have to do something like this:

dataAll = LOAD 'path' USING PigStorage('\t') AS (title:chararray, views:long, likes:long, dislikes:long, comment_count:long);

--group the data by title so that all records belonging to a title fall into a bag in the same record
dataGrouped = GROUP dataAll BY title;

--Using a nested foreach, order the contents of the bag by likes and pick the top record
dataDeduped = FOREACH dataGrouped {
                 sortedByLikes = ORDER dataAll BY likes DESC;
                 maxLikesRecord = LIMIT sortedByLikes 1;
                 GENERATE FLATTEN(maxLikesRecord);
              };

STORE dataDeduped INTO 'outputPath' USING PigStorage('\t');
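The question asks for the row with the highest views rather than the highest likes. A minimal sketch of the same approach ordered by views might look like the following. It assumes the tab-separated layout and the placeholder paths 'path' and 'outputPath' used above; the alias names (videos, byTitle, topByViews) are illustrative and not part of the original answer.

```pig
-- Assumption: input is tab-separated with the same five fields as in the answer above
videos = LOAD 'path' USING PigStorage('\t') AS (title:chararray, views:long, likes:long, dislikes:long, comment_count:long);

-- One output record per title, holding a bag of every row with that title
byTitle = GROUP videos BY title;

-- Within each bag, sort by views (highest first) and keep only the first row
topByViews = FOREACH byTitle {
                 sortedByViews = ORDER videos BY views DESC;
                 maxViewsRecord = LIMIT sortedByViews 1;
                 GENERATE FLATTEN(maxViewsRecord);
             };

STORE topByViews INTO 'outputPath' USING PigStorage('\t');
```

If two rows for the same title have identical views, LIMIT keeps just one of them, and which one is not guaranteed.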

A nested FOREACH comes in handy in situations like this. Read more about it here: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html (search for "nested foreach" on that page).
