通过维护顺序来汇总重复记录,并且还包括重复记录 [英] Aggregate duplicate records by maintaining the order and also include duplicate records

查看:63
本文介绍了通过维护顺序来汇总重复记录,并且还包括重复记录的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试解决一个有趣的问题,只需对诸如总和,计数等的聚合进行groupBy很容易.但是这个问题稍有不同.让我解释一下:

I am trying to solve an interesting problem, it's easy to just do a groupBy for aggregation like sum, count etc. But this problem is slightly different. Let me explain:

这是我的元组列表:

val repeatSmokers: List[(String, String, String, String, String, String)] =
  List(
    ("ID76182", "sachin", "kita MR.", "56308", "1990", "300"),
    ("ID76182", "KOUN", "Jana MR.", "56714", "1990", "100"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "255"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "110"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "20"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "6750"),
    ("ID76182", "DOWNES", "RYAN", "47542", "1990", "2090"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "200"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "280"),
    ("ID76182", "JAMES", "JIM", "30548", "1990", "300"),
    ("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", "2600"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "370"),
    ("ID76182", "COOPER", "ANADA", "45873", "1990", "2600"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "2600"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "256")
  )

这些记录的架构为(Idnumber, name, test_code, year, amount).从这些元素中,我只希望重复记录,我们在上面列表中定义唯一组合的方式是采用(sachin, kita MR.,56308)名称和test_code组合.这意味着,如果重复使用相同的名称和test_code,则为重复吸烟者记录.为简单起见,您只能将test_code假定为唯一值,如果重复,则可以说它是重复吸烟者记录.

The schema for these records are (Idnumber, name, test_code, year, amount). From these elements, I want only repeated records, the way we define unique combination in the above list is by taking (sachin, kita MR.,56308) name and test_code combination. Which means if the same name and test_code is repeating it's a repeat smoker record. For simplicity you can assume only test_code as unique value, if it repeats you can say it's a repeat smoker record.

下面是确切的输出:

below is the exact output:

ID76182,27539,1990,255,1 
ID76182,27539,1990,365,2
ID76182,45873,1990,20,1 
ID76182,45873,1990,6770,2 
ID76182,45873,1990,9370,3
ID76182,49337,1990,200,1
ID76182,49337,1990,570,2
ID76182,47542,1990,280,1
ID76182,47542,1990,536,2

最后,这里具有挑战性的部分是要保持第二次重复吸烟者记录的顺序和总和,并增加发生次数.

Finally, the challenging part here is to maintain the order and aggregate sum on every second repeat smoker record and also add occurences.

例如:此记录架构为:ID76182,47542,1990,536,2

For example: this record schema is: ID76182,47542,1990,536,2

IDNumber,测试代码,年份,金额,发生次数

IDNumber,test_code,year,amount,occurences

因为它发生了两次,所以我们在上面看到了2.

since it occured twice we see 2 above.

注意:

输出可以是任何集合的列表,但其格式应与我上面提到的相同

output can be a list are any collection but it should be in the same format I mentioned above

推荐答案

所以这是Scala中的一些代码,但这实际上是只是用Scala编写的Java代码:

So here is some code in Scala but it is really a Java code just written in Scala:

import java.util.ArrayList
import java.util.LinkedHashMap
import scala.collection.convert._


type RawRecord = (String, String, String, String, String, String)
type Record = (String, String, String, String, Int, Int)
type RecordKey = (String, String, String, String)
type Output = (String, String, String, String, Int, Int, Int)
val keyF: Record => RecordKey = r => (r._1, r._2, r._3, r._4)
val repeatSmokersRaw: List[RawRecord] =
  List(
    ("ID76182", "sachin", "kita MR.", "56308", "1990", "300"),
    ("ID76182", "KOUN", "Jana MR.", "56714", "1990", "100"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "255"),
    ("ID76182", "GANGS", "SKILL", "27539", "1990", "110"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "20"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "6750"),
    ("ID76182", "DOWNES", "RYAN", "47542", "1990", "2090"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "200"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "280"),
    ("ID76182", "JAMES", "JIM", "30548", "1990", "300"),
    ("ID76182", "KIMMELSHUE", "RUTH", "55345", "1990", "2600"),
    ("ID76182", "DRAGON", "WARS", "49337", "1990", "370"),
    ("ID76182", "COOPER", "ANADA", "45873", "1990", "2600"),
    ("ID76182", "SEMI", "GAUTAM A MR.", "45873", "1990", "2600"),
    ("ID76182", "HULK", "PAIN MR.", "47542", "1990", "256")
  )
val repeatSmokers = repeatSmokersRaw.map(r => (r._1, r._2, r._3, r._4, r._5.toInt, r._6.toInt))

val acc = new LinkedHashMap[RecordKey, (util.ArrayList[Output], Int, Int)]
repeatSmokers.foreach(r => {
  val key = keyF(r)
  var cur = acc.get(key)
  if (cur == null) {
    cur = (new ArrayList[Output](), 0, 0)
  }
  val nextCnt = cur._2 + 1
  val sum = cur._3 + r._6
  val output = (r._1, r._2, r._3, r._4, r._5, sum, nextCnt)
  cur._1.add(output)
  acc.put(key, (cur._1, nextCnt, sum))
})
val result = acc.values().asScala.filter(p => p._2 > 1).flatMap(p => p._1.asScala)
// or if you are clever you can merge filter and flatMap as
// val result = acc.values().asScala.flatMap(p => if (p._1.size > 1) p._1.asScala else Nil)

println(result.mkString("\n"))

它打印

(ID76182,GANGS,SKILL,27539,1990,255,1)
(ID76182,GANGS,SKILL,27539,1990,365,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,20,1)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,6770,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,9370,3)
(ID76182,DRAGON,WARS,49337,1990,200,1)
(ID76182,DRAGON,WARS,49337,1990,570,2)
(ID76182,HULK,PAIN MR.,47542,1990,280,1)
(ID76182,HULK,PAIN MR.,47542,1990,536,2)

(ID76182,GANGS,SKILL,27539,1990,255,1)
(ID76182,GANGS,SKILL,27539,1990,365,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,20,1)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,6770,2)
(ID76182,SEMI,GAUTAM A MR.,45873,1990,9370,3)
(ID76182,DRAGON,WARS,49337,1990,200,1)
(ID76182,DRAGON,WARS,49337,1990,570,2)
(ID76182,HULK,PAIN MR.,47542,1990,280,1)
(ID76182,HULK,PAIN MR.,47542,1990,536,2)

此代码中的主要技巧是使用Java的 LinkedHashMap 作为累加器集合,因为它保留了插入顺序.另一个技巧是在内部存储一些列表(因为无论如何我都使用Java集合,所以我决定对内部累加器使用ArrayList,但是您可以使用任何喜欢的东西).因此,这个想法是建立一个==>吸烟者名单的地图,并另外为每个密钥库存储当前计数器和当前总和,以便可以将汇总的"吸烟者添加到名单中.构建映射时,请经过筛选以筛选出那些尚未累积至少2条记录的键,然后将列表映射转换为单个列表(这是使用LinkedHashMap的重要点,因为插入顺序在迭代过程中保留)

The main trick in this code is to use Java's LinkedHashMap as the accumulator collection because it preserves the order of insertion. Additional trick is to to store some lists inside (as I use Java-collections anyway I decided to use ArrayList for the inner accumulator but you can use anything you like). So the idea is to build a map of key => list of smokers and additionally for each key store current counter and current sum so "aggregated" smokers can be added to the list. When the map is built, go through it to filter out those keys that have not accumulated at least 2 records and then convert a map of lists to a single list (and this is the point where it is important that LinkedHashMap is used because insertion order is preserved during iteration)

这篇关于通过维护顺序来汇总重复记录,并且还包括重复记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆