Detecting repeating consecutive values in large datasets with Spark


Problem description

Cheerz,

Recently I have been trying out Spark, and so far I have observed quite interesting results, but currently I am stuck with the famous groupByKey OOM problem. Basically, the job searches a large dataset for periods where the measured value increases consecutively at least N times. I managed to get rid of the problem by writing the results to disk, but the application now runs much slower (which is expected due to the disk IO). Now the question: is there any other memory-efficient strategy where I can run over the sorted data and check whether adjacent values (for the same key) are increasing in at least N consecutive observations, without resorting to the groupByKey method?

I have designed an algorithm to do it with reduceByKey, but there is one problem: reduce seems to ignore the data ordering and yields completely wrong results at the end.

Any ideas are appreciated, thanks.

Recommended answer

There are a few ways you can approach this problem:

  • repartitionAndSortWithinPartitions with a custom partitioner and ordering:

  • keyBy (name, timestamp) pairs
  • create a custom partitioner which considers only the name
  • repartitionAndSortWithinPartitions using the custom partitioner
  • use mapPartitions to iterate over the data and yield matching sequences (a sketch follows this list)
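Below is a minimal sketch of this approach, assuming each record is a hypothetical Measurement(name, timestamp, value) case class; NamePartitioner, increasingRuns, minLen and parts are illustrative names, not anything defined in the original answer:

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical record type: one observation of a named series.
case class Measurement(name: String, timestamp: Long, value: Double)

// Partitioner that looks only at the name part of a (name, timestamp) key,
// so all observations of one series land in the same partition.
class NamePartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case (name: String, _) => ((name.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

def increasingRuns(rdd: RDD[Measurement], minLen: Int, parts: Int): RDD[Seq[Measurement]] =
  rdd
    .keyBy(m => (m.name, m.timestamp))
    // Sorts by (name, timestamp) inside each partition; no global sort is needed.
    .repartitionAndSortWithinPartitions(new NamePartitioner(parts))
    .mapPartitions { iter =>
      // Single pass over the sorted stream: grow the current run while the value
      // keeps increasing for the same name, emit it once it breaks.
      val runs = scala.collection.mutable.ArrayBuffer.empty[Seq[Measurement]]
      var run = List.empty[Measurement] // current run, most recent element first
      def flush(): Unit = {
        if (run.length >= minLen) runs += run.reverse
        run = Nil
      }
      iter.foreach { case (_, m) =>
        run match {
          case prev :: _ if prev.name == m.name && m.value > prev.value => run = m :: run
          case _ => flush(); run = List(m)
        }
      }
      flush()
      runs.iterator
    }
```

Because the partitioner keys only on the name, a whole series never straddles two partitions, so no cross-partition stitching is required; only the current run and the matches found so far are buffered, never the whole group.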

  • sortByKey - this is similar to the first solution but provides higher granularity at the cost of additional post-processing:

  • keyBy (name, timestamp) pairs
  • sortByKey
  • process individual partitions using mapPartitionsWithIndex, keeping track of the leading / trailing patterns for each partition
  • adjust the final results to include patterns which span more than one partition (see the sketch after this list)
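A rough sketch of the per-partition pass, reusing the hypothetical Measurement type from the sketch above; PartitionSummary, maximalRuns and summarize are illustrative names. Since sortByKey can split one series across partition boundaries, every partition also reports the run fragments touching its edges, and a small follow-up step on the driver stitches adjacent summaries together:

```scala
import org.apache.spark.rdd.RDD

// Summary emitted by each non-empty partition.
case class PartitionSummary(
    index: Int,
    leading: Vector[Measurement],          // fragment at the start; may continue in partition index - 1
    trailing: Vector[Measurement],         // fragment at the end; may continue in partition index + 1
    complete: Vector[Vector[Measurement]]) // qualifying runs fully contained in this partition

// Split a time-ordered slice into maximal runs of strictly increasing values per name.
def maximalRuns(values: Vector[Measurement]): Vector[Vector[Measurement]] =
  values.drop(1).foldLeft(values.take(1).map(Vector(_))) { (acc, m) =>
    val prev = acc.last.last
    if (prev.name == m.name && m.value > prev.value) acc.init :+ (acc.last :+ m)
    else acc :+ Vector(m)
  }

def summarize(rdd: RDD[Measurement], minLen: Int): RDD[PartitionSummary] =
  rdd
    .keyBy(m => (m.name, m.timestamp))
    .sortByKey()
    .mapPartitionsWithIndex { (idx, iter) =>
      // Materializes one partition for brevity; a streaming fold would avoid this.
      val runs = maximalRuns(iter.map(_._2).toVector)
      if (runs.isEmpty) Iterator.empty
      else Iterator.single(PartitionSummary(
        idx, runs.head, runs.last,
        runs.drop(1).dropRight(1).filter(_.length >= minLen)))
    }

// Post-processing idea: collect the (small) summaries, order them by index, and
// merge trailing(i) with leading(i + 1) whenever the boundary still increases,
// so runs that span several partitions are recovered.
```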

  • create fixed-size windows over sorted data using sliding from mllib.rdd.RDDFunctions:

  • sortBy (name, timestamp)
  • create the sliding RDD and filter out windows covering more than one name
  • check whether any window contains the desired pattern (a sketch follows this list)
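A short sketch of this variant, again assuming the hypothetical Measurement record; n is the required run length (assumed to be at least 2) and increasingWindows is an illustrative name:

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._ // adds sliding() to plain RDDs
import org.apache.spark.rdd.RDD

def increasingWindows(rdd: RDD[Measurement], n: Int): RDD[Array[Measurement]] =
  rdd
    .sortBy(m => (m.name, m.timestamp))
    .sliding(n)                                   // fixed-size windows of n consecutive rows
    .filter { w =>
      w.forall(_.name == w.head.name) &&          // drop windows covering more than one name
      w.sliding(2).forall { case Array(a, b) => b.value > a.value }
    }
```

Each window is only n elements long, so nothing close to a whole group is ever buffered at once, but the overlapping windows do increase the shuffled data volume roughly n-fold.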

