Detecting repeating consecutive values in large datasets with Spark
Problem description
Cheerz,
Recently I have been trying out Spark, and so far I have observed quite interesting results, but currently I am stuck with the famous groupByKey OOM problem. Basically, the job searches large datasets for periods where a measured value increases consecutively at least N times. I managed to get rid of the problem by writing the results to disk, but the application now runs much slower (which is expected, due to the disk IO). Now the question: is there another memory-efficient strategy where I can run over the sorted data and check whether adjacent values (for the same key) are increasing in at least N consecutive observations, without resorting to the groupByKey method?
I have designed an algorithm to do this with reduceByKey, but there is one problem: reduce seems to ignore the data ordering and yields completely wrong results at the end.
Any ideas are appreciated, thanks.
Recommended answer
There are a few ways you can approach this problem:
- repartitionAndSortWithinPartitions with a custom partitioner and ordering:
  - keyBy (name, timestamp) pairs
  - create a custom partitioner which considers only the name
  - repartitionAndSortWithinPartitions using the custom partitioner
  - use mapPartitions to iterate over the data and yield matching sequences
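As a rough illustration of the last step, here is a minimal plain-Python sketch of the logic mapPartitions would apply to each sorted partition. The tuple layout (name, timestamp, value) and the helper name `find_increasing_runs` are assumptions, and "N times" is read here as N consecutive increases, i.e. N + 1 data points:

```python
def find_increasing_runs(records, n):
    """Scan records sorted by (name, timestamp) and yield every run
    where the value increases at least n consecutive times for one name."""
    run = []  # current candidate run of (name, timestamp, value) tuples
    for name, ts, value in records:
        if run and run[-1][0] == name and value > run[-1][2]:
            run.append((name, ts, value))  # run keeps growing
        else:
            if len(run) >= n + 1:  # n increases -> n + 1 points
                yield run
            run = [(name, ts, value)]  # start a fresh candidate run
    if len(run) >= n + 1:  # flush the final run
        yield run
```

Because each partition holds all rows for a name in timestamp order after the repartition-and-sort step, this single forward scan never needs to materialize a whole group in memory.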
- sortByKey - similar to the first solution but provides higher granularity at the cost of additional post-processing:
  - keyBy (name, timestamp) pairs
  - sortByKey
  - process individual partitions using mapPartitionsWithIndex, keeping track of the leading / trailing patterns for each partition
  - adjust the final results to include patterns which span more than one partition
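The boundary adjustment in the last step could look roughly like this in plain Python. `stitch_boundary` is a hypothetical driver-side helper that merges the trailing run of one partition with the leading run of the next (a run of n increases contains n + 1 points):

```python
def stitch_boundary(trailing, leading, n):
    """Merge the trailing run of partition i with the leading run of
    partition i + 1 when both belong to the same name and the value
    keeps increasing across the partition boundary.  Returns the
    merged run if it reaches at least n increases, else None."""
    if (trailing and leading
            and trailing[-1][0] == leading[0][0]    # same name
            and leading[0][2] > trailing[-1][2]):   # still increasing
        merged = trailing + leading
        if len(merged) >= n + 1:
            return merged
    return None
```

Runs that were too short inside a single partition can still qualify once stitched, which is exactly the post-processing cost this variant trades for its finer-grained sort.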
- create fixed-size windows over the sorted data using sliding from mllib.rdd.RDDFunctions:
  - sortBy (name, timestamp)
  - create the sliding RDD and filter out windows which cover more than one name
  - check whether any window contains the desired pattern
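A plain-Python sketch of that window check (in Spark the windows themselves would come from sliding; the helper name and the (name, timestamp, value) layout are assumptions, with a window of n + 1 points covering n increases):

```python
def matching_windows(records, n):
    """Slide a fixed window of n + 1 points over data sorted by
    (name, timestamp); keep windows that stay within a single name
    and whose values strictly increase n times."""
    size = n + 1
    for i in range(len(records) - size + 1):
        window = records[i:i + size]
        if len({name for name, _, _ in window}) > 1:
            continue  # window covers more than one name: discard
        if all(window[j + 1][2] > window[j][2] for j in range(n)):
            yield window
```

Note that a fixed window of size n + 1 finds runs of exactly n increases; longer qualifying periods show up as several overlapping matching windows and would need to be coalesced afterwards.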