Detecting repeating consecutive values in large datasets with Spark


Problem description

Cheerz,

Recently I have been trying out Spark, and so far I have observed quite interesting results, but currently I am stuck with the famous groupByKey OOM problem. Basically, the job searches a large dataset for periods where the measured value increases consecutively at least N times. I managed to get rid of the problem by writing the results to disk, but the application now runs much slower (which is expected due to the disk IO). Now the question: is there any other memory-efficient strategy where I can run over the sorted data and check whether adjacent values (for the same key) are increasing in at least N consecutive observations, without resorting to the groupByKey method?

I have designed an algorithm to do it with reduceByKey, but there is one problem: reduce seems to ignore the data ordering and yields completely wrong results at the end.

Any ideas are appreciated, thanks.

Recommended answer

There are a few ways you can approach this problem:

  • repartitionAndSortWithinPartitions with a custom partitioner and ordering:

  • keyBy (name, timestamp) pairs
  • create a custom partitioner which considers only the name
  • repartitionAndSortWithinPartitions using the custom partitioner
  • use mapPartitions to iterate over the data and yield matching sequences (a sketch follows this list)
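Below is a minimal sketch of this approach, assuming each record is a hypothetical Measurement(name, timestamp, value) case class; NamePartitioner, increasingRuns, minLen and parts are illustrative names, not anything defined in the original answer:

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical record type: one observation of a named series.
case class Measurement(name: String, timestamp: Long, value: Double)

// Partitioner that looks only at the name part of a (name, timestamp) key,
// so all observations of one series land in the same partition.
class NamePartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case (name: String, _) => ((name.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

def increasingRuns(rdd: RDD[Measurement], minLen: Int, parts: Int): RDD[Seq[Measurement]] =
  rdd
    .keyBy(m => (m.name, m.timestamp))
    // Sorts by (name, timestamp) inside each partition; no global sort is needed.
    .repartitionAndSortWithinPartitions(new NamePartitioner(parts))
    .mapPartitions { iter =>
      // Single pass over the sorted stream: grow the current run while the value
      // keeps increasing for the same name, emit it once it breaks.
      val runs = scala.collection.mutable.ArrayBuffer.empty[Seq[Measurement]]
      var run = List.empty[Measurement] // current run, most recent element first
      def flush(): Unit = {
        if (run.length >= minLen) runs += run.reverse
        run = Nil
      }
      iter.foreach { case (_, m) =>
        run match {
          case prev :: _ if prev.name == m.name && m.value > prev.value => run = m :: run
          case _ => flush(); run = List(m)
        }
      }
      flush()
      runs.iterator
    }
```

Because the partitioner keys only on the name, a whole series never straddles two partitions, so no cross-partition stitching is required; only the current run and the matches found so far are buffered, never the whole group.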

  • sortByKey - this is similar to the first solution but provides higher granularity at the cost of additional post-processing:

  • keyBy (name, timestamp) pairs
  • sortByKey
  • process individual partitions using mapPartitionsWithIndex, keeping track of the leading / trailing patterns for each partition
  • adjust the final results to include patterns which span more than one partition (see the sketch after this list)
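A rough sketch of the per-partition pass, reusing the hypothetical Measurement type from the sketch above; PartitionSummary, maximalRuns and summarize are illustrative names. Since sortByKey can split one series across partition boundaries, every partition also reports the run fragments touching its edges, and a small follow-up step on the driver stitches adjacent summaries together:

```scala
import org.apache.spark.rdd.RDD

// Summary emitted by each non-empty partition.
case class PartitionSummary(
    index: Int,
    leading: Vector[Measurement],          // fragment at the start; may continue in partition index - 1
    trailing: Vector[Measurement],         // fragment at the end; may continue in partition index + 1
    complete: Vector[Vector[Measurement]]) // qualifying runs fully contained in this partition

// Split a time-ordered slice into maximal runs of strictly increasing values per name.
def maximalRuns(values: Vector[Measurement]): Vector[Vector[Measurement]] =
  values.drop(1).foldLeft(values.take(1).map(Vector(_))) { (acc, m) =>
    val prev = acc.last.last
    if (prev.name == m.name && m.value > prev.value) acc.init :+ (acc.last :+ m)
    else acc :+ Vector(m)
  }

def summarize(rdd: RDD[Measurement], minLen: Int): RDD[PartitionSummary] =
  rdd
    .keyBy(m => (m.name, m.timestamp))
    .sortByKey()
    .mapPartitionsWithIndex { (idx, iter) =>
      // Materializes one partition for brevity; a streaming fold would avoid this.
      val runs = maximalRuns(iter.map(_._2).toVector)
      if (runs.isEmpty) Iterator.empty
      else Iterator.single(PartitionSummary(
        idx, runs.head, runs.last,
        runs.drop(1).dropRight(1).filter(_.length >= minLen)))
    }

// Post-processing idea: collect the (small) summaries, order them by index, and
// merge trailing(i) with leading(i + 1) whenever the boundary still increases,
// so runs that span several partitions are recovered.
```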

  • create fixed-size windows over sorted data using sliding from mllib.rdd.RDDFunctions:

  • sortBy (name, timestamp)
  • create the sliding RDD and filter out windows covering more than one name
  • check whether any window contains the desired pattern (a sketch follows this list)
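A short sketch of this variant, again assuming the hypothetical Measurement record; n is the required run length (assumed to be at least 2) and increasingWindows is an illustrative name:

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._ // adds sliding() to plain RDDs
import org.apache.spark.rdd.RDD

def increasingWindows(rdd: RDD[Measurement], n: Int): RDD[Array[Measurement]] =
  rdd
    .sortBy(m => (m.name, m.timestamp))
    .sliding(n)                                   // fixed-size windows of n consecutive rows
    .filter { w =>
      w.forall(_.name == w.head.name) &&          // drop windows covering more than one name
      w.sliding(2).forall { case Array(a, b) => b.value > a.value }
    }
```

Each window is only n elements long, so nothing close to a whole group is ever buffered at once, but the overlapping windows do increase the shuffled data volume roughly n-fold.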

