F# Immutable data structures for high frequency real-time streaming data


Problem Description

We are at the beginning of an F# project involving real-time and historical analysis of streaming data. The data is contained in a C# object (see below) and is sent as part of a standard .NET event. In real time, the number of events we receive can vary greatly, from fewer than 1/sec to upwards of around 800 events per second per instrument, so the stream can be very bursty. A typical day might accumulate 5 million rows/elements per instrument.

A generic version of the C# event's data structure looks like this:

public enum MyType { type0 = 0, type1 = 1 }

public class dataObj
{
    public int myInt = 0;
    public double myDouble;
    public string myString;
    public DateTime myDataTime;
    public MyType type;
    public object myObj = null;
}

We plan to use this data structure in F# in two ways:

  1. Historical analysis using supervised and unsupervised machine learning (CRFs, clustering models, etc.)
  2. Real-time classification of data streams using the above models

The data structure needs to be able to grow as we add more events. This rules out array<'T>, because it does not allow resizing, though an array could still be used for the historical analysis. The data structure also needs quick access to recent data, and ideally it should let us jump back x points in the stream. This rules out the immutable F# list<'T>, because of its linear lookup time and because there is no random access to elements, just "forward-only" traversal.
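As a concrete illustration of those two limitations, here is a minimal FSharp.Core-only sketch; the Event type and the sizes are invented for the example.

    // Illustration only (plain FSharp.Core); the Event type and sizes are invented.
    open System

    type Event = { Time : DateTime; Value : float }   // simplified stand-in for dataObj

    // Arrays are fixed-size: growing means allocating a new array and copying.
    let buffer : Event[] = Array.zeroCreate 1000000
    // buffer.[1000000] <- someEvent      // out of range; no in-place resize

    // F# lists grow cheaply at the head, but indexing walks from the front.
    let history : Event list = []
    // history |> List.item 500000        // O(n) lookup, no random access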

According to this post, Set<T> may be a good choice...

> " ...Vanilla Set<'a> does a more than adequate job. I'd prefer a 'Set' over a 'List' so you always have O(lg n) access to the largest and smallest items, allowing you to ordered your set by insert date/time for efficient access to the newest and oldest items..."

EDIT: Yin Zhu's response gave me some additional clarity into exactly what I was asking, and I have edited the remainder of the post to reflect this. Also, the previous version of this question was muddied by the introduction of requirements for historical analysis; I have omitted them.

Here is a breakdown of the steps of the real-time process:

  1. A real-time event is received.
  2. The event is placed in a data structure. This is the data structure that we are trying to determine: should it be a Set<'T>, or some other structure?
  3. A subset of the elements is either extracted or somehow iterated over for the purpose of feature generation. This would be either the last n rows/elements of the data structure (e.g. the last 1,000 or 10,000 events) or all the elements in the last x secs/mins (e.g. all the events in the last 10 minutes). Ideally, we want a structure that lets us do this efficiently. In particular, a data structure that allows random access to the nth element without iterating through all the other elements is valuable.
  4. Features for the model are generated and sent to a model for evaluation.
  5. We may prune older data from the data structure to improve performance (a sketch of steps 2-5 follows this list).
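To make steps 2-5 concrete, here is a rough F# sketch. Everything in it is illustrative: Event is a simplified stand-in for dataObj, the helper functions are hypothetical, and Store is deliberately just a placeholder (an F# list here), since choosing the real structure is exactly what this question is about.

    // Rough sketch of steps 2-5; Store, Event and the helpers are placeholders.
    open System

    type Event = { Time : DateTime; Value : float }   // simplified stand-in for dataObj
    type Store = Event list                           // placeholder, newest event first

    let add (e : Event) (s : Store) : Store = e :: s                        // step 2
    let lastN (n : int) (s : Store) : Event list = s |> List.truncate n     // step 3: last n events
    let pruneOlderThan (cutoff : DateTime) (s : Store) : Store =
        s |> List.filter (fun e -> e.Time >= cutoff)                        // step 5

    let buildFeatures (events : Event list) : float[] =                     // step 4: dummy feature
        [| events |> List.averageBy (fun e -> e.Value) |]

    let evaluateModel (features : float[]) = ()                             // step 4: stand-in model call

    let onEvent (s : Store) (e : Event) : Store =
        let s = add e s                                                     // step 2
        s |> lastN 1000 |> buildFeatures |> evaluateModel                   // steps 3-4
        s |> pruneOlderThan (e.Time - TimeSpan.FromMinutes 10.0)            // step 5

The plain list handles step 2 cheaply but makes step 3's random access O(n), which is exactly the trade-off the question is about.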

So the question is: what is the best data structure for storing the real-time streaming events that we will use to generate features?

Solution

You should consider FSharpx.Collections.Vector. Vector<'T> will give you Array-like features, including indexed O(log32(n)) look-up and update, which is within spitting distance of O(1), as well as adding new elements to the end of your sequence. There is another persistent vector implementation usable from F#, Solid Vector. It is very well documented, and some of its functions perform up to 4x faster at large scale (element count > 10K). Both implementations perform very well up to, and possibly beyond, 1M elements.
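A minimal sketch of that suggestion is below. It assumes the FSharpx.Collections NuGet package, where the persistent vector is exposed through the PersistentVector module in current releases (the Vector<T> named above in older ones); the Event type is an invented stand-in for dataObj, and exact function names may vary between library versions.

    // Hedged sketch against FSharpx.Collections; module/function names assume a
    // current release (PersistentVector.empty / conj / nth / length).
    open System
    open FSharpx.Collections

    type Event = { Time : DateTime; Value : float }          // simplified stand-in for dataObj

    let mutable events : PersistentVector<Event> = PersistentVector.empty

    // Step 2: append the incoming event at the end (effectively O(1)).
    let onEvent (e : Event) =
        events <- events |> PersistentVector.conj e

    // Step 3: random access to the nth-from-last element in O(log32 n).
    let nthFromEnd (n : int) =
        let len = PersistentVector.length events
        events |> PersistentVector.nth (len - 1 - n)

    // Step 3: the last n events, iterated for feature generation.
    let lastN (n : int) =
        let len = PersistentVector.length events
        seq { for i in max 0 (len - n) .. len - 1 -> PersistentVector.nth i events }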
