F#: removing duplicates from a seq is slow

Question

I am attempting to write a function that weeds out consecutive duplicates, as determined by a given equality function, from a seq<'a> but with a twist: I need the last duplicate from a run of duplicates to make it into the resulting sequence. For example, if I have a sequence [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)], and I am using fun ((x1, y1),(x2, y2)) -> x1=x2 to check for equality, the result I want to see is [("a", 1); ("b", 4); ("c", 5)]. The point of this function is that I have data points coming in, where sometimes data points legitimately have the same timestamp, but I only care about the latest one, so I want to throw out the preceding ones with the same timestamp. The function I have implemented is as follows:

let rec dedupeTakingLast equalityFn prev s = seq {
     match ( Seq.isEmpty s ) with
     | true -> match prev with 
                 | None -> yield! s
                 | Some value -> yield value
     | false ->
         match prev with 
         | None -> yield! dedupeTakingLast equalityFn (Some (Seq.head s)) (Seq.tail s) 
         | Some prevValue -> 
             if not (equalityFn(prevValue, (Seq.head s))) then 
                 yield prevValue
             yield! dedupeTakingLast equalityFn (Some (Seq.head s)) (Seq.tail s)
}

In terms of actually doing the job, it works:

> [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)] 
  |> dedupeTakingLast (fun ((x1, y1),(x2, y2)) -> x1=x2) None 
  |> List.ofSeq;;
val it : (string * int) list = [("a", 1); ("b", 4); ("c", 5)]

However, in terms of performance, it's a disaster:

> #time
List.init 1000 (fun _ -> 1) 
|> dedupeTakingLast (fun (x,y) -> x = y) None 
|> List.ofSeq
#time;;    
--> Timing now on    
Real: 00:00:09.958, CPU: 00:00:10.046, GC gen0: 54, gen1: 1, gen2: 1
val it : int list = [1]    
--> Timing now off

Clearly I'm doing something very dumb here, but I cannot see what it is. Where is the performance hit coming from? I realise that there are better implementations, but I am specifically interested in understanding why this implementation is so slow.

FYI, I managed to eke out a decent implementation in the functional style that relies on Seq. functions only. Note that this version takes a curried equality function (equalityFn a b) rather than the tupled one used above. Performance is OK, taking about 1.6x the time of the imperative-style implementation by @gradbot below that uses iterators. It seems that the root of the problem is the use of Seq.head and Seq.tail in recursive calls in my original effort.

let dedupeTakingLastSeq equalityFn s = 
    s 
    |> Seq.map Some
    |> fun x -> Seq.append x [None]
    |> Seq.pairwise
    |> Seq.map (fun (x,y) -> 
            match (x,y) with 
            | (Some a, Some b) -> (if (equalityFn a b) then None else Some a)  
            | (_,None) -> x
            | _ -> None )
    |> Seq.choose id
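
To see why the pipeline above keeps the last element of each run, it may help to look at the intermediate pairs. The following is only an illustrative trace on a small, made-up input (the names input, wrapped, pairs and kept are not from the original post), with the equality check on the first tuple component written inline:

```fsharp
// Made-up sample data: the two "b" entries form one run of duplicates.
let input = [ ("a", 1); ("b", 2); ("b", 3); ("c", 4) ]

// Wrap every element in Some and append a None sentinel, so the final
// element also appears as the left-hand side of a pair.
let wrapped = Seq.append (Seq.map Some input) [ None ]

// Each element is now paired with its successor (or with the sentinel).
let pairs = wrapped |> Seq.pairwise |> List.ofSeq

// Keep the left element only when its successor differs or is the sentinel.
let kept =
    pairs
    |> List.choose (fun (x, y) ->
        match x, y with
        | Some a, Some b -> if fst a = fst b then None else Some a
        | _, None -> x
        | _ -> None)

printfn "%A" pairs
printfn "%A" kept
```

An element is dropped exactly when the element after it compares equal, so only the last member of each run survives.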

Recommended answer

The performance issue comes from the nested calls to Seq.tail. Here's the source code for Seq.tail:

[<CompiledName("Tail")>]
let tail (source: seq<'T>) =
    checkNonNull "source" source
    seq { use e = source.GetEnumerator() 
          if not (e.MoveNext()) then 
              invalidArg "source" (SR.GetString(SR.notEnoughElements))
          while e.MoveNext() do
              yield e.Current }

If you call Seq.tail(Seq.tail(Seq.tail(...))) the compiler has no way of optimizing out the enumerators that are created by GetEnumerator(). Subsequent returned elements have to go through every nested sequence and enumerator. This results in every returned element having to bubble up through all previously created sequences and all of these sequences have to increment their internal state as well. The net result is a running time of O(n^2) instead of linear O(n).
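
One way to observe this effect is to count how many elements get pulled from the underlying source when reading just the first element of a deeply nested tail. The sketch below is only an illustration: counted and nestTails are hypothetical helpers written for this example, not part of FSharp.Core.

```fsharp
// Hypothetical helper: wraps a sequence and counts every element pulled from it.
let counted (pulls: int ref) (source: seq<'T>) =
    seq {
        for x in source do
            pulls.Value <- pulls.Value + 1
            yield x
    }

// Hypothetical helper: applies Seq.tail k times, which is what the
// recursive calls in the question effectively build up.
let nestTails k (source: seq<'T>) =
    Seq.fold (fun s _ -> Seq.tail s) source (seq { 1 .. k })

let pulls = ref 0
let nested = nestTails 100 (counted pulls (List.init 1000 id))

// Reading one element forces every nested tail to skip its own element first.
let first = Seq.head nested

printfn "first = %d, elements pulled from the source = %d" first pulls.Value
```

With 100 nested tails, reading a single element should pull 101 elements from the source, because every Seq.head (or Seq.isEmpty) call starts a fresh enumeration from the beginning. Repeating that at every recursion depth is where the quadratic growth comes from.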

Unfortunately there is currently no way to represent this in a functional style in F#. You can with a list (x::xs) but not for a sequence. Until the language gets better native support for sequences, don't use Seq.tail in recursive functions.
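
For comparison, here is a minimal sketch of the same logic on a list, where the x :: xs pattern takes constant time per step. The name dedupeTakingLastList and the curried equals parameter are assumptions made for this example. It also requires the whole input to be in memory as a list, which may not suit streaming data.

```fsharp
// Sketch for lists only: drop an element whenever its successor compares equal,
// so the last element of each run of duplicates is the one that is kept.
let dedupeTakingLastList (equals: 'T -> 'T -> bool) (items: 'T list) : 'T list =
    let rec loop acc remaining =
        match remaining with
        | [] -> List.rev acc
        | [ last ] -> List.rev (last :: acc)
        | x :: ((y :: _) as rest) ->
            if equals x y then loop acc rest   // x is superseded by y
            else loop (x :: acc) rest          // x ends its run, keep it
    loop [] items

let sample = [ ("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5) ]
let result = dedupeTakingLastList (fun a b -> fst a = fst b) sample
printfn "%A" result
```

The loop is tail-recursive and visits each element once, so it runs in linear time.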

Using a single enumerator will fix the performance problem.

let RemoveDuplicatesKeepLast equals (items:seq<_>) =
    seq {
        use e = items.GetEnumerator()

        if e.MoveNext() then
            let mutable previous = e.Current

            while e.MoveNext() do
                if not (previous |> equals e.Current) then 
                    yield previous
                previous <- e.Current

            yield previous
    }

let test = [("a", 1); ("b", 2); ("b", 3); ("b", 4); ("c", 5)]
let FirstEqual a b = fst a = fst b

RemoveDuplicatesKeepLast FirstEqual test
|> printf "%A"

// output
// seq [("a", 1); ("b", 4); ("c", 5)]

The first version of this answer has a recursive version of the above code without mutation.
