Comparing two larger text arrays in PowerShell

Question

I have two arrays and would like to take the difference between them. I had some success with COMPARE-OBJECT, but it is too slow for larger arrays. In this example $ALLVALUES and $ODD are my two arrays.

I used to be able to do this efficiently using FINDSTR, e.g. FINDSTR /V /G:ODD.txt ALLVALUES.txt > EVEN.txt. FINDSTR finished this in under 2 seconds for 110,000 elements (and it even had to read from and write to disk).

I'm trying to get back to the FINDSTR performance where it would give me everything in ALLVALUES.txt that did NOT match ODD.txt (giving me the EVEN values in this case)

NOTE: This question is not about ODD or EVEN, only a practical example that can be quickly and visually verified that it is working as desired.

Here is the code that I have been playing with. Using COMPARE-OBJECT, 100,000 took like 200 seconds vs 2 seconds for FINDSTR on my computer. I'm thinking there is a much more elegant way to do this in PowerShell. Thanks for your help.

# -------  Build the MAIN array
$MIN = 1
$MAX = 100000
$PREFIX = "AA"

$ALLVALUES = while ($MIN -le $MAX) 
{
   "$PREFIX{0:D6}" -f $MIN++
}


# -------  Build the ODD values from the MAIN array
$MIN = 1
$MAX = 100000
$PREFIX = "AA"

$ODD = while ($MIN -le $MAX) 
{
   if ($MIN % 2) {
      "$PREFIX{0:D6}" -f $MIN++
   }
   else {
      $MIN++
   }
}

Measure-Command{$EVEN = Compare-Object -DifferenceObject $ODD -ReferenceObject $ALLVALUES -PassThru}


Recommended answer

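The arrays are objects, not just simple blobs of text like the ones findstr processes.
The fastest diff of string arrays is HashSet.SymmetricExceptWith (.NET 3.5+), which keeps the values that occur in exactly one of the two arrays. Here $a and $b stand for the two arrays being compared.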

$diff = [Collections.Generic.HashSet[string]]$a
$diff.SymmetricExceptWith([Collections.Generic.HashSet[string]]$b)
$diffArray = [string[]]$diff

46 ms for 100k elements on i7 CPU using your data.
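
SymmetricExceptWith matches the question's expected result only because every $ODD value also occurs in $ALLVALUES. If the second array could hold values missing from the first, HashSet.ExceptWith gives the one-sided difference that FINDSTR /V produced. A minimal sketch, assuming small made-up sample data ($all, $odd and the value ZZ999999 are illustrative and not from the original post):

```powershell
# Hypothetical sample data; ZZ999999 exists only in $odd.
$all = [string[]]('AA000001', 'AA000002', 'AA000003', 'AA000004')
$odd = [string[]]('AA000001', 'AA000003', 'ZZ999999')

# Symmetric difference: values found in exactly one of the two arrays.
$symmetric = [Collections.Generic.HashSet[string]]$all
$symmetric.SymmetricExceptWith($odd)

# One-sided difference: values in $all that are not in $odd.
$onlyInAll = [Collections.Generic.HashSet[string]]$all
$onlyInAll.ExceptWith($odd)

# HashSet enumeration order is not guaranteed, so sort before comparing.
$symmetricSorted = [string[]]$symmetric | Sort-Object
$onlyInAllSorted = [string[]]$onlyInAll | Sort-Object
```

With this data the symmetric result also contains ZZ999999, while the one-sided result holds only AA000002 and AA000004.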

The above code omits duplicate values so if those are needed in the output, I think we'll have to use a much much slower manual enumeration.

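function Diff-Array($a, $b, [switch]$unique) {
    if ($unique.IsPresent) {
        # Set-based path: fast, but duplicate values are collapsed.
        $diff = [Collections.Generic.HashSet[string]]$a
        $diff.SymmetricExceptWith([Collections.Generic.HashSet[string]]$b)
        return [string[]]$diff
    }
    # Counting path: +1 for each occurrence in $a, -1 for each occurrence in $b.
    $occurrences = @{}
    foreach ($item in $a) { $occurrences[$item]++ }
    foreach ($item in $b) { $occurrences[$item]-- }
    # Emit each value once per unit of difference between the two counts.
    foreach ($entry in $occurrences.GetEnumerator()) {
        $cnt = [Math]::Abs($entry.Value)
        while ($cnt--) { $entry.Key }
    }
}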

Usage:

$diffArray = Diff-Array $ALLVALUES $ODD

340 ms, 8x slower than hashset but 110x faster than Compare-Object!
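
If the goal is the one-sided filter from the question (keep everything in the first array that is absent from the second, in the original order and with duplicates intact), a lookup set plus a single pass may be enough. This is a sketch of a swapped-in technique, not part of the original answer; the sample data and the names $lookup and $even are assumptions, and it compares whole strings exactly and case-sensitively:

```powershell
# Hypothetical sample data; AA000002 is duplicated on purpose.
$all = 'AA000001', 'AA000002', 'AA000002', 'AA000003', 'AA000004'
$odd = 'AA000001', 'AA000003'

# Build the lookup set once so each membership test is a hash lookup.
$lookup = [Collections.Generic.HashSet[string]]::new([string[]]$odd)

# Keep every value of $all that is not in the lookup set.
$even = foreach ($value in $all) {
    if (-not $lookup.Contains($value)) { $value }
}
```

The ::new() syntax needs PowerShell 5 or later; on older versions the cast form used in the answer above can build the set instead.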

And lastly, we can make a faster Compare-Object for arrays of strings/numbers:

function Compare-StringArray($a, $b, [switch]$unsorted) {
    # A SortedDictionary yields output ordered by key; a plain hashtable is faster.
    $occurrences = if ($unsorted.IsPresent) { @{} }
                   else { [Collections.Generic.SortedDictionary[string,int]]::new() }
    # +1 for each occurrence in $a, -1 for each occurrence in $b.
    foreach ($item in $a) { $occurrences[$item]++ }
    foreach ($item in $b) { $occurrences[$item]-- }
    foreach ($entry in $occurrences.GetEnumerator()) {
        $cnt = $entry.Value
        if ($cnt) {
            # A negative count means $b has more copies, a positive one means $a does.
            $diff = [PSCustomObject]@{
                InputObject = $entry.Key
                SideIndicator = if ($cnt -lt 0) { '=>' } else { '<=' }
            }
            $cnt = [Math]::Abs($cnt)
            while ($cnt--) {
                $diff
            }
        }
    }
}

100k elements: 20-28x faster than Compare-Object, completes in 2100ms / 1460ms (unsorted)
10k elements: 2-3x faster, completes in 210ms / 162ms (unsorted)
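
Both functions above rely on the same signed counter, so a tiny worked example may help show how the counts turn into side indicators. The input values here are made up for illustration:

```powershell
# Hypothetical inputs: 'x' appears twice in $a only, 'z' once in $b only.
$a = 'x', 'x', 'y'
$b = 'y', 'z'

$occurrences = @{}
foreach ($item in $a) { $occurrences[$item]++ }   # count up for $a
foreach ($item in $b) { $occurrences[$item]-- }   # count down for $b

# Map each non-zero count to the side that has the surplus.
$summary = foreach ($key in ($occurrences.Keys | Sort-Object)) {
    $count = $occurrences[$key]
    if ($count) {
        $side = if ($count -lt 0) { '=>' } else { '<=' }
        '{0} {1} x{2}' -f $key, $side, [Math]::Abs($count)
    }
}
```

Here 'x' ends at 2 (surplus on the $a side), 'y' cancels out to 0 and is skipped, and 'z' ends at -1 (surplus on the $b side).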