有效地合并具有多个匹配键的大型对象数据集 [英] Efficiently merge large object datasets having mulitple matching keys

查看:79
本文介绍了有效地合并具有多个匹配键的大型对象数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在Powershell脚本中,我有两个具有多列的数据集.并非所有这些列都是共享的.

In a Powershell script, I have two data sets that have multiple columns. Not all these columns are shared.

例如,数据集1:

A B    XY   ZY  
- -    --   --  
1 val1 foo1 bar1
2 val2 foo2 bar2
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6

和数据集2:

A B    ABC  GH  
- -    ---  --  
3 val3 foo3 bar3
4 val4 foo4 bar4
5 val5 foo5 bar5
6 val6 foo6 bar6
7 val7 foo7 bar7
8 val8 foo8 bar8

我想合并这两个数据集,并指定哪些列用作键(在我的简单情况下为A和B).预期结果是:

I want to merge these two dataset, specifying which columns act as key (A and B in my simple case). The expected result is:

A B    XY   ZY   ABC  GH  
- -    --   --   ---  --  
1 val1 foo1 bar1          
2 val2 foo2 bar2          
3 val3 foo3 bar3 foo3 bar3
4 val4 foo4 bar4 foo4 bar4
5 val5 foo5 bar5 foo5 bar5
6 val6 foo6 bar6 foo6 bar6
7 val7           foo7 bar7
8 val8           foo8 bar8

该概念与SQL交叉联接查询非常相似.

The concept is very similar to a SQL cross join query.

我已经能够成功编写一个合并对象的函数.不幸的是,计算的持续时间是指数的.

I have been able to successfully write a function that merge objects. Unfortunately, the duration of the computation is exponential.

如果我使用生成数据集:

If I generate my data sets using :

$dsLength = 10
$dataset1 = 0..$dsLength | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}

我得到这些结果:

  • $dsLength = 10 ==> 33ms(精细)
  • $dsLength = 100 ==> 89ms(精细)
  • $dsLength = 1000 ==> 1563ms(可以接受)
  • $dsLength = 5000 ==> 35764ms(太多)
  • $dsLength = 10000 ==> 138047ms(太多)
  • $dsLength = 20000 ==> 573614毫秒(太多了)
  • $dsLength = 10 ==> 33ms (fine)
  • $dsLength = 100 ==> 89ms (fine)
  • $dsLength = 1000 ==> 1563ms (acceptable)
  • $dsLength = 5000 ==> 35764ms (too much)
  • $dsLength = 10000 ==> 138047ms (too much)
  • $dsLength = 20000 ==> 573614ms (far too much)

当数据集很大(我的目标是2万个项目左右)时,如何有效地合并数据集?

How can I merge datasets efficiently when data sets are large (my target is around 20K items) ?

现在,我已经定义了以下功能:

Right now, I have these function defined :

function Merge-Objects{
    param(
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset1,
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset2,
        [Parameter()]
        [string[]]$Properties
    )

    $result = @()

    $ds1props = $Dataset1 | gm -MemberType Properties
    $ds2props = $Dataset2 | gm -MemberType Properties
    $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
    $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }

    foreach($row1 in $Dataset1){
        $result += $row1
        $ds2propsNotInDs1Props | % {
            $row1 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
        }
    }

    foreach($row2 in $Dataset2){
        $existing = foreach($candidate in $result){
            $match = $true
            foreach($prop in $Properties){
                if(-not ($row2.$prop -eq $candidate.$prop)){
                    $match = $false                   
                    break                  
                }
            }
            if($match){
                $candidate
                break
            }
        }
        if(!$existing){
            $ds1propsNotInDs2Props | % {
                $row2 | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
            $result += $row2
        }else{
            $ds2propsNotInDs1Props | % {
                $existing.$($_.Name) = $row2.$($_.Name)
            }

        }
    }

    $result
}

我这样称呼这些功能:

Measure-Command -Expression {

    $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B" 

}

我的感觉是缓慢是由于第二个循环,我尝试在每次循环中匹配一个现有行

My feeling is that the slowness is due to the second loop, where I try to match an existing row in each iteration

使用哈希作为索引的第二种方法.令人惊讶的是,它的事件比第一次尝试要慢

$dsLength = 1000
$dataset1 = 0..$dsLength | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}

function Get-Hash{
    param(
        [Parameter(Mandatory=$true)]
        [object]$InputObject,
        [Parameter()]
        [string[]]$Properties    
    )

    $InputObject | Select-object $properties | Out-String
}


function Merge-Objects{
    param(
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset1,
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset2,
        [Parameter()]
        [string[]]$Properties
    )

    $result = @()
    $index = @{}

    $ds1props = $Dataset1 | gm -MemberType Properties
    $ds2props = $Dataset2 | gm -MemberType Properties
    $allProps = $ds1props + $ds2props | select -Unique

    $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
    $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }

    $ds1index = @{}

    foreach($row1 in $Dataset1){
        $tempObject = new-object psobject
        $result += $tempObject
        $ds2propsNotInDs1Props | % {
            $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
        }
        $ds1props | % {
            $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row1.$($_.Name)
        }

        $hash1 = Get-Hash -InputObject $row1 -Properties $Properties
        $ds1index.Add($hash1, $tempObject)

    }

    foreach($row2 in $Dataset2){
        $hash2 = Get-Hash -InputObject $row2 -Properties $Properties

        if($ds1index.ContainsKey($hash2)){
            # merge object
            $existing = $ds1index[$hash2]
            $ds2propsNotInDs1Props | % {
                $existing.$($_.Name) = $row2.$($_.Name)
            }
            $ds1index.Remove($hash2)

        }else{
            # add object
            $tempObject = new-object psobject
            $ds1propsNotInDs2Props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
            $ds2props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $row2.$($_.Name)
            }
            $result += $tempObject
        }
    }

    $result
}

Measure-Command -Expression {

    $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B" 

}

[Edit2]将Measure-Command放在两个循环中会显示出第一个循环还很慢的事件.实际上,第一个循环占用了总时间的50%以上

Putting Measure-Commands around the two loops show that event the first loop is yet slow. Actually the first loop take more than 50% of the total time

推荐答案

我同意@Matt.使用哈希表-类似于以下内容.这应该在m + 2n时间而不是mn时间运行.

I agree with @Matt. Use a hashtable -- something like the below. This should run in m + 2n rather than mn time.

我的系统上的时间

原始解决方案

#10    TotalSeconds      :   0.07788
#100   TotalSeconds      :   0.37937
#1000  TotalSeconds      :   5.25092
#10000 TotalSeconds      : 242.82018
#20000 TotalSeconds      : 906.01584

这肯定看起来是O(n ^ 2)

This definitely looks O(n^2)

以下解决方案

#10    TotalSeconds      :  0.094
#100   TotalSeconds      :  0.425
#1000  TotalSeconds      :  3.757
#10000 TotalSeconds      : 45.652
#20000 TotalSeconds      : 92.918

这看起来是线性的.

解决方案

我使用了三种技术来提高速度:

I used three techniques to increase the speed:

  1. 切换到哈希表.这允许进行恒定时间的查找,因此您不必具有嵌套循环.这是从O(n ^ 2)到线性时间真正需要的唯一更改.这样做的缺点是需要完成更多的设置工作.因此,直到循环计数足够大以支付设置费用时,才能看到线性时间的优势.
  2. 使用ArrayList代替本地数组.将项目添加到本机数组需要重新分配该数组并复制所有项目.因此这也是O(n ^ 2)运算.由于此操作是在引擎级别完成的,因此常数很小,因此直到很久以后它才真正起作用.
  3. 使用PsObject.Copy创建新对象.与其他两个相比,这是一个很小的优化,但对我来说却将运行时间缩短了一半.

-

function Get-Hash{
    param(
        [Parameter(Mandatory=$true)]
        [object]$InputObject,
        [Parameter()]
        [string[]]$Properties    
    )

    $arr = [System.Collections.ArrayList]::new()

    foreach($p in $Properties) { $arr += $InputObject.$($p) }

    return ( $arr -join ':' )
}

function Merge-Objects{
    param(
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset1,
        [Parameter(Mandatory=$true)]
        [object[]]$Dataset2,
        [Parameter()]
        [string[]]$Properties
    )

    $results = [System.Collections.ArrayList]::new()

    $ds1props = $Dataset1 | gm -MemberType Properties
    $ds2props = $Dataset2 | gm -MemberType Properties
    $ds1propsNotInDs2Props = $ds1props | ? { $_.Name -notin ($ds2props | Select -ExpandProperty Name) }
    $ds2propsNotInDs1Props = $ds2props | ? { $_.Name -notin ($ds1props | Select -ExpandProperty Name) }


    $hash = @{}
    $Dataset2 | % { $hash.Add( (Get-Hash $_ $Properties), $_) }

    foreach ($row in $dataset1) {

        $key = Get-Hash $row $Properties

        $tempObject = $row.PSObject.Copy()

        if ($hash.containskey($key)) {
            $r2 = $hash[$key]

            $hash.remove($key)
            $ds2propsNotInDs1Props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $r2.$($_.Name)
            }

        } else {
            $ds2propsNotInDs1Props | % {
                $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
            }
        }
        [void]$results.Add($tempObject)
    }

    foreach ($row in $hash.values ) {
        # add missing dataset2 objects and extend
        $tempObject = $row.PSObject.Copy()

        $ds1propsNotInDs2Props | % {
            $tempObject | Add-Member -MemberType $_.MemberType -Name $_.Name -Value $null
        }

        [void]$results.Add($tempObject)
    }

    $results
}

########

$dsLength = 10000
$dataset1 = 0..$dsLength | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; XY = "foo$_"; ZY ="bar$_" }
}
$dataset2 = ($dsLength/2)..($dsLength*1.5) | %{
    New-Object psobject -Property @{ A=$_ ; B="val$_" ; ABC = "foo$_"; GH ="bar$_" }
}

Measure-Command -Expression {

    $data = Merge-Objects -Dataset1 $dataset1 -Dataset2 $dataset2 -Properties "A","B" 

}

这篇关于有效地合并具有多个匹配键的大型对象数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆