哪个运算符为大型CSV文件提供更快的输出-match -contains或Where-Object [英] Which operator provides quicker output -match -contains or Where-Object for large CSV files

查看:62
本文介绍了哪个运算符为大型CSV文件提供更快的输出-match -contains或Where-Object的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试建立一种逻辑,其中我必须针对1个CSV文件查询4个大型CSV文件.特别是针对4个域找到一个AD对象,并将它们存储在变量中以进行属性比较.

I am trying to build a logic where I have to query 4 large CSV files against 1 CSV file. Particularly finding an AD object against 4 domains and store them in variable for attribute comparison.

我尝试导入所有文件中的不同变量,并使用以下3种不同的代码来获得所需的输出.但是完成所需的时间比预期的要长.

I have tried importing all files in different variables and used below 3 different codes to get the desired output. But it takes longer time for completion than expected.

CSV导入:

$AllMainFile = Import-csv c:\AllData.csv
#Input file contains below
EmployeeNumber,Name,Domain
Z001,ABC,Test.com
Z002,DEF,Test.com
Z003,GHI,Test1.com
Z001,ABC,Test2.com


$AAA = Import-csv c:\AAA.csv
#Input file contains below
EmployeeNumber,Name,Domain
Z001,ABC,Test.com
Z002,DEF,Test.com
Z003,GHI,Test1.com
Z001,ABC,Test2.com
Z004,JKL,Test.com

$BBB = Import-Csv C:\BBB.csv
$CCC = Import-Csv C:\CCC.csv
$DDD = Import-Csv c:\DDD.csv

示例代码1:

foreach ($x in $AllMainFile) {
    $AAAoutput += $AAA | ? {$_.employeeNumber -eq $x.employeeNumber}
    $BBBoutput += $BBB | ? {$_.employeeNumber -eq $x.employeeNumber}
    $CCCoutput += $CCC | ? {$_.employeeNumber -eq $x.employeeNumber}
    $DDDoutput += $DDD | ? {$_.employeeNumber -eq $x.employeeNumber}

    if ($DDDoutput.Count -le 1 -and $AAAoutput.Count -le 1 -and $BBBoutput.Count -le 1 -and $CCCoutput.Count -le 1) {
        #### My Other script execution code here
    } else {
        #### My Other script execution code here
    }
}

示例代码2(只需替换为 -match 而不是 Where-Object ):

Sample code 2 (just replacing with -match instead of Where-Object):

foreach ($x in $AllMainFile) {
    $AAAoutput += $AAA -match $x.EmployeeNumber
    $BBBoutput += $BBB -match $x.EmployeeNumber
    $CCCoutput += $CCC -match $x.EmployeeNumber
    $DDDoutput += $AllMainFile -match $x.EmployeeNumber

    if ($DDDoutput.Count -le 1 -and $AAAoutput.Count -le 1 -and $BBBoutput.Count -le 1 -and $CCCoutput.Count -le 1) {
        #### My Other script execution code here
    } else {
        #### My Other script execution code here
    }
}

示例代码3(只需替换为 -contains 运算符):

Sample code 3 (just replacing with -contains operator):

foreach ($x in $AllMainFile) {
    foreach ($c in $AAA){ if ($AllMainFile.employeeNumber -contains $c.employeeNumber) {$AAAoutput += $c}}
    foreach ($c in $BBB){ if ($AllMainFile.employeeNumber -contains $c.employeeNumber) {$BBBoutput += $c}}
    foreach ($c in $CCC){ if ($AllMainFile.employeeNumber -contains $c.employeeNumber) {$CCCoutput += $c}}
    foreach ($c in $DDD){ if ($AllMainFile.employeeNumber -contains $c.employeeNumber) {$DDDoutput += $c}}

    if ($DDDoutput.Count -le 1 -and $AAAoutput.Count -le 1 -and $BBBoutput.Count -le 1 -and $CCCoutput.Count -le 1) {
        #### My Other script execution code here
    } else {
        #### My Other script execution code here
    }
}

我希望通过将所有4个CSV文件与1个输入文件进行比较和查找来尽可能快地执行脚本.每个文件包含超过1000k的对象/行以及5列.

I am expecting to execute the script as quick and fast as possible by comparing and lookup all 4 CSV files against 1 input file. Each files contains more than 1000k objects/rows with 5 columns.

推荐答案

性能

在回答问题之前,我想澄清一下有关衡量PowerShell cmdlet性能的信息.本机PowerShell在流对象方面非常出色,因此如果正确进行流传输,则可以节省大量内存(请勿将流分配给变量或使用方括号).PowerShell也能够调用几乎所有现有的 .Net 方法(如 Add())和类似 LINQ 之类的技术.

Performance

Before answering the question, I would like to clear some air about measuring the performance of PowerShell cmdlets. Native PowerShell is very good in streaming objects and therefore could save a lot of memory if streamed correctly (do not assign a stream to a variable or use brackets). PowerShell is also capable of invoking almost every existing .Net methods (like Add()) and technologies like LINQ.

衡量命令性能的常用方法是:

The usual way of measuring the performance of a command is:

(Measure-Command {<myCommand>}).TotalMilliseconds

如果在本机Powershell流式cmdlet上使用此命令,则它们与语句和dotnet命令相比似乎表现不佳.通常可以得出这样的结论:LINQ优于本机PowerShell命令的程度要高出一百倍.这样做的原因是LINQ是反应性的,并且使用了延迟的(延迟)执行:告诉它已经完成了工作,但是实际上在需要任何结果的时候就正在执行(除了缓存很多结果,这是最简单的)通过开始新的会话将其排除在基准测试之外)Native PowerShell的前摄性很强:它将任何已解决的项目立即传递回管道中,任何下一个cmdlet(例如 Export-Csv )都可能比最终完成项目并从内存中释放.
换句话说,如果输入速度较慢(请参阅:使用本机PowerShell )或要处理大量数据(例如大于可用的物理内存),使用本机PowerShell方法可能会更好,更容易.
无论如何,如果要比较任何结果,都应该在实践中进行测试,并进行端到端测试,而不仅仅是在内存中已经可用的数据上进行测试.

If you use this on native powershell streaming cmdlets, they appear not to perform very well in comparison with statements and dotnet commands. Often it is concluded that e.g. LINQ outperforms native PowerShell commands well over a factor hundred. The reason for this is that LINQ is reactive and using a deferred (lazy) execution: It tells it has done the job but it is actually doing it at the moment you need any result (besides it is caching a lot of results which is easiest to exclude from a benchmark by starting a new session) where of Native PowerShell is rather proactive: it passes any resolved item immediately back into the pipeline and any next cmdlet (e.g. Export-Csv) might than finalize the item and release it from memory.
In other words, if you have a slow input (see: Advocating native PowerShell) or have a large amount data to process (e.g. larger than the physical memory available), it might be better and easier to use the Native PowerShell approach.
Anyways, if you are comparing any results, you should test is in practice and test it end-to-end and not just on data that is already available in memory.

我同意在列表上使用 Add()方法比使用 + = 将列表中的新项目与当前数组连接在一起,然后使用将其重新分配.
但是同样,这两种方法都会使管道停滞不前,因为它们会收集内存中的所有数据,您可能会更好地将结果中间释放到磁盘上.

I agree that using the Add() method on a list is much faster that using += which concatenates the new item with the current array and then reassigns it back to the array.
But again, both approaches stall the pipeline as they collect all the data in memory where you might be better off to intermediately release the result to the disk.

由于为二进制搜索进行了优化,因此您可能会发现使用哈希表可以最大程度地提高性能..
由于需要将两个集合进行相互比较,因此您不能同时流式传输两个数据集,但是正如所解释的那样,最好在一侧使用1个哈希表,然后在另一侧将其与流中的每个项目进行比较,这是最好和最简单的方法,因为您要比较每个其他表的 AllData ,所以最好将该表索引到内存中(以哈希表的形式).

You will probably find the most performance improvement in using a hash table as they are optimized for a binary search.
As it is required to compare two collections to each other, you can't stream both but as explained, it might be best and easiest you use 1 hash table for one side and compare this to each item in a stream at the other side and because you want to compare the AllData which each of the other tables, it is best to index that table into memory (in the form of a hash table).

这就是我要这样做的方式:

This is how I would do this:

$Main = @{}
ForEach ($Item in $All) {
    $Main[$Item.EmployeeNumber] = @{MainName = $Item.Name; MainDomain = $Item.Domain}
}

ForEach ($Name in 'AAA', 'BBB', 'CCC', 'DDD') {
    Import-Csv "C:\$Name.csv" | Where-Object {$Main.ContainsKey($_.EmployeeNumber)} | ForEach-Object {
        [PSCustomObject](@{EmployeeNumber = $_.EmployeeNumber; Name = $_.Name; Domain = $_.Domain} + $Main[$_.EmployeeNumber])
    } | Export-Csv "C:\Output$Name.csv"
}


附录

基于注释(以及列表中的重复项),似乎实际上是在请求所有键的联接,而不仅仅是在 EmployeeNumber 上.为此,您需要串联相关的键(用数据中使用的 not 分隔符分隔),并将其用作哈希表的键.
不在问题中,但从评论中可以看出,完全连接是可以预期的.对于右联接部分,可以通过在主表( $ Main.ContainsKey($ Key))中找不到匹配项的情况下返回右对象来完成.对于左连接部分,此操作更为复杂,因为您需要跟踪($ InnerMain)main中的哪些项已匹配,并最后返回剩余项:

Based on the comment (and the duplicates in the lists), it appears that actually a join on all keys is requested and not just on the EmployeeNumber. For this you need to concatenate the concerned keys (separated with a separator that is not used in the data) and use that as key for the hash table.
Not in the question but from the comment it appears also that full-join is expected. For the right-join part this can be done by returning the right object in case there is no match found in the main table ($Main.ContainsKey($Key)). For the left-join part this is more complex as you will need to track ($InnerMain) which items in main are already matched and return the leftover items in the end:

$Main = @{}
$Separator = "`t"                       # Chose a separator that isn't used in any value
ForEach ($Item in $All) {
    $Key = $Item.EmployeeNumber, $Item.Name, $Item.Domain -Join $Separator
    $Main[$Key] = @{MainEmployeeNumber = $Item.EmployeeNumber; MainName = $Item.Name; MainDomain = $Item.Domain}    # What output is expected?
}

ForEach ($Name in 'AAA', 'BBB', 'CCC', 'DDD') {
    $InnerMain = @($False) * $Main.Count
    $Index = 0
    Import-Csv "C:\$Name.csv" | ForEach-Object {
        $Key = $_.EmployeeNumber, $_.Name, $_.Domain -Join $Separator
        If ($Main.ContainsKey($Key)) {
            $InnerMain[$Index] = $True
            [PSCustomObject](@{EmployeeNumber = $_.EmployeeNumber; Name = $_.Name; Domain = $_.Domain} + $Main[$Key])
        } Else {
            [PSCustomObject](@{EmployeeNumber = $_.EmployeeNumber; Name = $_.Name; Domain = $_.Domain; MainEmployeeNumber = $Null; MainName = $Null; MainDomain = $Null})
        }
        $Index++
    } | Export-Csv "C:\Output$Name.csv"
    $Index = 0
    ForEach ($Item in $All) {
        If (!$InnerMain[$Index]) {
            $Key = $Item.EmployeeNumber, $Item.Name, $Item.Domain -Join $Separator
            [PSCustomObject](@{EmployeeNumber = $Null; Name = $Null; Domain = $Null} + $Main[$Key])
        }
        $Index++
    } | Export-Csv "C:\Output$Name.csv"
}


加入对象

仅供参考,我对 Join-Object cmdlet(使用和安装非常简单,请参见:在Powershell中,将两个表合并为一个的最佳方法是什么?),包括更轻松地更改多个联接,这可能对于请求这样的请求很方便.尽管我仍然对您要查找的内容不完全了解(并且有一些小问题,例如:如果域列是从一个特定域中提取的,那么域在域列中会有什么不同?).
我以一般描述"特别是针对4个域找到一个AD对象并将其存储在变量中以进行属性比较"作为引导.在这里,我假设 $ AllMainFile 实际上只是一个中间表,存在于所有相关表的串联中(并且不一定,但是只是令人困惑,因为它可能包含重复类型的同一域中的employeenumbers 以及其他域中的 employeenumbers ).如果正确,则可以使用 Join-Object cmdlet省略该表:


Join-Object

Just FYI, I have made a few improvements to Join-Object cmdlet (use and installation are very simple, see: In Powershell, what's the best way to join two tables into one?) including an easier changing of multiple joins which might come in handy for a request as this one. Although I still do not have full understanding of what you exactly looking for (and have minor questions like: how could the domains differ in a domain column if it is an extract from one specific domain?).
I take the general description "Particularly finding an AD object against 4 domains and store them in variable for attribute comparison" as leading. In here I presume that the $AllMainFile is actually just an intermediate table existing out of a concatenation of all concerned tables (and not really necessarily but just confusing as it might contain to types of duplicates the employeenumbers from the same domain and the employeenumbers from other domains). If this is correct, you can just omit this table using the Join-Object cmdlet:

$AAA = ConvertFrom-Csv @'
EmployeeNumber,Name,Domain
Z001,ABC,Domain1
Z002,DEF,Domain2
Z003,GHI,Domain3
'@

$BBB = ConvertFrom-Csv @'
EmployeeNumber,Name,Domain
Z001,ABC,Domain1
Z002,JKL,Domain2
Z004,MNO,Domain4
'@

$CCC = ConvertFrom-Csv @'
EmployeeNumber,Name,Domain
Z005,PQR,Domain2
Z001,ABC,Domain1
Z001,STU,Domain2
'@

$DDD = ConvertFrom-Csv @'
EmployeeNumber,Name,Domain
Z005,VWX,Domain4
Z006,XYZ,Domain1
Z001,ABC,Domain3
'@

$AAA | FullJoin $BBB -On EmployeeNumber -Discern AAA |
    FullJoin $CCC -On EmployeeNumber -Discern BBB |
    FullJoin $DDD -On EmployeeNumber -Discern CCC,DDD | Format-Table

结果:

EmployeeNumber AAAName AAADomain BBBName BBBDomain CCCName CCCDomain DDDName DDDDomain
-------------- ------- --------- ------- --------- ------- --------- ------- ---------
Z001           ABC     Domain1   ABC     Domain1   ABC     Domain1   ABC     Domain3
Z001           ABC     Domain1   ABC     Domain1   STU     Domain2   ABC     Domain3
Z002           DEF     Domain2   JKL     Domain2
Z003           GHI     Domain3
Z004                             MNO     Domain4
Z005                                               PQR     Domain2   VWX     Domain4
Z006                                                                 XYZ     Domain1

这篇关于哪个运算符为大型CSV文件提供更快的输出-match -contains或Where-Object的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆