Fastest way to get a uniquely indexed item from the property of an array


Question

Make an array like this, which represents what I'm looking for:

$array = @(1..50000).foreach{[PSCustomObject]@{Index=$PSItem;Property1='Hello!';Property2=(Get-Random)}}

What's the fastest way to get the item whose Index property is 43122?

Some ideas I had, but I feel like there must be a quicker way:

measure-command {$array | where-object index -eq 43122} | % totalmilliseconds
420.3766

Where method:

# compare the Index property; $_ -eq 43122 would test the whole object and never match
measure-command {$array.where{$_.Index -eq 43122}} | % totalmilliseconds
155.1342

Make a hashtable first and query it by Index. Slow at first, but subsequent lookups are faster:

measure-command {$ht = @{};$array.foreach{$ht[$PSItem.index] = $psitem}} | % totalmilliseconds
124.0821

measure-command {$ht.43122} | % totalmilliseconds
3.4076

Is there a faster way than building a hashtable first? Maybe a different .NET array type, like some special kind of indexed list that I can store the items in initially and then call a method on to pull out the item based on the unique property?

Recommended answer

Partly thanks to the fact that PowerShell is able to invoke .NET methods, it offers quite a few possibilities to filter objects. On Stack Overflow you will find a lot of (PowerShell) questions and answers measuring the performance of a specific, isolated command or cmdlet. This usually leaves the wrong impression, as the performance of a complete (PowerShell) solution is supposed to be better than the sum of its parts. Each command depends on its expected input and output. Especially when using the PowerShell pipeline, commands (cmdlets) interact with the commands before and after them. Therefore it is important to look at the bigger picture and understand how and where each command gains its performance.
This means that I can't tell which command you should choose, but with a better understanding of the commands and concepts listed below, I hope you are better able to find the "fastest way" for your specific solution.

Language Integrated Query (LINQ) is often (dis)qualified as the fastest solution to filter objects in PowerShell (see also High Performance PowerShell with LINQ):

(Measure-Command {
    $Result = [Linq.Enumerable]::Where($array, [Func[object,bool]] { param($Item); return $Item.Index -eq 43122 })
}).totalmilliseconds
4.0715

Just over 4ms! None of the other methods can ever beat that...
But before jumping to the conclusion that LINQ beats any other method by a factor of 100 or more, you should keep the following in mind. There are two pitfalls in measuring the performance of a LINQ query when you just look at the performance of the activity itself:

  • LINQ has a big cache, meaning that you should start a new PowerShell session to measure the actual results (or not, if you often want to reuse the query). After restarting the PowerShell session, you will find that it takes about 6 times longer to initiate the LINQ query.
  • But more importantly, LINQ performs lazy evaluation (also called deferred execution). This means that nothing has actually been done yet other than defining what should be done. This shows as soon as you access one of the properties of the $Result:

(Measure-Command {
    $Result.Property1
}).totalmilliseconds
532.366

Whereas it usually takes about 15ms to retrieve a property of a single object:

$Item = [PSCustomObject]@{Index=1; Property1='Hello!'; Property2=(Get-Random)}
(Measure-Command {
    $Item.Property1
}).totalmilliseconds
15.3708

Bottom line: you need to instantiate the results to correctly measure the performance of a LINQ query (for this, let's just retrieve one of the properties of the returned object within the measurement):

(Measure-Command {
    $Result = ([Linq.Enumerable]::Where($array, [Func[object,bool]] { param($Item); return $Item.Index -eq 43122 })).Property1
}).totalmilliseconds
570.5087

(Which is still fast.)
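The deferred execution described above can be made visible by counting how often the predicate runs. This is only a minimal sketch: the smaller $sample array, the $script:calls counter and the other variable names are assumptions for illustration, not part of the original measurements.

```powershell
# Hypothetical, smaller stand-in for $array so the sketch runs quickly.
$sample = @(1..1000).ForEach{ [PSCustomObject]@{ Index = $PSItem; Property1 = 'Hello!' } }

# Count every predicate invocation.
$script:calls = 0
$predicate = [Func[object,bool]] { param($Item); $script:calls++; $Item.Index -eq 431 }

# Defining the query does not run the predicate yet.
$query = [Linq.Enumerable]::Where([object[]]$sample, $predicate)
$callsAfterDefine = $script:calls

# Materializing the query is what actually walks the array.
$found = [Linq.Enumerable]::ToArray($query)
$callsAfterRun = $script:calls
```

Comparing $callsAfterDefine with $callsAfterRun shows where the work happens, which is why a measurement that only covers the definition of the query looks so fast.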

Hash tables are generally fast because a lookup hashes the key and goes straight to the matching bucket, so the cost stays roughly constant no matter how many items are stored (for comparison, even a binary search over a sorted list would need up to ln 50000 / ln 2 ≈ 16 guesses to find your object). Nevertheless, building a hash table for a single lookup is a little overdone. But if you control the construction of the object list, you might construct the hash table on the fly:

(Measure-Command {
    $ht = @{}
    $array = @(1..50000).foreach{$ht[$PSItem] = [PSCustomObject]@{Index=$PSItem;Property1='Hello!';Property2=(Get-Random)}}
    $ht.43122
}).totalmilliseconds
3415.1196

Versus:

(Measure-Command {
    $array = @(1..50000).foreach{[PSCustomObject]@{Index=$PSItem;Property1='Hello!';Property2=(Get-Random)}}
    $ht = @{}; $array.foreach{$ht[$PSItem.index] = $psitem}
    $ht.43122
}).totalmilliseconds
3969.6451
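The trade-off above (pay the build cost once, then reuse the table) can be sketched as follows. The smaller data set and the names $items, $byIndex, $wanted and $hits are assumptions made for this illustration.

```powershell
# Hypothetical, smaller data set; the property names follow the question.
$items = @(1..5000).ForEach{ [PSCustomObject]@{ Index = $PSItem; Property1 = 'Hello!' } }

# Build the lookup table once, keyed on the unique Index property.
$byIndex = @{}
foreach ($item in $items) { $byIndex[$item.Index] = $item }

# Reuse it for any number of lookups without scanning the array again.
$wanted = 4312, 17, 2500
$hits = foreach ($key in $wanted) { $byIndex[$key] }
```

The more lookups follow the single build step, the smaller the share of the build cost per lookup becomes.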

Where-Object cmdlet vs Where method

As you might already have concluded yourself, the Where method appears to be about twice as fast as the Where-Object cmdlet:

Where-Object cmdlet:

(Measure-Command {
    $Result = $Array | Where-Object index -eq 43122
}).totalmilliseconds
721.545

Where method:

(Measure-Command {
    # compare the Index property, not the whole object
    $Result = $Array.Where{$_.Index -eq 43122}
}).totalmilliseconds
319.0967

The reason is that the Where method requires the whole array to be loaded into memory, which is not required for the Where-Object cmdlet. If the data is already in memory (e.g. by assigning it to a variable, $array = ...) this is not a big deal, but it can be a disadvantage in itself: besides consuming memory, you have to wait until all objects have been received before you can start filtering...

Don't underestimate the power of PowerShell cmdlets like Where-Object, especially when you look at the solution as a whole in combination with the pipeline. As shown above, if you just measure the specific action you might find these cmdlets slow, but if you measure your whole end-to-end solution you might find that there isn't much difference and that cmdlets might even outperform methods and other techniques. Where LINQ queries are extremely reactive, PowerShell cmdlets are extremely proactive.
In general, if your input is not yet in memory and is supplied via the pipeline, you should try to keep building on that pipeline and avoid stalling it in any way, that is, avoid variable assignments ($array = ...) and the use of parentheses ((...)):

Presume that your objects come from a slower input. In that case all the other solutions need to wait for the very last object before they can start filtering, whereas Where-Object has already filtered most of the objects on the fly, and as soon as it has found the object, it is immediately passed on to the next cmdlet...

For example, let's presume that the data comes from a csv file rather than from memory...

$Array | Export-Csv .\Test.csv

Where-Object cmdlet:

(Measure-Command {
    Import-Csv -Path .\Test.csv | Where-Object index -eq 43122 | Export-Csv -Path .\Result.csv
}).totalmilliseconds
717.8306

Where method:

(Measure-Command {
    $Array = Import-Csv -Path .\Test.csv
    # compare the Index property, not the whole object
    Export-Csv -Path .\Result.csv -InputObject $Array.Where{$_.Index -eq 43122}
}).totalmilliseconds
747.3657

This is just a single test example, but in most cases where the data isn't instantly available in memory, Where-Object streaming appears to be faster than using the Where method.
Besides, the Where method uses a lot more memory, which might make performance even worse if the size of your file (list of objects) exceeds the available physical memory. (See also: Can the following Nested foreach loop be simplified in PowerShell?).
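The streaming behaviour described above can be observed with a small sketch. Get-SlowItem is a hypothetical stand-in for a slow source such as Import-Csv, and the use of Select-Object -First 1 to end the pipeline early is an addition for illustration, not something measured in the tests above.

```powershell
# Counts how many objects the source has produced so far.
$script:produced = 0

# Hypothetical stand-in for a slow, streaming source such as Import-Csv.
function Get-SlowItem {
    foreach ($i in 1..1000) {
        $script:produced++
        [PSCustomObject]@{ Index = $i; Property1 = 'Hello!' }
    }
}

# Each object flows through Where-Object as soon as it is produced;
# Select-Object -First 1 ends the pipeline once it has received its item.
$first = Get-SlowItem | Where-Object Index -eq 10 | Select-Object -First 1
```

Inspecting $script:produced afterwards shows that the filter did not have to wait for the whole input before passing on its match.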

Instead of using the Where-Object cmdlet or the Where method, you might consider iterating through all the objects and simply comparing them with an If statement. Before going into the depth of this approach, it is worth mentioning that comparison operators already iterate through the left argument by themselves, quote:

When the input to an operator is a scalar value, comparison operators return a Boolean value. When the input is a collection of values, the comparison operators return any matching values. If there are no matches in a collection, comparison operators return an empty array.

This means that if you just want to know whether an object with the specific property value exists and don't care about the object itself, you can simply compare the collection of that property's values:

(Measure-Command {
    If ($Array.Index -eq 43122) {'Found object with the specific property value'}
}).totalmilliseconds
55.3483
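The quoted operator behaviour can be checked on a tiny collection. The values and variable names below are made up for illustration only.

```powershell
$values = 10, 20, 30, 20

$isEqual = 20 -eq 20          # scalar on the left: a Boolean
$hits    = $values -eq 20     # collection on the left: the matching values
$none    = $values -eq 99     # no match: an empty array

# Caveat: a single match on a falsy value (such as 0) is itself falsy in an If test.
$zeroHit = [bool](@(0, 1, 2) -eq 0)
```

The last line is worth keeping in mind when using this shortcut as an existence test: it works for an Index such as 43122, but not for values like 0 or an empty string.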

For the ForEach-Object cmdlet and the ForEach method, you will see that this approach takes a little longer than using their counterparts (the Where-Object cmdlet and the Where method), as there is a little more overhead for the embedded comparison:

Direct from memory:
ForEach-Object cmdlet:

(Measure-Command {
    $Result = $Array | ForEach-Object {If ($_.index -eq 43122) {$_}}
}).totalmilliseconds
1031.1599

ForEach method:

(Measure-Command {
    $Result = $Array.ForEach{If ($_.index -eq 43122) {$_}}
}).totalmilliseconds
781.6769

Streaming from disk:
ForEach-Object cmdlet:

(Measure-Command {
    Import-Csv -Path .\Test.csv |
    ForEach-Object {If ($_.index -eq 43122) {$_}} |
    Export-Csv -Path .\Result.csv
}).totalmilliseconds
1978.4703

ForEach method:

(Measure-Command {
    $Array = Import-Csv -Path .\Test.csv
    Export-Csv -Path .\Result.csv -InputObject $Array.ForEach{If ($_.index -eq 43122) {$_}}
}).totalmilliseconds
1447.3628

ForEach command

But even with the embedded comparison, the ForEach command appears close to the performance of the Where method when the $Array is already available in memory:

Direct from memory:

(Measure-Command {
    $Result = $Null
    ForEach ($Item in $Array) {
        If ($Item.index -eq 43122) {$Result = $Item}
    }
}).totalmilliseconds
382.6731

Streaming from disk:

(Measure-Command {
    $Result = $Null
    $Array = Import-Csv -Path .\Test.csv
    ForEach ($Item in $Array) {
        If ($item.index -eq 43122) {$Result = $Item}
    }
    Export-Csv -Path .\Result.csv -InputObject $Result
}).totalmilliseconds
1078.3495

But there might be another advantage of using the ForEach command if you are only looking for one (or the first) occurrence: you can Break out of the loop once you have found the object and simply skip the rest of the array iteration. In other words, if the item appears at the end, there might not be much of a difference, but if it appears at the beginning you have a lot to win. To level this out, I have taken the average index (25000) for the search:

(Measure-Command {
    $Result = $Null
    ForEach ($Item in $Array) {
        If ($item.index -eq 25000) {$Result = $Item; Break}
    }
}).totalmilliseconds
138.029

Note that you can't use the Break statement with the ForEach-Object cmdlet or the ForEach method, see: How to exit from ForEach-Object in PowerShell
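A related option, not measured in this answer, is the Where method's optional mode argument: passing 'First' asks for the first match only. The sketch below uses made-up data and names; whether it saves time compared with the ForEach command and Break is something to measure on your own data.

```powershell
# Hypothetical stand-in data; the property names follow the question.
$items = @(1..1000).ForEach{ [PSCustomObject]@{ Index = $PSItem; Property1 = 'Hello!' } }

# 'First' limits the result of the Where method to the first matching object.
$firstMatch = $items.Where({ $_.Index -eq 250 }, 'First')
```

The result is still a collection, here holding a single object, so use $firstMatch[0] to get at the item itself.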

Purely looking at the tested commands and making a few assumptions, such as:

  • The input isn't a bottleneck (the $Array is already resident in memory)
  • The output isn't a bottleneck (the $Result isn't actually used)
  • You only need one (the first) occurrence
  • There is nothing else to do before, after or within the iteration

Using the ForEach command and simply comparing each Index property until you find the object appears to be the fastest way within the given/assumed boundaries of this question. But as stated at the beginning: to determine what is fastest for your use case, you should understand what you are doing and look at the whole solution, not just a part of it.
