Find the statistical mode(s) of a dataset in PowerShell
Question
This self-answered question is a follow-up to this question:
How can I determine a given dataset's (array's) statistical mode, i.e. the one value or the set of values that occur most frequently?
For instance, in the array 1, 2, 2, 3, 4, 4, 5 there are two modes, 2 and 4, because they are the values occurring most frequently.
Recommended answer
Use a combination of Group-Object, Sort-Object, and a do ... while loop:
# Sample dataset.
$dataset = 1, 2, 2, 3, 4, 4, 5
# Group the same numbers and sort the groups by member count, highest counts first.
$groups = $dataset | Group-Object | Sort-Object Count -Descending
# Output only the numbers represented by those groups that have
# the highest member count.
$i = 0
do { $groups[$i].Group[0] } while ($groups[++$i].Count -eq $groups[0].Count)
The above yields 2 and 4, which are the two modes (the values occurring most frequently, twice each in this case), sorted in ascending order (because Group-Object sorts by the grouping criterion and Sort-Object's sorting algorithm is stable).
Note: While this solution is conceptually straightforward, performance with large datasets may be a concern; see the bottom section for an optimization that is possible for certain inputs.
Explanation:
Group-Object groups all inputs by equality.
Sort-Object Count -Descending sorts the resulting groups by member count in descending order (most frequently occurring inputs first).
The do ... while statement loops over the sorted groups and outputs the input each group represents for as long as the group's member count, and therefore the input's occurrence count (frequency), equals the highest count, which is that of the first group.
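The same steps can also be wrapped in a reusable function. The following is a minimal sketch, not part of the original answer: the function name Get-Mode and its -Dataset parameter are hypothetical, and instead of indexing past the end of the sorted array it determines the highest count with Measure-Object and filters with Where-Object, which should also work under Set-StrictMode.

```powershell
# Hypothetical helper; the name Get-Mode is an assumption, not from the original answer.
function Get-Mode {
    param(
        [Parameter(Mandatory)]
        [object[]] $Dataset
    )
    # Group equal values; each group's Count is that value's frequency.
    $groups = @($Dataset | Group-Object)
    # Determine the highest frequency among all groups.
    $maxCount = ($groups | Measure-Object -Property Count -Maximum).Maximum
    # Emit the value represented by every group that has the highest frequency.
    $groups | Where-Object Count -eq $maxCount | ForEach-Object { $_.Group[0] }
}

# Usage with the sample dataset: outputs 2 and 4.
Get-Mode 1, 2, 2, 3, 4, 4, 5
```

Because Group-Object emits its groups sorted by the grouping value, the modes come out in ascending order without a separate sort.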
Better-performing solution for inputs composed of strings or numbers:
If the input elements are uniformly simple numbers or strings (as opposed to complex objects), an optimization is possible:
Group-Object's -NoElement switch suppresses collecting the individual inputs in each group.
Each group's .Name property reflects the grouping value, but does so as a string, so it must be converted back to its original data type.
# Sample dataset.
# Must be composed of all numbers or strings.
$dataset = 1, 2, 2, 3, 4, 4, 5
# Determine the data type of the elements of the dataset via its first element.
# All elements are assumed to be of the same type.
$type = $dataset[0].GetType()
# Group the same numbers and sort the groups by member count, highest counts first.
$groups = $dataset | Group-Object -NoElement | Sort-Object Count -Descending
# Output only the numbers represented by those groups that have
# the highest member count.
# -as $type converts the .Name string value back to the original type.
$i = 0
do { $groups[$i].Name -as $type } while ($groups[++$i].Count -eq $groups[0].Count)
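An alternative that avoids the string round trip altogether is to count occurrences in a hashtable, whose keys keep the original data type. This is a different technique from the Group-Object approach above, offered only as a sketch: the variable names are illustrative, and no benchmark is claimed here, so whether it is actually faster for a given dataset would have to be measured.

```powershell
# Sample dataset; the elements keep their original type as hashtable keys.
$dataset = 1, 2, 2, 3, 4, 4, 5

# Count the occurrences of each value in a single pass.
# A missing key yields $null, and 1 + $null evaluates to 1.
$counts = @{}
foreach ($item in $dataset) {
    $counts[$item] = 1 + $counts[$item]
}

# Find the highest occurrence count.
# GetEnumerator() is used so that a key named like a hashtable property
# (e.g. the string 'Values') cannot shadow that property.
$maxCount = 0
foreach ($entry in $counts.GetEnumerator()) {
    if ($entry.Value -gt $maxCount) { $maxCount = $entry.Value }
}

# Collect the keys that have the highest count; hashtable enumeration order
# is unspecified, so sort explicitly to get ascending order.
$modes = $counts.GetEnumerator() |
    Where-Object { $_.Value -eq $maxCount } |
    ForEach-Object { $_.Key } |
    Sort-Object

$modes
```

Like Group-Object by default, a hashtable literal (@{}) treats string keys case-insensitively, so 'a' and 'A' are counted as the same value.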