Find the statistical mode(s) of a dataset in PowerShell

Question

This self-answered question is a follow-up to this question:

How can I determine a given dataset's (array's) statistical mode, i.e. the one value or the set of values that occur most frequently?

For instance, in array 1, 2, 2, 3, 4, 4, 5 there are two modes, 2 and 4, because they are the values occurring most frequently.

Recommended answer

Use a combination of Group-Object, Sort-Object, and a do ... while loop:

# Sample dataset.
$dataset = 1, 2, 2, 3, 4, 4, 5

# Group the same numbers and sort the groups by member count, highest counts first.
$groups = $dataset | Group-Object | Sort-Object Count -Descending

# Output only the numbers represented by those groups that have 
# the highest member count.
$i = 0
do { $groups[$i].Group[0] } while ($groups[++$i].Count -eq $groups[0].Count)

The above yields 2 and 4, which are the two modes (values occurring most frequently, twice each in this case), sorted in ascending order (because Group-Object sorts by the grouping criterion and Sort-Object's sorting algorithm is stable).

Note: While this solution is conceptually straightforward, performance with large datasets may be a concern; see the bottom section for an optimization that is possible for certain inputs.

Explanation:

  • Group-Object groups all inputs by equality.

  • Sort-Object Count -Descending sorts the resulting groups by member count in descending order (most frequently occurring inputs first).

  • The do ... while statement loops over the sorted groups and outputs the input each group represents for as long as the group's member count, and therefore the occurrence count (frequency), equals the highest one, which is the first group's member count.
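The steps described above can also be wrapped in a reusable function. The following is only a sketch: the name Get-StatisticalMode and its -Dataset parameter are hypothetical and not part of the original answer, and a Where-Object filter stands in for the do ... while loop. It assumes PowerShell 3.0 or later.

```powershell
# Hypothetical helper; the function and parameter names are assumptions
# made for illustration, not part of the original answer.
function Get-StatisticalMode {
    param(
        [Parameter(Mandatory)]
        [object[]] $Dataset
    )

    # Group equal values and sort the groups by member count, highest first.
    # @(...) guarantees an array even when there is only a single group.
    $groups = @($Dataset | Group-Object | Sort-Object Count -Descending)

    # The first group has the highest count; keep every group that ties with it
    # and output the value that each of those groups represents.
    $maxCount = $groups[0].Count
    $groups | Where-Object Count -eq $maxCount | ForEach-Object { $_.Group[0] }
}

# Sample call with the dataset used in this answer; outputs 2 and 4.
Get-StatisticalMode 1, 2, 2, 3, 4, 4, 5
```

Filtering on the highest count makes the intent explicit and avoids indexing past the end of the array, at the cost of visiting every group rather than stopping at the first non-matching one.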

Better-performing solution for strings and numbers:

If the input elements are uniformly simple numbers or strings (as opposed to complex objects), an optimization is possible:

  • Group-Object's -NoElement switch suppresses collecting the individual inputs in each group.

  • Each group's .Name property reflects the grouping value, but as a string, so it must be converted back to its original data type.
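To make the string conversion concrete, here is a small sketch (the variable names are chosen for illustration only) that inspects the type of .Name for an integer input and restores the original type with -as:

```powershell
# Take the first group produced for a small integer dataset.
$group = 1, 2, 2 | Group-Object -NoElement | Select-Object -First 1

# .Name holds the grouping value as a string, even though the input was an integer.
$nameType = $group.Name.GetType().Name      # String

# -as converts the string back to the requested type.
$restored = $group.Name -as [int]
$restoredType = $restored.GetType().Name    # Int32

$nameType
$restoredType
```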

# Sample dataset.
# Must be composed of all numbers or strings.
$dataset = 1, 2, 2, 3, 4, 4, 5

# Determine the data type of the elements of the dataset via its first element.
# All elements are assumed to be of the same type.
$type = $dataset[0].GetType()

# Group the same numbers and sort the groups by member count, highest counts first.
$groups = $dataset | Group-Object -NoElement | Sort-Object Count -Descending

# Output only the numbers represented by those groups that have 
# the highest member count.
# -as $type converts the .Name string value back to the original type.
$i = 0
do { $groups[$i].Name -as $type } while ($groups[++$i].Count -eq $groups[0].Count)
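
The same approach applies to a dataset of strings. The sketch below uses made-up sample values and collects the result in a variable named $modes, which is an addition for illustration; note that Group-Object compares strings case-insensitively by default.

```powershell
# Hypothetical sample dataset of strings.
$dataset = 'apple', 'pear', 'apple', 'fig', 'pear'

# All elements are assumed to be of the same type as the first one.
$type = $dataset[0].GetType()

# Group without collecting members, then sort by count, highest first.
# @(...) guarantees an array even when there is only a single group.
$groups = @($dataset | Group-Object -NoElement | Sort-Object Count -Descending)

# Collect the values of all groups that share the highest count.
$i = 0
$modes = do { $groups[$i].Name -as $type } while ($groups[++$i].Count -eq $groups[0].Count)

$modes   # apple and pear, each occurring twice
```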
